Search Results for author: Johannes von Oswald

Found 15 papers, 10 papers with code

Linear Transformers are Versatile In-Context Learners

no code implementations · 21 Feb 2024 · Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step.
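The sketch below (plain NumPy; weight names and shapes are illustrative, not taken from the paper) shows what a "linear attention" layer is computationally: self-attention with raw key-query dot products instead of a softmax, so the update each token receives is linear in the context.

```python
import numpy as np

def linear_self_attention(E, W_q, W_k, W_v, W_p):
    """Softmax-free ("linear") self-attention over a token matrix E of shape (N, d).

    Every token attends to the context with raw dot-product scores,
    so the update each token receives is linear in the context.
    """
    Q = E @ W_q.T            # queries, (N, d)
    K = E @ W_k.T            # keys,    (N, d)
    V = E @ W_v.T            # values,  (N, d)
    A = Q @ K.T              # unnormalised attention scores, (N, N)
    return E + (A @ V) @ W_p.T / E.shape[0]   # residual update, 1/N scaling

# toy usage with random projection matrices
rng = np.random.default_rng(0)
N, d = 10, 6
E = rng.normal(size=(N, d))
W_q, W_k, W_v, W_p = (0.1 * rng.normal(size=(d, d)) for _ in range(4))
print(linear_self_attention(E, W_q, W_k, W_v, W_p).shape)   # (10, 6)
```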

Gated recurrent neural networks discover attention

no code implementations · 4 Sep 2023 · Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald, Maxime Larcher, Angelika Steger, João Sacramento

In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers.

In-Context Learning
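A minimal sketch of the attention-RNN correspondence behind this finding, under standard assumptions rather than the paper's exact construction: causal linear attention can be computed by a recurrence whose matrix-valued state accumulates key-value outer products, which is the kind of update a gated linear RNN can realise when its gates stay open.

```python
import numpy as np

def linear_attention_recurrent(K, V, Q):
    """Causal linear attention written as a recurrence.

    The matrix-valued state S_t accumulates outer products v_t k_t^T and the
    output at step t is S_t q_t; a gated linear RNN reproduces this update
    when its multiplicative gates stay at 1.
    """
    S = np.zeros((V.shape[1], K.shape[1]))
    outputs = []
    for k_t, v_t, q_t in zip(K, V, Q):
        S = S + np.outer(v_t, k_t)      # additive state update (gate = 1)
        outputs.append(S @ q_t)
    return np.stack(outputs)

# check against the batched causal form: o_t = sum_{s <= t} v_s (k_s . q_t)
rng = np.random.default_rng(1)
K, V, Q = (rng.normal(size=(5, 3)) for _ in range(3))
rec = linear_attention_recurrent(K, V, Q)
ref = np.stack([sum(V[s] * (K[s] @ Q[t]) for s in range(t + 1)) for t in range(5)])
assert np.allclose(rec, ref)
print(rec.shape)   # (5, 3)
```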

Transformers learn in-context by gradient descent

1 code implementation · 15 Dec 2022 · Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient-descent (GD) on a regression loss.

In-Context Learning · Meta-Learning · +1
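A numerical sketch of the stated equivalence in a simplified setting (initial regression weights W_0 = 0, learning rate folded into the value projection; variable names are illustrative): one gradient-descent step on a squared-error regression loss over the context produces the same query prediction as a single softmax-free (linear) self-attention readout.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d_x, d_y, eta = 20, 4, 2, 0.1

X = rng.normal(size=(N, d_x))     # in-context inputs x_j
Y = rng.normal(size=(N, d_y))     # in-context targets y_j
x_q = rng.normal(size=d_x)        # query input

# 1) One GD step on L(W) = 1/(2N) sum_j ||W x_j - y_j||^2 starting from W_0 = 0:
#    the gradient is -(1/N) sum_j y_j x_j^T, so the updated prediction is
W_step = (eta / N) * Y.T @ X
gd_pred = W_step @ x_q

# 2) A single linear self-attention readout at the query token, with keys and
#    queries given by the inputs, values by eta-scaled targets, no softmax:
scores = X @ x_q                         # raw dot products k_j . q, shape (N,)
attn_pred = (eta / N) * (Y.T @ scores)   # sum_j v_j (k_j . q)

assert np.allclose(gd_pred, attn_pred)
print(gd_pred, attn_pred)
```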

Disentangling the Predictive Variance of Deep Ensembles through the Neural Tangent Kernel

no code implementations · 18 Oct 2022 · Seijin Kobayashi, Pau Vilimelis Aceituno, Johannes von Oswald

Identifying unfamiliar inputs, also known as out-of-distribution (OOD) detection, is a crucial property of any decision making process.

Decision Making · Inductive Bias · +1
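A hedged toy sketch of the generic mechanism the abstract builds on, with stand-in linear models as ensemble members (the paper's NTK-based analysis is not reproduced here): score an input by the predictive variance across independently initialised ensemble members and flag high-disagreement inputs as out-of-distribution.

```python
import numpy as np

def ensemble_variance_score(models, x):
    """OOD score = variance of predictions across ensemble members.

    `models` is a list of prediction functions; inputs on which the members
    disagree strongly get a high score and can be flagged as OOD.
    """
    preds = np.stack([m(x) for m in models])   # (n_members, output_dim)
    return preds.var(axis=0).mean()

# toy ensemble: linear models with independently drawn weights
rng = np.random.default_rng(3)
members = [(lambda W: (lambda x: W @ x))(rng.normal(size=(2, 5))) for _ in range(8)]

x_near = 0.1 * np.ones(5)    # small-norm input: members roughly agree
x_far = 10.0 * np.ones(5)    # far-away input: members disagree strongly
print(ensemble_variance_score(members, x_near))
print(ensemble_variance_score(members, x_far))
```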

Random initialisations performing above chance and how to find them

1 code implementation · 15 Sep 2022 · Frederik Benzing, Simon Schug, Robert Meier, Johannes von Oswald, Yassir Akram, Nicolas Zucchet, Laurence Aitchison, Angelika Steger

Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions.

The least-control principle for local learning at equilibrium

1 code implementation · 4 Jul 2022 · Alexander Meulemans, Nicolas Zucchet, Seijin Kobayashi, Johannes von Oswald, João Sacramento

As special cases, they include models of great current interest in both neuroscience and machine learning, such as deep neural networks, equilibrium recurrent neural networks, deep equilibrium models, or meta-learning.

BIG-bench Machine Learning · Meta-Learning

A contrastive rule for meta-learning

1 code implementation · 4 Apr 2021 · Nicolas Zucchet, Simon Schug, Johannes von Oswald, Dominic Zhao, João Sacramento

Humans and other animals are capable of improving their learning performance as they solve related tasks from a given problem domain, to the point of being able to learn from extremely limited data.

Meta-Learning

Posterior Meta-Replay for Continual Learning

3 code implementations · NeurIPS 2021 · Christian Henning, Maria R. Cervera, Francesco D'Angelo, Johannes von Oswald, Regina Traber, Benjamin Ehret, Seijin Kobayashi, Benjamin F. Grewe, João Sacramento

We offer a practical deep learning implementation of our framework based on probabilistic task-conditioned hypernetworks, an approach we term posterior meta-replay.

Continual Learning
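A structural sketch of a task-conditioned hypernetwork, the building block mentioned in the abstract (plain NumPy; dimensions and names are illustrative assumptions): a small network maps a per-task embedding to the full parameter vector of a target network, so storing one embedding per task is enough to regenerate that task's weights.

```python
import numpy as np

rng = np.random.default_rng(4)

# target network: a single linear layer, 10 -> 3
d_in, d_out = 10, 3
n_target_params = d_in * d_out + d_out

# hypernetwork: task embedding (dim 8) -> all target-network parameters
d_emb, d_hidden = 8, 32
H1 = 0.1 * rng.normal(size=(d_hidden, d_emb))
H2 = 0.1 * rng.normal(size=(n_target_params, d_hidden))

def hypernetwork(task_embedding):
    """Generate the target network's weights from a task embedding."""
    theta = H2 @ np.tanh(H1 @ task_embedding)
    W = theta[: d_in * d_out].reshape(d_out, d_in)
    b = theta[d_in * d_out :]
    return W, b

def target_forward(x, task_embedding):
    W, b = hypernetwork(task_embedding)
    return W @ x + b

# one learned embedding per task; only the embeddings differ across tasks
task_embeddings = {task: rng.normal(size=d_emb) for task in ("task_A", "task_B")}
x = rng.normal(size=d_in)
print(target_forward(x, task_embeddings["task_A"]))
print(target_forward(x, task_embeddings["task_B"]))
```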

Neural networks with late-phase weights

2 code implementations · ICLR 2021 · Johannes von Oswald, Seijin Kobayashi, Alexander Meulemans, Christian Henning, Benjamin F. Grewe, João Sacramento

The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD).

Ranked #70 on Image Classification on CIFAR-100 (using extra training data)

Image Classification
