no code implementations • 21 Feb 2024 • Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge
Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step.
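As a rough illustration of that claim, the sketch below runs the gradient-descent side of the correspondence explicitly: given an in-context regression prompt, each (hypothetical) linear-attention layer is read as one GD step on the prompt's least-squares loss. This is a minimal NumPy sketch; all names and hyperparameters are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, eta, n_layers = 4, 32, 0.2, 3

# In-context regression prompt: N labelled pairs (x_i, y_i).
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true

# The "algorithm in the forward pass": one GD step on the prompt's
# least-squares loss per (hypothetical) layer, starting from W = 0.
W = np.zeros(d)
for _ in range(n_layers):
    grad = (X @ W - y) @ X / N   # gradient of 1/(2N) * sum_i (W.x_i - y_i)^2
    W = W - eta * grad

x_q = rng.normal(size=d)
print("in-context prediction:", W @ x_q, "  target:", w_true @ x_q)
```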
1 code implementation • 22 Dec 2023 • Simon Schug, Seijin Kobayashi, Yassir Akram, Maciej Wołczyk, Alexandra Proca, Johannes von Oswald, Razvan Pascanu, João Sacramento, Angelika Steger
This allows us to relate the problem of compositional generalization to that of identifying the underlying modules.
no code implementations • 11 Sep 2023 • Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento
Transformers have become the dominant model in deep learning, but the reason for their superior performance is poorly understood.
no code implementations • 4 Sep 2023 • Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald, Maxime Larcher, Angelika Steger, João Sacramento
In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel, and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers.
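One reason an RNN can implement attention-based in-context learning at all: unnormalised linear self-attention has an exact recurrent form, with a state that accumulates key-value outer products. The sketch below verifies that identity numerically; it is not the paper's gated-RNN construction, just the underlying correspondence.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 16, 8
K = rng.normal(size=(T, d))   # keys
V = rng.normal(size=(T, d))   # values
Q = rng.normal(size=(T, d))   # queries

# Linear (unnormalised) attention computed in parallel, causally masked.
mask = np.tril(np.ones((T, T)))
out_attn = (mask * (Q @ K.T)) @ V

# The same computation as an RNN: the state accumulates key-value outer
# products, and the output is a linear readout of the state by the query.
S = np.zeros((d, d))
out_rnn = np.zeros((T, d))
for t in range(T):
    S = S + np.outer(K[t], V[t])  # recurrent state update
    out_rnn[t] = Q[t] @ S         # readout

print(np.allclose(out_attn, out_rnn))  # True
```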
1 code implementation • 15 Dec 2022 • Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov
We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient-descent (GD) on a regression loss.
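Below is a NumPy sketch of a simplified version of that construction, assuming the implicit regression model starts at zero and with signs arranged for readability: keys and queries read out the x-part of each token, values the y-part, and one pass through an unnormalised linear self-attention layer writes the one-step-GD prediction into the query token's y-channel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, eta = 5, 20, 0.1

# In-context linear regression prompt: pairs (x_i, y_i), plus a query x_q.
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true
x_q = rng.normal(size=d)

# --- One step of gradient descent on the in-context loss, from W = 0 ---
# L(W) = 1/(2N) * sum_i (W.x_i - y_i)^2  =>  one step gives W = (eta/N) * sum_i y_i x_i
W_gd = (eta / N) * (y[:, None] * X).sum(axis=0)       # shape (d,)
pred_gd = W_gd @ x_q

# --- The same prediction from one linear self-attention layer ---
# Tokens e_j = (x_j, y_j); the query token carries (x_q, 0).
E = np.vstack([np.hstack([X, y[:, None]]), np.hstack([x_q, 0.0])])

# Constructed weights: keys/queries read out the x-part, values the y-part.
W_KQ = np.hstack([np.eye(d), np.zeros((d, 1))])       # token -> x
W_V = np.zeros((d + 1, d + 1)); W_V[-1, -1] = 1.0     # token -> (0, y)

K, Q, V = E @ W_KQ.T, E @ W_KQ.T, E @ W_V.T
# Unnormalised (linear) attention, scaled by eta/N, added to the tokens.
E_out = E + (eta / N) * (Q @ K.T) @ V

pred_attn = E_out[-1, -1]  # y-channel of the query token after one layer
print(np.allclose(pred_gd, pred_attn))  # True
```

The final `allclose` check is the equivalence in miniature: the constructed attention layer and the explicit GD step produce the same prediction.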
no code implementations • 18 Oct 2022 • Seijin Kobayashi, Pau Vilimelis Aceituno, Johannes von Oswald
Identifying unfamiliar inputs, also known as out-of-distribution (OOD) detection, is a crucial property of any decision-making process.
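For a concrete picture of one common OOD signal, the sketch below scores inputs by predictive variance across an ensemble (here, random-feature regressors fit to the same data). This is a generic illustration of the ensemble-variance idea, not the paper's NTK-based analysis, and all settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Train a small ensemble on in-distribution data; use its predictive
# variance as an OOD score (high variance -> likely unfamiliar input).
X_train = rng.uniform(-1, 1, size=(64, 1))
y_train = np.sin(3 * X_train[:, 0]) + 0.05 * rng.normal(size=64)

def fit_random_features(X, y, n_feats=100, seed=0):
    r = np.random.default_rng(seed)
    W = r.normal(size=(X.shape[1], n_feats))
    b = r.uniform(0, 2 * np.pi, n_feats)
    Phi = np.cos(X @ W + b)                        # random-feature map
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return lambda Xn: np.cos(Xn @ W + b) @ coef

ensemble = [fit_random_features(X_train, y_train, seed=s) for s in range(10)]

x_in, x_out = np.array([[0.3]]), np.array([[4.0]])  # in- vs out-of-distribution
for name, x in [("in-dist", x_in), ("OOD", x_out)]:
    preds = np.array([f(x) for f in ensemble])
    print(name, "predictive std:", preds.std())
```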
1 code implementation • 15 Sep 2022 • Frederik Benzing, Simon Schug, Robert Meier, Johannes von Oswald, Yassir Akram, Nicolas Zucchet, Laurence Aitchison, Angelika Steger
Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether distinct SGD solutions differ in any meaningful way.
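"Functionally similar" can be made concrete by measuring prediction agreement on held-out inputs between two networks trained from different seeds. A toy version (full-batch gradient descent for brevity; task, architecture, and hyperparameters are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Train the same small network twice from different random initialisations
# and measure how often the two solutions agree on held-out predictions.
X = rng.normal(size=(512, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)          # XOR-like task
X_test = rng.normal(size=(512, 2))

def train_mlp(seed, hidden=32, lr=0.5, steps=2000):
    r = np.random.default_rng(seed)
    W1 = r.normal(size=(2, hidden)) * 0.5
    W2 = r.normal(size=hidden) * 0.5
    for _ in range(steps):
        H = np.tanh(X @ W1)
        p = 1 / (1 + np.exp(-(H @ W2)))            # sigmoid output
        err = p - y                                 # dL/dlogit for BCE loss
        W2 -= lr * H.T @ err / len(X)
        W1 -= lr * X.T @ ((err[:, None] * W2) * (1 - H**2)) / len(X)
    return lambda Xn: np.tanh(Xn @ W1) @ W2 > 0

f1, f2 = train_mlp(seed=0), train_mlp(seed=1)
agreement = (f1(X_test) == f2(X_test)).mean()
print("prediction agreement between the two solutions:", agreement)
```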
1 code implementation • 4 Jul 2022 • Alexander Meulemans, Nicolas Zucchet, Seijin Kobayashi, Johannes von Oswald, João Sacramento
As special cases, they include models of great current interest in both neuroscience and machine learning, such as deep neural networks, equilibrium recurrent neural networks, deep equilibrium models, and meta-learning.
1 code implementation • NeurIPS 2021 • Johannes von Oswald, Dominic Zhao, Seijin Kobayashi, Simon Schug, Massimo Caccia, Nicolas Zucchet, João Sacramento
We find that patterned sparsity emerges from this process, with the pattern of sparsity varying on a problem-by-problem basis.
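Mechanically, that sparsity lives in a per-parameter gate on the inner-loop update. The sketch below hard-codes such a mask purely to show the update rule; in the paper's setting the mask (equivalently, per-parameter learning rates) is meta-learned across tasks, and the sparsity pattern is what emerges.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
theta = rng.normal(size=d)                          # fast weights, adapted per task
mask = (rng.uniform(size=d) < 0.25).astype(float)   # sparse: ~25% of params learn

def inner_update(theta, grad, mask, lr=0.1):
    return theta - lr * mask * grad   # only unmasked parameters adapt

grad = rng.normal(size=d)
theta_new = inner_update(theta, grad, mask)
print("parameters changed:", int((theta_new != theta).sum()), "of", d)
```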
1 code implementation • 4 Apr 2021 • Nicolas Zucchet, Simon Schug, Johannes von Oswald, Dominic Zhao, João Sacramento
Humans and other animals are capable of improving their learning performance as they solve related tasks from a given problem domain, to the point of being able to learn from extremely limited data.
no code implementations • ICLR Workshop on Learning to Learn 2021 • Dominic Zhao, Nicolas Zucchet, João Sacramento, Johannes von Oswald
Finding neural network weights that generalize well from small datasets is difficult.
3 code implementations • NeurIPS 2021 • Christian Henning, Maria R. Cervera, Francesco D'Angelo, Johannes von Oswald, Regina Traber, Benjamin Ehret, Seijin Kobayashi, Benjamin F. Grewe, João Sacramento
We offer a practical deep learning implementation of our framework based on probabilistic task-conditioned hypernetworks, an approach we term posterior meta-replay.
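A minimal, deterministic sketch of a task-conditioned hypernetwork, under the simplest reading of the term: a shared network maps a per-task embedding to the full weight vector of a target network, so each task's solution is stored as one embedding. The paper's version is probabilistic (it parameterises task posteriors) and trained with a meta-replay objective; none of that is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(6)

emb_dim, hidden, d_in, d_out = 4, 16, 10, 2
n_target_params = d_in * d_out

# Hypernetwork weights are shared across tasks; only embeddings are per-task.
H1 = rng.normal(size=(emb_dim, hidden)) * 0.1
H2 = rng.normal(size=(hidden, n_target_params)) * 0.1
task_embeddings = rng.normal(size=(3, emb_dim))  # one embedding per task

def target_forward(x, task_id):
    w = np.tanh(task_embeddings[task_id] @ H1) @ H2   # generate weights
    W = w.reshape(d_in, d_out)                        # target net's weights
    return x @ W

x = rng.normal(size=d_in)
print(target_forward(x, task_id=0), target_forward(x, task_id=1))
```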
2 code implementations • ICLR 2021 • Johannes von Oswald, Seijin Kobayashi, Alexander Meulemans, Christian Henning, Benjamin F. Grewe, João Sacramento
The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD).
Ranked #70 on Image Classification on CIFAR-100 (using extra training data)
3 code implementations • ICLR 2021 • Benjamin Ehret, Christian Henning, Maria R. Cervera, Alexander Meulemans, Johannes von Oswald, Benjamin F. Grewe
Here, we provide the first comprehensive evaluation of established continual learning (CL) methods on a variety of sequential data benchmarks.
7 code implementations • ICLR 2020 • Johannes von Oswald, Christian Henning, Benjamin F. Grewe, João Sacramento
Artificial neural networks suffer from catastrophic forgetting when they are sequentially trained on multiple tasks.
Ranked #4 on Continual Learning on F-CelebA (10 tasks)
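For concreteness, catastrophic forgetting is easy to reproduce in a toy: train a linear classifier on task A, then on a task B whose labels conflict with it, and task A accuracy collapses. This is purely illustrative of the failure mode the entry above addresses, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_task(mean):
    X = np.vstack([rng.normal(mean, 0.5, (100, 2)),
                   rng.normal(-mean, 0.5, (100, 2))])
    y = np.array([1.0] * 100 + [0.0] * 100)
    return X, y

def accuracy(w, X, y):
    return ((X @ w > 0) == (y > 0.5)).mean()

def train(w, X, y, lr=0.1, steps=500):
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w = w - lr * X.T @ (p - y) / len(X)   # logistic-regression GD step
    return w

task_a = make_task(np.array([2.0, 0.0]))
# Task B reverses the labelling along the same feature axis, so learning
# it overwrites what was learned on task A.
task_b = make_task(np.array([-2.0, 0.0]))

w = np.zeros(2)
w = train(w, *task_a)
print("task A acc after training on A:", accuracy(w, *task_a))
w = train(w, *task_b)
print("task A acc after training on B:", accuracy(w, *task_a))  # degraded
```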