Search Results for author: Max Vladymyrov

Found 14 papers, 5 papers with code

Linear Transformers are Versatile In-Context Learners

no code implementations • 21 Feb 2024 • Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step.
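The abstract above refers to linear attention. As a rough illustration only (not the paper's code; all names here are hypothetical), the sketch below shows a single linear self-attention layer: dropping the softmax, each token's update is built from accumulated key/value outer products applied to its query, which is the structural ingredient behind the "implicit gradient descent" reading of in-context learning.

import numpy as np

def linear_attention(tokens, W_q, W_k, W_v, W_p):
    """tokens: (n, d) context tokens; projection matrices are (d, d)."""
    Q, K, V = tokens @ W_q.T, tokens @ W_k.T, tokens @ W_v.T
    # Unnormalized linear attention: scores are raw dot products <q_i, k_j>
    # instead of softmax(<q_i, k_j>).
    A = Q @ K.T                      # (n, n) attention scores
    return tokens + (A @ V) @ W_p.T  # residual update of every token

rng = np.random.default_rng(0)
d, n = 4, 8
toks = rng.normal(size=(n, d))
W_q, W_k, W_v, W_p = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
print(linear_attention(toks, W_q, W_k, W_v, W_p).shape)  # (8, 4)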

Continual Few-Shot Learning Using HyperTransformers

no code implementations • 11 Jan 2023 • Max Vladymyrov, Andrey Zhmoginov, Mark Sandler

We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes.

Tasks: Class Incremental Learning, continual few-shot learning, +2

Training trajectories, mini-batch losses and the curious role of the learning rate

no code implementations • 5 Jan 2023 • Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Nolan Miller

In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging (SWA), we show that our proposed model matches the observed training trajectories on ImageNet.
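For context on the two averaging schemes named in the abstract, here is a minimal sketch (hypothetical names, not the paper's code) of EMA and SWA applied to a recorded trajectory of weight vectors.

import numpy as np

def ema(trajectory, decay=0.99):
    """Exponential Moving Average of a sequence of weight vectors."""
    avg = trajectory[0].copy()
    for w in trajectory[1:]:
        avg = decay * avg + (1.0 - decay) * w
    return avg

def swa(trajectory):
    """Stochastic Weight Averaging: an equal-weight average of the iterates."""
    return np.mean(trajectory, axis=0)

traj = [np.random.randn(10) for _ in range(100)]  # stand-in for SGD iterates
print(ema(traj).shape, swa(traj).shape)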

Transformers learn in-context by gradient descent

1 code implementation • 15 Dec 2022 • Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient-descent (GD) on a regression loss.

Tasks: In-Context Learning, Meta-Learning, +1
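The equivalence claimed in this entry's abstract can be checked numerically. The sketch below is a simplified rendering (not the paper's construction verbatim): starting from W = 0, one gradient-descent step on the in-context regression loss L(W) = 1/(2N) * sum_i ||W x_i - y_i||^2 predicts, for a query x_q, exactly the quantity (eta/N) * sum_i <x_i, x_q> y_i that a suitably parameterized linear self-attention layer produces for the query token.

import numpy as np

rng = np.random.default_rng(1)
N, d = 16, 3
X = rng.normal(size=(N, d))          # in-context inputs x_1..x_N
Y = rng.normal(size=(N, 1))          # in-context targets y_1..y_N
x_q = rng.normal(size=(d, 1))        # query input
eta = 0.1

# 1) One GD step on the regression loss from W = 0, then predict for x_q.
grad = -(Y.T @ X) / N                # dL/dW at W = 0 is -(1/N) sum_i y_i x_i^T
W_after = -eta * grad
y_gd = W_after @ x_q

# 2) Linear self-attention over context tokens: keys/queries read x, values
#    read y, so the query output is (eta/N) * sum_i <x_i, x_q> y_i.
y_attn = (eta / N) * (Y.T @ (X @ x_q))

print(np.allclose(y_gd, y_attn))     # True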

Decentralized Learning with Multi-Headed Distillation

no code implementations • CVPR 2023 • Andrey Zhmoginov, Mark Sandler, Nolan Miller, Gus Kristiansen, Max Vladymyrov

We study the effects of data and model architecture heterogeneity and the impact of the underlying communication graph topology on learning efficiency, and show that our agents can significantly improve their performance compared to learning in isolation.

Fine-tuning Image Transformers using Learnable Memory

1 code implementation • CVPR 2022 • Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Andrew Jackson

In this paper we propose augmenting Vision Transformer models with learnable memory tokens.
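As a hedged sketch of the general idea only (not the paper's implementation; class and parameter names are hypothetical), a small set of learnable memory tokens can be concatenated to the patch-token sequence fed to a Transformer encoder block; during fine-tuning one would train only the memory tokens (and, e.g., the head) while keeping the backbone frozen.

import torch
import torch.nn as nn

class MemoryAugmentedBlock(nn.Module):
    def __init__(self, dim=192, n_heads=3, n_memory=4):
        super().__init__()
        # Learnable memory tokens, shared across all inputs.
        self.memory = nn.Parameter(torch.randn(1, n_memory, dim) * 0.02)
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)

    def forward(self, x):                       # x: (batch, n_tokens, dim)
        mem = self.memory.expand(x.shape[0], -1, -1)
        out = self.block(torch.cat([x, mem], dim=1))
        return out[:, : x.shape[1]]             # drop the memory outputs

tokens = torch.randn(2, 197, 192)               # e.g. CLS + 196 patch tokens
print(MemoryAugmentedBlock()(tokens).shape)     # torch.Size([2, 197, 192])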

GradMax: Growing Neural Networks using Gradient Information

1 code implementation • ICLR 2022 • Utku Evci, Bart van Merriënboer, Thomas Unterthiner, Max Vladymyrov, Fabian Pedregosa

The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified.

HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning

1 code implementation • 11 Jan 2022 • Andrey Zhmoginov, Mark Sandler, Max Vladymyrov

In this work we propose a HyperTransformer, a Transformer-based model for supervised and semi-supervised few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples.

Tasks: Few-Shot Image Classification, Few-Shot Learning
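To make the weight-generation idea in this entry concrete, here is a heavily simplified sketch (not the paper's architecture; all names are hypothetical): a Transformer encodes the labeled support set, and its pooled output is projected into the kernel of a small convolutional layer that is then applied to query images.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyHyperTransformer(nn.Module):
    def __init__(self, emb=64, n_classes=5, out_ch=8):
        super().__init__()
        self.out_ch = out_ch
        self.img_enc = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, emb))
        self.label_emb = nn.Embedding(n_classes, emb)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=emb, nhead=4, batch_first=True),
            num_layers=2)
        # Project the pooled support encoding to a 3x3 conv kernel.
        self.to_weights = nn.Linear(emb, out_ch * 1 * 3 * 3)

    def forward(self, support_x, support_y, query_x):
        tokens = self.img_enc(support_x) + self.label_emb(support_y)  # (S, emb)
        pooled = self.encoder(tokens.unsqueeze(0)).mean(dim=1)        # (1, emb)
        kernel = self.to_weights(pooled).view(self.out_ch, 1, 3, 3)
        return F.conv2d(query_x, kernel, padding=1)  # features from generated CNN

support_x = torch.randn(25, 1, 28, 28)          # 5-way 5-shot support images
support_y = torch.arange(5).repeat_interleave(5)
query_x = torch.randn(10, 1, 28, 28)
print(TinyHyperTransformer()(support_x, support_y, query_x).shape)
# torch.Size([10, 8, 28, 28])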

HyperTransformer: Attention-Based CNN Model Generation from Few Samples

no code implementations • 29 Sep 2021 • Andrey Zhmoginov, Max Vladymyrov, Mark Sandler

In this work we propose a HyperTransformer, a Transformer-based model that generates all weights of a CNN model directly from the support samples.

Few-Shot Learning

Meta-Learning Bidirectional Update Rules

1 code implementation • 10 Apr 2021 • Mark Sandler, Max Vladymyrov, Andrey Zhmoginov, Nolan Miller, Andrew Jackson, Tom Madams, Blaise Aguera y Arcas

We show that classical gradient-based backpropagation in neural networks can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule.

Tasks: Meta-Learning
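The "two-state network" view in this entry's abstract can be illustrated with plain backpropagation (this is only an illustration of the special case the abstract mentions, not the paper's learned update rules): each neuron carries a forward state (its activation) and a backward state (its gradient signal), with the backward state obtained via the chain rule and weight updates formed from outer products of the two states.

import numpy as np

rng = np.random.default_rng(0)
x, target = rng.normal(size=(4, 1)), rng.normal(size=(2, 1))
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))

# Forward state: activations at each layer.
a1 = np.tanh(W1 @ x)
a2 = W2 @ a1

# Backward state: gradient signals at each layer (chain rule).
g2 = a2 - target                     # d(0.5*||a2 - target||^2)/d(a2)
g1 = (W2.T @ g2) * (1.0 - a1 ** 2)   # back through W2 and the tanh

# Weight updates are outer products of the two states.
lr = 0.1
W2 -= lr * g2 @ a1.T
W1 -= lr * g1 @ x.T
print(W1.shape, W2.shape)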

No Pressure! Addressing the Problem of Local Minima in Manifold Learning Algorithms

no code implementations • NeurIPS 2019 • Max Vladymyrov

However, due to a complicated nonconvex objective function, these methods can easily get stuck in local minima and their embedding quality can be poor.

A fast, universal algorithm to learn parametric nonlinear embeddings

no code implementations • NeurIPS 2015 • Miguel A. Carreira-Perpinan, Max Vladymyrov

This has two advantages: 1) The algorithm is universal in that a specific learning algorithm for any choice of embedding and mapping can be constructed by simply reusing existing algorithms for the embedding and for the mapping.

Tasks: Dimensionality Reduction
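The "reuse existing algorithms" idea in this entry's abstract can be pictured as an alternation. The sketch below is only a hedged, toy rendering under that assumption (not the paper's exact algorithm): step (a) reuses any embedding optimizer on free low-dimensional coordinates Z (here, a gradient step on a toy MDS-like stress), step (b) reuses any regression fit for the parametric mapping (here, ridge regression), and a quadratic penalty mu * ||Z - F(X)||^2 couples the two.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                  # high-dimensional data
Z = rng.normal(size=(100, 2)) * 0.1             # free low-dimensional coordinates
mu = 1.0

def embedding_step(Z, X, W, lr=0.001):
    """Stand-in for any embedding optimizer: one gradient step on a toy
    MDS-like stress, plus the penalty pulling Z toward the mapping F(X) = X W."""
    D_x = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    diff = Z[:, None] - Z[None, :]
    D_z = np.linalg.norm(diff, axis=-1) + 1e-9
    grad = (4.0 * ((D_z - D_x) / D_z)[:, :, None] * diff).sum(axis=1)
    grad += 2.0 * mu * (Z - X @ W)
    return Z - lr * grad

def mapping_step(X, Z, lam=1e-3):
    """Stand-in for any regression fit: ridge regression of Z on X."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Z)

W = mapping_step(X, Z)
for _ in range(10):                             # alternate the two reused steps
    Z = embedding_step(Z, X, W)
    W = mapping_step(X, Z)
print(Z.shape, W.shape)                         # (100, 2) (10, 2)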
