1 code implementation • 13 Mar 2024 • Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish
In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks.
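A minimal sketch of the replay component described above, assuming simple list-like datasets and an illustrative 5% replay ratio (the paper evaluates several ratios); LR re-warming and re-decaying amount to running a fresh warmup-then-decay schedule on the new data, as sketched under the next entry.

```python
import random

def mixed_batch(downstream, upstream, batch_size=16, replay_fraction=0.05):
    """Replay: fill a small share of every continual pre-training batch with
    examples from the previously seen (upstream) data, the rest with new data.
    The 5% ratio is illustrative, not the paper's exact setting."""
    n_replay = max(1, int(round(replay_fraction * batch_size)))
    batch = random.sample(upstream, n_replay) + random.sample(downstream, batch_size - n_replay)
    random.shuffle(batch)
    return batch
```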
2 code implementations • 8 Aug 2023 • Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, Timothée Lesort
We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule.
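A minimal sketch of the linear-warmup, cosine-decay schedule referred to here; the step counts and learning-rate values below are placeholders, not the paper's exact hyperparameters.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: 1% of steps spent on warmup (illustrative values)
schedule = [warmup_cosine_lr(s, total_steps=10_000, warmup_steps=100, peak_lr=3e-4)
            for s in range(10_000)]
```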
1 code implementation • 1 Jun 2023 • Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, Marc Aubreville
This highly compressed representation of an image provides much more detailed guidance than latent representations of language, and this significantly reduces the computational requirements needed to achieve state-of-the-art results.
no code implementations • 26 Nov 2022 • Mats L. Richter, Christopher Pal
By further developing and formalizing the analysis of receptive field expansion in convolutional neural networks, we can predict unproductive layers in an automated manner before ever training a model.
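A rough sketch of that idea using the standard receptive-field recursion: once a layer's receptive field already covers the whole input, later layers become candidates for being unproductive. The layer stack and the flagging criterion below are illustrative, not the paper's exact analysis.

```python
def receptive_fields(layers, input_size):
    """Track the receptive field per layer and flag layers whose receptive field
    already exceeds the input resolution (a rough proxy for 'unproductive')."""
    rf, jump = 1, 1
    report = []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump   # standard receptive-field recursion
        jump *= stride
        report.append((name, rf, rf > input_size))
    return report

# Hypothetical VGG-like stack: (name, kernel_size, stride)
stack = [("conv1", 3, 1), ("pool1", 2, 2), ("conv2", 3, 1),
         ("pool2", 2, 2), ("conv3", 3, 1), ("conv4", 3, 1)]
for name, rf, beyond_input in receptive_fields(stack, input_size=16):
    print(f"{name}: receptive field {rf}x{rf}, beyond input: {beyond_input}")
```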
1 code implementation • 23 Jun 2021 • Mats L. Richter, Julius Schöning, Anna Wiedenroth, Ulf Krumnack
When optimizing convolutional neural networks (CNN) for a specific image-based task, specialists commonly overshoot the number of convolutional layers in their designs.
no code implementations • 17 Jun 2021 • Mats L. Richter, Leila Malihi, Anne-Kathrin Patricia Windler, Ulf Krumnack
In this work we explore the information processing inside neural networks using logistic regression probes and the saturation metric.
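A minimal sketch of a logistic regression probe, assuming pre-extracted layer activations and labels as NumPy arrays; this shows the probing setup in general, not the paper's exact training protocol.

```python
from sklearn.linear_model import LogisticRegression

def probe_accuracy(acts_train, y_train, acts_test, y_test):
    """Fit a linear probe on frozen layer activations and report test accuracy,
    a proxy for how much class information the layer exposes."""
    clf = LogisticRegression(max_iter=1000)
    # Flatten spatial dimensions in case the activations come from a conv layer.
    clf.fit(acts_train.reshape(len(acts_train), -1), y_train)
    return clf.score(acts_test.reshape(len(acts_test), -1), y_test)
```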
no code implementations • 2 Feb 2021 • Mats L. Richter, Wolf Byttner, Ulf Krumnack, Ludwig Schallner, Justin Shenk
Fully convolutional neural networks can process input of arbitrary size by applying a combination of downsampling and pooling.
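A minimal PyTorch sketch (illustrative layer sizes) of why this works: global pooling after the convolutional stack removes the remaining spatial dependence, so the output shape is independent of the input resolution.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),      # global pooling -> (N, 32, 1, 1) for any input size
    nn.Flatten(),
    nn.Linear(32, 10),
)

for size in (32, 64, 224):        # different input resolutions, same output shape
    print(net(torch.randn(1, 3, size, size)).shape)   # torch.Size([1, 10])
```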
2 code implementations • 15 Jun 2020 • Mats L. Richter, Justin Shenk, Wolf Byttner, Anders Arpteg, Mikael Huss
First, we show that a layer's output can be restricted to the eigenspace of its variance matrix without performance loss.
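A small NumPy sketch of that restriction, assuming a matrix of layer outputs with one sample per row: project the outputs onto the eigenvectors of their (co)variance matrix that explain 99% of the variance, then map back.

```python
import numpy as np

def project_to_eigenspace(acts, variance_explained=0.99):
    """Restrict layer outputs to the eigenspace of their variance matrix,
    keeping only the eigendirections needed to explain the given variance share."""
    mean = acts.mean(axis=0)
    centered = acts - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order
    k = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), variance_explained) + 1
    basis = eigvecs[:, :k]                               # top-k eigenspace
    return centered @ basis @ basis.T + mean, k
```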
1 code implementation • 19 Jul 2019 • Justin Shenk, Mats L. Richter, Anders Arpteg, Mikael Huss
We propose a metric, Layer Saturation, defined as the proportion of eigendirections needed to explain 99% of the variance of the latent representations, for analyzing the learned representations of neural network layers.
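A minimal NumPy sketch of that definition, assuming a matrix of layer activations with one sample per row; it reuses the same eigenvalue computation as the projection sketch above.

```python
import numpy as np

def layer_saturation(acts, threshold=0.99):
    """Saturation: fraction of the layer's eigendirections needed to explain
    `threshold` (here 99%) of the variance of its activations."""
    cov = np.cov(acts - acts.mean(axis=0), rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]          # descending eigenvalues
    k = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), threshold) + 1
    return k / acts.shape[1]
```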