no code implementations • 8 Feb 2024 • Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, Sanjiv Kumar
RaPTr achieves better pre-training loss for BERT and UL2 language models while requiring 20-33% fewer FLOPs compared to standard training, and is competitive or better than other efficient training methods.
1 code implementation • 3 Jul 2023 • Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, Sanjeev Arora
In this work, we propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models (e.g., pre-trained language models) internally during inference.
no code implementations • 14 Mar 2023 • Haoyu Zhao, Abhishek Panigrahi, Rong Ge, Sanjeev Arora
We also show that the Inside-Outside algorithm is optimal for masked language modeling loss on the PCFG-generated data.
1 code implementation • 13 Feb 2023 • Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, Sanjeev Arora
Given the downstream task and a model fine-tuned on that task, a simple optimization is used to identify a very small subset of parameters ($\sim 0.01$% of model parameters) responsible for $>95$% of the model's performance, in the sense that grafting the fine-tuned values for just this tiny subset onto the pre-trained model gives performance almost as good as that of the fine-tuned model.
1 code implementation • 20 May 2022 • Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD.
no code implementations • 19 May 2022 • Sanjeev Arora, Zhiyuan Li, Abhishek Panigrahi
The current paper mathematically analyzes a new mechanism of implicit regularization in the Edge of Stability (EoS) phase, whereby GD updates due to a non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss.
no code implementations • NeurIPS 2021 • Abhishek Panigrahi, Navin Goyal
In contrast to the previous work that could only deal with functions of sequences that are sums of functions of individual tokens in the sequence, we allow general functions.
no code implementations • 21 Oct 2019 • Abhishek Panigrahi, Raghav Somani, Navin Goyal, Praneeth Netrapalli
What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in Neural Network training?
no code implementations • ICLR 2020 • Abhishek Panigrahi, Abhishek Shetty, Navin Goyal
In the present paper, we provide theoretical results about the effect of activation function on the training of highly overparametrized 2-layer neural networks.
no code implementations • ACL 2019 • Abhishek Panigrahi, Harsha Vardhan Simhadri, Chiranjib Bhattacharyya
We present an unsupervised method to generate Word2Sense word embeddings that are interpretable: each dimension of the embedding space corresponds to a fine-grained sense, and the non-negative value of the embedding along the j-th dimension represents the relevance of the j-th sense to the word.
1 code implementation • 10 Mar 2019 • Suman Kalyan Maity, Abhishek Panigrahi, Sayan Ghosh, Arundhati Banerjee, Pawan Goyal, Animesh Mukherjee
In this paper, we develop a content-cum-user based deep learning framework, DeepTagRec, to recommend appropriate question tags on Stack Overflow.
no code implementations • ICLR 2018 • Abhishek Panigrahi, Yueru Chen, C.-C. Jay Kuo
In this work, we conduct a mathematical analysis of the effect of batch normalization (BN) on gradient backpropagation in residual network training, which is believed to play a critical role in addressing the gradient vanishing/explosion problem.