Search Results for author: Jason D. Lee

Found 131 papers, 25 papers with code

REBEL: Reinforcement Learning via Regressing Relative Rewards

no code implementations 25 Apr 2024 Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun

While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models.
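
The snippet above centers on PPO, whose core update is the clipped surrogate objective. A minimal NumPy sketch of that objective (function name and toy inputs are illustrative, not from the paper):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective in the style of PPO.

    ratio: pi_new(a|s) / pi_old(a|s) for each sampled action
    advantage: estimated advantage for each sampled action
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The elementwise minimum makes the objective pessimistic, so the
    # policy gains nothing from moving the ratio far outside [1-eps, 1+eps].
    return np.minimum(unclipped, clipped).mean()
```

With `eps=0.2`, a ratio of 2.0 on a positive advantage is credited only as 1.2, which is the mechanism that keeps updates conservative.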

Dataset Reset Policy Optimization for RLHF

2 code implementations 12 Apr 2024 Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun

Motivated by the fact that an offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution.
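
The dataset-reset idea described above can be sketched in a few lines. This is a hypothetical illustration (names, `reset_prob`, and the state representation are my own, not DR-PO's actual interface):

```python
import random

def sample_start_state(offline_states, initial_state_sampler,
                       reset_prob=0.5, rng=None):
    """With some probability, start the online rollout from a state seen in
    the offline preference dataset rather than from the initial state
    distribution -- the 'dataset reset' idea in spirit."""
    rng = rng or random.Random()
    if offline_states and rng.random() < reset_prob:
        # Reset to an informative state from the offline preference data.
        return rng.choice(offline_states)
    # Otherwise fall back to the environment's usual start distribution.
    return initial_state_sampler()
```

A real implementation would require a simulator that supports resetting to arbitrary states, which is the assumption the paper exploits.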

Reinforcement Learning (RL)

Horizon-Free Regret for Linear Markov Decision Processes

no code implementations 15 Mar 2024 Zihan Zhang, Jason D. Lee, Yuxin Chen, Simon S. Du

A recent line of work showed that regret bounds in reinforcement learning (RL) can be (nearly) independent of the planning horizon, a.k.a. horizon-free bounds.

Reinforcement Learning (RL)

Computational-Statistical Gaps in Gaussian Single-Index Models

no code implementations 8 Mar 2024 Alex Damian, Loucas Pillaud-Vivien, Jason D. Lee, Joan Bruna

Single-Index Models are high-dimensional regression problems with planted structure, whereby labels depend on an unknown one-dimensional projection of the input via a generic, non-linear, and potentially non-deterministic transformation.

How Well Can Transformers Emulate In-context Newton's Method?

no code implementations 5 Mar 2024 Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee

Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second order ones for the case of linear regression.

In-Context Learning regression

How Transformers Learn Causal Structure with Gradient Descent

no code implementations 22 Feb 2024 Eshaan Nichani, Alex Damian, Jason D. Lee

The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens.

In-Context Learning

LoRA Training in the NTK Regime has No Spurious Local Minima

1 code implementation 19 Feb 2024 Uijeong Jang, Jason D. Lee, Ernest K. Ryu

Low-rank adaptation (LoRA) has become the standard approach for parameter-efficient fine-tuning of large language models (LLM), but our theoretical understanding of LoRA has been limited.
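
LoRA, named in the snippet above, freezes the pretrained weight and learns only a low-rank correction. A hedged NumPy sketch of the forward pass (shapes, names, and `alpha` are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA-style forward pass: W (out x in) stays frozen; only the
    low-rank factors A (r x in) and B (out x r) would be trained.
    The effective weight is W + alpha * B @ A, whose update has rank <= r."""
    return x @ (W + alpha * B @ A).T
```

When `B` is initialized to zero (the usual convention), the adapted model starts out identical to the frozen base model.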

Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

1 code implementation 18 Feb 2024 Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen

In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers like SGD and Adam has become standard.

Benchmarking

BitDelta: Your Fine-Tune May Only Be Worth One Bit

1 code implementation 15 Feb 2024 James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks.
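
BitDelta's premise is that the fine-tune delta (fine-tuned minus base weights) is highly compressible. A simplified sketch of the 1-bit-plus-scale idea (the paper additionally calibrates scales by distillation; here the scale is just the mean absolute delta, which is the L2-optimal per-sign scale):

```python
import numpy as np

def bitdelta_compress(w_base, w_finetuned):
    """Keep only the sign of the fine-tune delta (1 bit per weight in a
    real implementation) plus a single scale for the whole tensor."""
    delta = w_finetuned - w_base
    scale = np.abs(delta).mean()
    signs = np.sign(delta)
    return signs, scale

def bitdelta_decompress(w_base, signs, scale):
    """Reconstruct an approximation of the fine-tuned weights."""
    return w_base + scale * signs
```

When the delta entries all share one magnitude, reconstruction is exact; otherwise the sign-plus-scale pair is an approximation whose error the paper's distillation step reduces.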

An Information-Theoretic Analysis of In-Context Learning

no code implementations 28 Jan 2024 Hong Jun Jeon, Jason D. Lee, Qi Lei, Benjamin Van Roy

Previous theoretical results pertaining to meta-learning on sequences build on contrived assumptions and are somewhat convoluted.

In-Context Learning Meta-Learning

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

1 code implementation 19 Jan 2024 Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao

We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration.

Towards Optimal Statistical Watermarking

no code implementations 13 Dec 2023 Baihe Huang, Hanlin Zhu, Banghua Zhu, Kannan Ramchandran, Michael I. Jordan, Jason D. Lee, Jiantao Jiao

Key to our formulation is a coupling of the output tokens and the rejection region, realized by pseudo-random generators in practice, that allows non-trivial trade-offs between the Type I error and Type II error.

Optimal Multi-Distribution Learning

no code implementations 8 Dec 2023 Zihan Zhang, Wenhao Zhan, Yuxin Chen, Simon S. Du, Jason D. Lee

Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$ (modulo some logarithmic factor), matching the best-known lower bound.

Fairness

Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

1 code implementation 30 Nov 2023 Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, Wei Hu

Recent work by Power et al. (2022) highlighted a surprising "grokking" phenomenon in learning arithmetic tasks: a neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy.

Learning Hierarchical Polynomials with Three-Layer Neural Networks

no code implementations 23 Nov 2023 ZiHao Wang, Eshaan Nichani, Jason D. Lee

Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time.

REST: Retrieval-Based Speculative Decoding

1 code implementation 14 Nov 2023 Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He

We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation.
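
REST drafts continuations by retrieval from a datastore rather than with a smaller draft model. A toy sketch of the retrieve-then-draft half of the loop (REST actually uses an exact-suffix-match datastore and verifies drafts with the LLM in one parallel pass; the dictionary index and names here are simplifications of mine):

```python
def build_suffix_index(corpus_tokens, context_len=2):
    """Toy retrieval datastore: map each length-2 context to the token
    that followed it in the corpus."""
    index = {}
    for i in range(len(corpus_tokens) - context_len):
        key = tuple(corpus_tokens[i:i + context_len])
        index.setdefault(key, corpus_tokens[i + context_len])
    return index

def retrieve_draft(index, prefix, max_draft=4, context_len=2):
    """Greedily extend the prefix with retrieved tokens; a verifier model
    would then accept a prefix of this draft, recovering lossless output."""
    draft = []
    context = list(prefix[-context_len:])
    for _ in range(max_draft):
        nxt = index.get(tuple(context))
        if nxt is None:
            break
        draft.append(nxt)
        context = context[1:] + [nxt]
    return draft
```

Because drafting is a lookup, it adds almost no compute; all model FLOPs go into the single verification pass over the drafted tokens.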

Language Modelling Retrieval +1

Settling the Sample Complexity of Online Reinforcement Learning

no code implementations 25 Jul 2023 Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon S. Du

While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing enormous burn-in cost in order for their algorithms to operate optimally.

reinforcement-learning Reinforcement Learning (RL)

Teaching Arithmetic to Small Transformers

1 code implementation 7 Jul 2023 Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed.

Low-Rank Matrix Completion

Scaling In-Context Demonstrations with Structured Attention

no code implementations 5 Jul 2023 Tianle Cai, Kaixuan Huang, Jason D. Lee, Mengdi Wang

However, their capabilities of in-context learning are limited by the model architecture: 1) the use of demonstrations is constrained by a maximum sentence length due to positional embeddings; 2) the quadratic complexity of attention hinders users from using more demonstrations efficiently; 3) LLMs are shown to be sensitive to the order of the demonstrations.

In-Context Learning Sentence

Sample Complexity for Quadratic Bandits: Hessian Dependent Bounds and Optimal Algorithms

no code implementations NeurIPS 2023 Qian Yu, Yining Wang, Baihe Huang, Qi Lei, Jason D. Lee

We consider a fundamental setting in which the objective function is quadratic, and provide the first tight characterization of the optimal Hessian-dependent sample complexity.

Solving Robust MDPs through No-Regret Dynamics

no code implementations 30 May 2023 Etash Kumar Guha, Jason D. Lee

Reinforcement Learning is a powerful framework for training agents to navigate different situations, but it is susceptible to changes in environmental dynamics.

Navigate Policy Gradient Methods

Provable Reward-Agnostic Preference-Based Reinforcement Learning

no code implementations 29 May 2023 Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals.

reinforcement-learning

Reward Collapse in Aligning Large Language Models

1 code implementation 28 May 2023 Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su

This insight allows us to derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic regime.

Fine-Tuning Language Models with Just Forward Passes

2 code implementations NeurIPS 2023 Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory.
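
The paper's memory savings come from estimating gradients with forward passes only. A sketch of one zeroth-order (SPSA-style) step in that spirit, with NumPy (the hyperparameters and the dense perturbation are illustrative; MeZO additionally regenerates the perturbation from a seed to avoid storing it):

```python
import numpy as np

def zo_step(params, loss_fn, lr=0.02, eps=1e-3, seed=0):
    """One zeroth-order step: two forward passes along a shared random
    direction z estimate the directional derivative, so no backpropagation
    (and no activation memory) is needed."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)
    # Finite-difference estimate of grad(loss) . z
    proj_grad = (loss_fn(params + eps * z) - loss_fn(params - eps * z)) / (2 * eps)
    return params - lr * proj_grad * z
```

On a toy quadratic, repeating this step with fresh directions drives the loss toward zero, which is the basic reason forward-only fine-tuning can work at all.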

In-Context Learning Multiple-choice

Provable Offline Preference-Based Reinforcement Learning

no code implementations 24 May 2023 Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE.

reinforcement-learning

Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning

1 code implementation 8 May 2023 Yulai Zhao, Zhuoran Yang, Zhaoran Wang, Jason D. Lee

Motivated by the observation, we present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.

Multi-agent Reinforcement Learning +1

Can We Find Nash Equilibria at a Linear Rate in Markov Games?

no code implementations 3 Mar 2023 Zhuoqing Song, Jason D. Lee, Zhuoran Yang

Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game.

Provably Efficient Reinforcement Learning via Surprise Bound

no code implementations 22 Feb 2023 Hanlin Zhu, Ruosong Wang, Jason D. Lee

Value function approximation is important in modern reinforcement learning (RL) problems especially when the state space is (infinitely) large.

reinforcement-learning Reinforcement Learning (RL)

Efficient displacement convex optimization with particle gradient descent

no code implementations 9 Feb 2023 Hadi Daneshmand, Jason D. Lee, Chi Jin

Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures.

Looped Transformers as Programmable Computers

1 code implementation 30 Jan 2023 Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos

We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop.

In-Context Learning

Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing

no code implementations 27 Jan 2023 Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S. Du, Jason D. Lee

It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models.

Incremental Learning

Reconstructing Training Data from Model Gradient, Provably

no code implementations 7 Dec 2022 Zihan Wang, Jason D. Lee, Qi Lei

Understanding when and how much a model gradient leaks information about the training sample is an important question in privacy.

Federated Learning Tensor Decomposition

From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent

no code implementations 13 Oct 2022 Satyen Kale, Jason D. Lee, Chris De Sa, Ayush Sekhari, Karthik Sridharan

When these potentials further satisfy certain self-bounding properties, we show that they can be used to provide a convergence guarantee for Gradient Descent (GD) and SGD (even when the paths of GF and GD/SGD are quite far apart).

Retrieval

Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability

1 code implementation 30 Sep 2022 Alex Damian, Eshaan Nichani, Jason D. Lee

Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions.

PAC Reinforcement Learning for Predictive State Representations

no code implementations 12 Jul 2022 Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

We show that given a realizable model class, the sample complexity of learning the near optimal policy only scales polynomially with respect to the statistical complexity of the model class, without any explicit polynomial dependence on the size of the state and observation spaces.

reinforcement-learning Reinforcement Learning (RL)

Neural Networks can Learn Representations with Gradient Descent

no code implementations 30 Jun 2022 Alex Damian, Jason D. Lee, Mahdi Soltanolkotabi

Furthermore, in a transfer learning setup where the data distributions in the source and target domain share the same representation $U$ but have different polynomial heads we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.

Transfer Learning

Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings

no code implementations 24 Jun 2022 Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun

We show our algorithm's computational and statistical complexities scale polynomially with respect to the horizon and the intrinsic dimension of the feature on the observation space.

Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials

1 code implementation 8 Jun 2022 Eshaan Nichani, Yu Bai, Jason D. Lee

Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own.

Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games

no code implementations 3 Jun 2022 Wenhao Zhan, Jason D. Lee, Zhuoran Yang

We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.

Decision Making

On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias

no code implementations 18 May 2022 Itay Safran, Gal Vardi, Jason D. Lee

We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting.

Binary Classification

Nearly Minimax Algorithms for Linear Bandits with Shared Representation

no code implementations 29 Mar 2022 Jiaqi Yang, Qi Lei, Jason D. Lee, Simon S. Du

We give novel algorithms for multi-task and lifelong linear bandits with shared representation.

Offline Reinforcement Learning with Realizability and Single-policy Concentrability

no code implementations 9 Feb 2022 Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, Jason D. Lee

Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability).

Offline RL reinforcement-learning +1

Optimization-Based Separations for Neural Networks

no code implementations 4 Dec 2021 Itay Safran, Jason D. Lee

Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities.

Provable Hierarchy-Based Meta-Reinforcement Learning

no code implementations 18 Oct 2021 Kurtland Chua, Qi Lei, Jason D. Lee

To address this gap, we analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure during meta-training for use in a downstream task.

Hierarchical Reinforcement Learning Learning Theory +4

Provable Regret Bounds for Deep Online Learning and Control

no code implementations 15 Oct 2021 Xinyi Chen, Edgar Minasyan, Jason D. Lee, Elad Hazan

The theory of deep learning focuses almost exclusively on supervised learning, non-convex optimization using stochastic gradient descent, and overparametrized neural networks.

Second-order methods

Towards General Function Approximation in Zero-Sum Markov Games

no code implementations ICLR 2022 Baihe Huang, Jason D. Lee, Zhaoran Wang, Zhuoran Yang

In the {coordinated} setting where both players are controlled by the agent, we propose a model-based algorithm and a model-free algorithm.

Going Beyond Linear RL: Sample Efficient Neural Function Approximation

no code implementations NeurIPS 2021 Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang

While the theory of RL has traditionally focused on linear function approximation (or eluder dimension) approaches, little is known about nonlinear RL with neural net approximations of the Q functions.

Reinforcement Learning (RL)

Optimal Gradient-based Algorithms for Non-concave Bandit Optimization

no code implementations NeurIPS 2021 Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang

This work considers a large family of bandit problems where the unknown underlying reward function is non-concave, including the low-rank generalized linear bandit problems and two-layer neural network with polynomial activation bandit problem.

A Short Note on the Relationship of Information Gain and Eluder Dimension

no code implementations 6 Jul 2021 Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei

Eluder dimension and information gain are two widely used methods of complexity measures in bandit and reinforcement learning.

reinforcement-learning +1

Near-Optimal Linear Regression under Distribution Shift

no code implementations 23 Jun 2021 Qi Lei, Wei Hu, Jason D. Lee

Transfer learning is essential when sufficient data is available in the source domain but labeled data in the target domain is scarce.

regression Transfer Learning

Label Noise SGD Provably Prefers Flat Global Minimizers

no code implementations NeurIPS 2021 Alex Damian, Tengyu Ma, Jason D. Lee

In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.

Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence

no code implementations 24 May 2021 Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, Yuejie Chi

These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer.

Reinforcement Learning (RL)

How Fine-Tuning Allows for Effective Meta-Learning

no code implementations NeurIPS 2021 Kurtland Chua, Qi Lei, Jason D. Lee

Representation learning has been widely studied in the context of meta-learning, enabling rapid learning of new tasks through shared representations.

Few-Shot Learning Representation Learning

Bilinear Classes: A Structural Framework for Provable Generalization in RL

no code implementations 19 Mar 2021 Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, Ruosong Wang

The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear $Q^*/V^*$ model in which both the optimal $Q$-function and the optimal $V$-function are linear in some known feature space.

MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning

no code implementations 23 Feb 2021 DiJia Su, Jason D. Lee, John M. Mulvey, H. Vincent Poor

We consider a setting that lies between pure offline reinforcement learning (RL) and pure online RL called deployment constrained RL in which the number of policy deployments for data sampling is limited.

Reinforcement Learning (RL) Uncertainty Quantification

A Theory of Label Propagation for Subpopulation Shift

no code implementations 22 Feb 2021 Tianle Cai, Ruiqi Gao, Jason D. Lee, Qi Lei

In this work, we propose a provably effective framework for domain adaptation based on label propagation.

Domain Adaptation Generalization Bounds

Provably Efficient Policy Optimization for Two-Player Zero-Sum Markov Games

no code implementations 17 Feb 2021 Yulai Zhao, Yuandong Tian, Jason D. Lee, Simon S. Du

Policy-based methods with function approximation are widely used for solving two-player zero-sum games with large state and/or action spaces.

Policy Gradient Methods

How to Characterize The Landscape of Overparameterized Convolutional Neural Networks

1 code implementation NeurIPS 2020 Yihong Gu, Weizhong Zhang, Cong Fang, Jason D. Lee, Tong Zhang

With the help of a new technique called {\it neural network grafting}, we demonstrate that even during the entire training process, feature distributions of differently initialized networks remain similar at each layer.

Agnostic $Q$-learning with Function Approximation in Deterministic Systems: Near-Optimal Bounds on Approximation Error and Sample Complexity

no code implementations NeurIPS 2020 Simon S. Du, Jason D. Lee, Gaurav Mahajan, Ruosong Wang

The current paper studies the problem of agnostic $Q$-learning with function approximation in deterministic systems where the optimal $Q$-function is approximable by a function in the class $\mathcal{F}$ with approximation error $\delta \ge 0$.

Q-Learning

Beyond Lazy Training for Over-parameterized Tensor Decomposition

no code implementations NeurIPS 2020 Xiang Wang, Chenwei Wu, Jason D. Lee, Tengyu Ma, Rong Ge

We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least $m = \Omega(d^{l-1})$, while a variant of gradient descent can find an approximate tensor when $m = O^*(r^{2.5l}\log d)$.

Tensor Decomposition

Impact of Representation Learning in Linear Bandits

no code implementations ICLR 2021 Jiaqi Yang, Wei Hu, Jason D. Lee, Simon S. Du

For the finite-action setting, we present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.

Representation Learning

How Important is the Train-Validation Split in Meta-Learning?

no code implementations 12 Oct 2020 Yu Bai, Minshuo Chen, Pan Zhou, Tuo Zhao, Jason D. Lee, Sham Kakade, Huan Wang, Caiming Xiong

A common practice in meta-learning is to perform a train-validation split (\emph{train-val method}) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split.

Meta-Learning

Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot

1 code implementation NeurIPS 2020 Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Li-Wei Wang, Jason D. Lee

In this paper, we conduct sanity checks for the above beliefs on several recent unstructured pruning methods and surprisingly find that: (1) A set of methods which aims to find good subnetworks of the randomly-initialized network (which we call "initial tickets"), hardly exploits any information from the training data; (2) For the pruned networks obtained by these methods, randomly changing the preserved weights in each layer, while keeping the total number of preserved weights unchanged per layer, does not affect the final performance.
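
The second sanity check above can be sketched concretely: rearrange which weights survive within each layer while keeping the per-layer count of preserved weights fixed. This NumPy sketch is an illustration in the paper's spirit, not their released code:

```python
import numpy as np

def shuffle_masks_per_layer(masks, seed=0):
    """Randomly permute each layer's pruning mask, preserving the number
    of kept weights per layer (a 'layerwise rearranged ticket').
    If final accuracy is unchanged under this transform, the mask's
    specific positions carried little information."""
    rng = np.random.default_rng(seed)
    shuffled = []
    for m in masks:
        flat = m.flatten().copy()
        rng.shuffle(flat)          # move the kept positions around
        shuffled.append(flat.reshape(m.shape))
    return shuffled
```

The invariant worth checking is exactly the one the paper controls for: per-layer sparsity is identical before and after shuffling.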

Network Pruning

Generalized Leverage Score Sampling for Neural Networks

no code implementations NeurIPS 2020 Jason D. Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, Zheng Yu

Leverage score sampling is a powerful technique that originates from theoretical computer science, which can be used to speed up a large number of fundamental questions, e.g., linear regression, linear programming, semi-definite programming, the cutting plane method, graph sparsification, maximum matching, and max-flow.

Learning Theory regression

Predicting What You Already Know Helps: Provable Self-Supervised Learning

no code implementations NeurIPS 2021 Jason D. Lee, Qi Lei, Nikunj Saunshi, Jiacheng Zhuo

Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks) without requiring labeled data to learn useful semantic representations.

Representation Learning Self-Supervised Learning

Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy

no code implementations NeurIPS 2020 Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D. Lee, Nathan Srebro, Daniel Soudry

We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks".

General Classification

Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks

no code implementations 3 Jul 2020 Cong Fang, Jason D. Lee, Pengkun Yang, Tong Zhang

This new representation overcomes the degenerate situation where all the hidden units essentially have only one meaningful hidden unit in each middle layer, and further leads to a simpler representation of DNNs, for which the training objective can be reformulated as a convex optimization problem via suitable re-parameterization.

Towards Understanding Hierarchical Learning: Benefits of Neural Representations

no code implementations NeurIPS 2020 Minshuo Chen, Yu Bai, Jason D. Lee, Tuo Zhao, Huan Wang, Caiming Xiong, Richard Socher

When the trainable network is the quadratic Taylor model of a wide two-layer network, we show that neural representation can achieve improved sample complexities compared with the raw input: For learning a low-rank degree-$p$ polynomial ($p \geq 4$) in $d$ dimension, neural representation requires only $\tilde{O}(d^{\lceil p/2 \rceil})$ samples, while the best-known sample complexity upper bound for the raw input is $\tilde{O}(d^{p-1})$.

Convergence of Meta-Learning with Task-Specific Adaptation over Partial Parameters

no code implementations NeurIPS 2020 Kaiyi Ji, Jason D. Lee, Yingbin Liang, H. Vincent Poor

Although model-agnostic meta-learning (MAML) is a very successful algorithm in meta-learning practice, it can have high computational cost because it updates all model parameters over both the inner loop of task-specific adaptation and the outer-loop of meta initialization training.

Meta-Learning

Shape Matters: Understanding the Implicit Bias of the Noise Covariance

1 code implementation 15 Jun 2020 Jeff Z. HaoChen, Colin Wei, Jason D. Lee, Tengyu Ma

We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms.
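
The object of study above, SGD with label noise, is easy to write down for a linear model with squared loss. A minimal sketch (hyperparameters are illustrative; nothing about implicit bias is claimed beyond the update rule itself):

```python
import numpy as np

def label_noise_sgd_step(w, x, y, lr=0.1, sigma=0.5, rng=None):
    """One SGD step on 0.5 * (w.x - y_noisy)^2, where the label is
    perturbed independently at every step -- the noise model whose
    implicit regularization the paper analyzes."""
    rng = rng or np.random.default_rng()
    noisy_y = y + sigma * rng.standard_normal()
    residual = float(w @ x) - noisy_y
    return w - lr * residual * x   # gradient of the noisy squared loss
```

Setting `sigma=0` recovers plain SGD on the clean label; the paper's point is that the `sigma > 0` trajectory is biased toward flatter (here, sparser) solutions.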

Distributed Estimation for Principal Component Analysis: an Enlarged Eigenspace Analysis

no code implementations 5 Apr 2020 Xi Chen, Jason D. Lee, He Li, Yun Yang

To abandon this eigengap assumption, we consider a new route in our analysis: instead of exactly identifying the top-$L$-dim eigenspace, we show that our estimator is able to cover the targeted top-$L$-dim population eigenspace.

Steepest Descent Neural Architecture Optimization: Escaping Local Optimum with Signed Neural Splitting

no code implementations 23 Mar 2020 Lemeng Wu, Mao Ye, Qi Lei, Jason D. Lee, Qiang Liu

Recently, Liu et al. [19] proposed a splitting steepest descent (S2D) method that jointly optimizes the neural parameters and architectures based on progressively growing network structures by splitting neurons into multiple copies in a steepest descent fashion.

Few-Shot Learning via Learning the Representation, Provably

no code implementations ICLR 2021 Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, Qi Lei

First, we study the setting where this common representation is low-dimensional and provide a fast rate of $O\left(\frac{\mathcal{C}\left(\Phi\right)}{n_1T} + \frac{k}{n_2}\right)$; here, $\Phi$ is the representation function class, $\mathcal{C}\left(\Phi\right)$ is its complexity measure, and $k$ is the dimension of the representation.

Few-Shot Learning Representation Learning

Kernel and Rich Regimes in Overparametrized Models

1 code implementation 20 Feb 2020 Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, Nathan Srebro

We provide a complete and detailed analysis for a family of simple depth-$D$ models that already exhibit an interesting and meaningful transition between the kernel and rich regimes, and we also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.

Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity

no code implementations 17 Feb 2020 Simon S. Du, Jason D. Lee, Gaurav Mahajan, Ruosong Wang

2) In conjunction with the lower bound in [Wen and Van Roy, NIPS 2013], our upper bound suggests that the sample complexity $\widetilde{\Theta}\left(\mathrm{dim}_E\right)$ is tight even in the agnostic setting.

Q-Learning

Neural Temporal-Difference Learning Converges to Global Optima

no code implementations NeurIPS 2019 Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang

Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning.

Q-Learning Reinforcement Learning (RL)

When Does Non-Orthogonal Tensor Decomposition Have No Spurious Local Minima?

no code implementations 22 Nov 2019 Maziar Sanjabi, Sina Baharlouei, Meisam Razaviyayn, Jason D. Lee

We study the optimization problem for decomposing $d$ dimensional fourth-order Tensors with $k$ non-orthogonal components.

Tensor Decomposition

SGD Learns One-Layer Networks in WGANs

no code implementations ICML 2020 Qi Lei, Jason D. Lee, Alexandros G. Dimakis, Constantinos Daskalakis

Generative adversarial networks (GANs) are a widely used framework for learning generative models.

Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks

no code implementations ICLR 2020 Yu Bai, Jason D. Lee

Recent theoretical work has established connections between over-parametrized neural networks and linearized models governed by Neural Tangent Kernels (NTKs).

Optimal transport mapping via input convex neural networks

2 code implementations ICML 2020 Ashok Vardhan Makkuva, Amirhossein Taghvaei, Sewoong Oh, Jason D. Lee

Building upon recent advances in the field of input convex neural networks, we propose a new framework where the gradient of one convex function represents the optimal transport mapping.
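
The framework's key structural fact is that an optimal transport map can be written as the gradient of a convex function (Brenier's theorem). For a convex quadratic potential the map is linear, which gives a tiny checkable illustration (the quadratic stand-in is mine; the paper parameterizes the convex potential with an input convex neural network instead):

```python
import numpy as np

def transport_map_from_convex_potential(A, b):
    """For the convex potential f(x) = 0.5 * x^T A x + b^T x (A PSD),
    the induced transport map is its gradient, T(x) = A x + b.
    ICNNs generalize this construction beyond quadratics."""
    def T(x):
        return A @ x + b
    return T
```

Learning then amounts to fitting the convex potential so that the pushforward of the source distribution under `T` matches the target distribution.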

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

no code implementations1 Aug 2019 Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces.

Policy Gradient Methods

Convergence of Adversarial Training in Overparametrized Neural Networks

no code implementations NeurIPS 2019 Ruiqi Gao, Tianle Cai, Haochuan Li, Li-Wei Wang, Cho-Jui Hsieh, Jason D. Lee

Neural networks are vulnerable to adversarial examples, i.e., inputs that are imperceptibly perturbed from natural data and yet incorrectly classified by the network.

Neural Temporal-Difference and Q-Learning Provably Converge to Global Optima

1 code implementation NeurIPS 2019 Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang

Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning.

Q-Learning

Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models

no code implementations17 May 2019 Mor Shpigel Nacson, Suriya Gunasekar, Jason D. Lee, Nathan Srebro, Daniel Soudry

With an eye toward understanding complexity control in deep learning, we study how infinitesimal regularization or gradient descent optimization lead to margin maximizing solutions in both homogeneous and non-homogeneous models, extending previous work that focused on infinitesimal regularization only in homogeneous models.

Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods

1 code implementation NeurIPS 2019 Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, Meisam Razaviyayn

In this paper, we study the problem in the non-convex regime and show that an $\varepsilon$-first-order stationary point of the game can be computed when one player's objective can be efficiently optimized to global optimality.

Provably Correct Automatic Sub-Differentiation for Qualified Programs

no code implementations NeurIPS 2018 Sham M. Kakade, Jason D. Lee

The Cheap Gradient Principle (Griewank, 2008), which states that the computational cost of computing a $d$-dimensional vector of partial derivatives of a scalar function is nearly the same (often within a factor of $5$) as that of simply computing the scalar function itself, is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions, which are subsequently used in black-box gradient-based optimization procedures.
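
The principle can be illustrated with a toy reverse-mode automatic differentiation sketch (an illustrative construction, not the paper's): a single backward sweep over the computation graph recovers all partial derivatives at roughly the cost of one forward evaluation.

```python
# Minimal reverse-mode autodiff sketch. Each Var records its parents and the
# local partial derivative along each edge; one backward sweep accumulates
# adjoints for every input simultaneously.
class Var:
    def __init__(self, value, parents=()):
        self.value = value      # forward value
        self.parents = parents  # (parent Var, local partial) pairs
        self.grad = 0.0         # accumulated adjoint

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(out):
    """Propagate adjoints from the output to every input (sums over all paths)."""
    stack = [(out, 1.0)]
    while stack:
        node, adjoint = stack.pop()
        node.grad += adjoint
        for parent, local in node.parents:
            stack.append((parent, local * adjoint))

x, y = Var(3.0), Var(4.0)
f = x * y + x          # f(x, y) = xy + x
backward(f)
print(x.grad, y.grad)  # df/dx = y + 1 = 5.0, df/dy = x = 3.0
```

The point of the paper is what happens when the program is only subdifferentiable; this sketch shows only the smooth base case.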

Gradient Descent Finds Global Minima of Deep Neural Networks

no code implementations9 Nov 2018 Simon S. Du, Jason D. Lee, Haochuan Li, Li-Wei Wang, Xiyu Zhai

Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex.

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

no code implementations NeurIPS 2019 Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma

We prove that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in polynomial iterations.

Provably Correct Automatic Subdifferentiation for Qualified Programs

no code implementations23 Sep 2018 Sham Kakade, Jason D. Lee

The Cheap Gradient Principle (Griewank, 2008), which states that the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a factor of $5$) as that of simply computing the function itself, is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions, which are subsequently used in black-box gradient-based optimization procedures.

Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced

no code implementations NeurIPS 2018 Simon S. Du, Wei Hu, Jason D. Lee

Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization.

Adding One Neuron Can Eliminate All Bad Local Minima

no code implementations NeurIPS 2018 Shiyu Liang, Ruoyu Sun, Jason D. Lee, R. Srikant

One of the main difficulties in analyzing neural networks is the non-convexity of the loss function which may have many bad local minima.

Binary Classification General Classification

Stochastic subgradient method converges on tame functions

1 code implementation20 Apr 2018 Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, Jason D. Lee

This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity?

Convergence of Gradient Descent on Separable Data

no code implementations5 Mar 2018 Mor Shpigel Nacson, Jason D. Lee, Suriya Gunasekar, Pedro H. P. Savarese, Nathan Srebro, Daniel Soudry

We show that for a large family of super-polynomial tailed losses, gradient descent iterates on linear networks of any depth converge in the direction of $L_2$ maximum-margin solution, while this does not hold for losses with heavier tails.

On the Power of Over-parametrization in Neural Networks with Quadratic Activation

1 code implementation ICML 2018 Simon S. Du, Jason D. Lee

We provide new theoretical insights on why over-parametrization is effective in learning neural networks.

On the Convergence and Robustness of Training GANs with Regularized Optimal Transport

no code implementations NeurIPS 2018 Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, Jason D. Lee

A popular GAN formulation is based on the use of Wasserstein distance as a metric between probability distributions.

Better Generalization by Efficient Trust Region Method

no code implementations ICLR 2018 Xuanqing Liu, Jason D. Lee, Cho-Jui Hsieh

Solving this subproblem is non-trivial: existing methods achieve only a sub-linear convergence rate.

No Spurious Local Minima in a Two Hidden Unit ReLU Network

no code implementations ICLR 2018 Chenwei Wu, Jiajun Luo, Jason D. Lee

Deep learning models can be efficiently optimized via stochastic gradient descent, but there is little theoretical evidence to support this.


Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

no code implementations ICML 2018 Simon S. Du, Jason D. Lee, Yuandong Tian, Barnabas Poczos, Aarti Singh

We consider the problem of learning a one-hidden-layer neural network with non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned.
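
The architecture in the excerpt can be written down directly. The sketch below (ReLU $\sigma$, illustrative inputs) evaluates $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$ over non-overlapping patches $\mathbf{Z}_j$:

```python
import numpy as np

def cnn_forward(Z, w, a):
    """f(Z, w, a) = sum_j a_j * relu(w^T Z_j).

    Z : (k, p) array whose rows are the k non-overlapping patches Z_j,
    w : (p,) shared convolutional filter,
    a : (k,) output weights.
    """
    return float(a @ np.maximum(Z @ w, 0.0))

Z = np.array([[1.0, 0.0], [0.0, -1.0]])  # two patches
w = np.array([1.0, 1.0])
a = np.array([1.0, 2.0])
print(cnn_forward(Z, w, a))  # relu([1, -1]) = [1, 0], so f = 1*1 + 2*0 = 1.0
```

Both `w` and `a` are the trainable parameters in the paper's setting; the patches `Z` are the (fixed) input.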

First-order Methods Almost Always Avoid Saddle Points

no code implementations20 Oct 2017 Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, Benjamin Recht

We establish that first-order methods avoid saddle points for almost all initializations.

When is a Convolutional Filter Easy To Learn?

no code implementations ICLR 2018 Simon S. Du, Jason D. Lee, Yuandong Tian

We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches.

An inexact subsampled proximal Newton-type method for large-scale machine learning

no code implementations28 Aug 2017 Xuanqing Liu, Cho-Jui Hsieh, Jason D. Lee, Yuekai Sun

We propose a fast proximal Newton-type algorithm for minimizing regularized finite sums that returns an $\epsilon$-suboptimal point in $\tilde{\mathcal{O}}(d(n + \sqrt{\kappa d})\log(\frac{1}{\epsilon}))$ FLOPS, where $n$ is number of samples, $d$ is feature dimension, and $\kappa$ is the condition number.

BIG-bench Machine Learning

Theoretical insights into the optimization landscape of over-parameterized shallow neural networks

no code implementations16 Jul 2017 Mahdi Soltanolkotabi, Adel Javanmard, Jason D. Lee

In this paper we study the problem of learning a shallow artificial neural network that best fits a training data set.

Gradient Descent Can Take Exponential Time to Escape Saddle Points

no code implementations NeurIPS 2017 Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, Aarti Singh

Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape.

A Flexible Framework for Hypothesis Testing in High-dimensions

no code implementations26 Apr 2017 Adel Javanmard, Jason D. Lee

By duality between hypotheses testing and confidence intervals, the proposed framework can be used to obtain valid confidence intervals for various functionals of the model parameters.

regression Two-sample testing +2

Statistical Inference for Model Parameters in Stochastic Gradient Descent

no code implementations27 Oct 2016 Xi Chen, Jason D. Lee, Xin T. Tong, Yichen Zhang

Second, for high-dimensional linear regression, using a variant of the SGD algorithm, we construct a debiased estimator of each regression coefficient that is asymptotically normal.

regression

Black-box Importance Sampling

no code implementations17 Oct 2016 Qiang Liu, Jason D. Lee

Importance sampling is widely used in machine learning and statistics, but its power is limited by the restriction of using simple proposals for which the importance weights can be tractably calculated.
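
The restriction the abstract refers to shows up in the classical estimator: the weights require evaluating both densities. A minimal sketch with a hypothetical Gaussian target and proposal (not the paper's black-box method, which removes exactly this requirement):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def snis(f, x, target_logpdf, proposal_logpdf):
    """Self-normalized importance sampling estimate of E_p[f(X)] from
    proposal draws x; needs tractable (unnormalized) density ratios."""
    logw = target_logpdf(x) - proposal_logpdf(x)
    w = np.exp(logw - logw.max())  # stabilized weights
    return float(np.sum(w * f(x)) / w.sum())

# Hypothetical example: estimate E[X] = 2 under N(2, 1) using N(0, 2) draws.
x = rng.normal(0.0, 2.0, size=20000)
est = snis(lambda t: t, x,
           lambda t: normal_logpdf(t, 2.0, 1.0),
           lambda t: normal_logpdf(t, 0.0, 2.0))
print(est)  # close to 2
```

When the proposal is complex enough that `proposal_logpdf` is intractable, this estimator is unusable, which is the gap the paper addresses.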

BIG-bench Machine Learning

Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data

no code implementations10 Oct 2016 Jialei Wang, Jason D. Lee, Mehrdad Mahdavi, Mladen Kolar, Nathan Srebro

Sketching techniques have become popular for scaling up machine learning algorithms by reducing the sample size or dimensionality of massive data sets, while still maintaining the statistical power of big data.

Communication-Efficient Distributed Statistical Inference

no code implementations25 May 2016 Michael I. Jordan, Jason D. Lee, Yun Yang

CSL provides a communication-efficient surrogate to the global likelihood that can be used for low-dimensional estimation, high-dimensional regularized estimation and Bayesian inference.

Bayesian Inference Computational Efficiency

Matrix Completion has No Spurious Local Minimum

no code implementations NeurIPS 2016 Rong Ge, Jason D. Lee, Tengyu Ma

Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems.

Collaborative Filtering Matrix Completion +1

Gradient Descent Converges to Minimizers

no code implementations16 Feb 2016 Jason D. Lee, Max Simchowitz, Michael I. Jordan, Benjamin Recht

We show that gradient descent converges to a local minimizer, almost surely with random initialization.

A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation

no code implementations10 Feb 2016 Qiang Liu, Jason D. Lee, Michael I. Jordan

We derive a new discrepancy statistic for measuring differences between two probability distributions based on combining Stein's identity with the reproducing kernel Hilbert space theory.
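
For a one-dimensional standard normal target and an RBF kernel, the resulting statistic has a closed form. The sketch below (a V-statistic version with an assumed fixed bandwidth, for illustration) needs only the score $\nabla_x \log p$ and samples from $q$:

```python
import numpy as np

rng = np.random.default_rng(0)

def ksd2(x, score, h=1.0):
    """Squared kernelized Stein discrepancy (V-statistic) between sample x
    and a density p with score function score(t) = d/dt log p(t), using an
    RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2))."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * h**2))
    kx = -d / h**2 * k                  # dk/dx
    ky = d / h**2 * k                   # dk/dy
    kxy = (1 / h**2 - d**2 / h**4) * k  # d^2 k / dx dy
    s = score(x)
    # Stein kernel: s(x) s(y) k + s(x) dk/dy + s(y) dk/dx + d^2k/dxdy
    u = s[:, None] * s[None, :] * k + s[:, None] * ky + s[None, :] * kx + kxy
    return float(u.mean())

score = lambda t: -t                 # score of the standard normal
good = rng.standard_normal(300)      # sample from the target
bad = good + 2.0                     # shifted sample, should be flagged
print(ksd2(good, score), ksd2(bad, score))
```

The statistic is near zero when the sample matches the target and grows with the mismatch, which is what makes it usable as a goodness-of-fit test statistic.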

Evaluating the statistical significance of biclusters

no code implementations NeurIPS 2015 Jason D. Lee, Yuekai Sun, Jonathan E. Taylor

Biclustering (also known as submatrix localization) is a problem of high practical relevance in exploratory analysis of high-dimensional data.

Learning Halfspaces and Neural Networks with Random Initialization

no code implementations25 Nov 2015 Yuchen Zhang, Jason D. Lee, Martin J. Wainwright, Michael I. Jordan

For loss functions that are $L$-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk $\epsilon>0$.

$\ell_1$-regularized Neural Networks are Improperly Learnable in Polynomial Time

no code implementations13 Oct 2015 Yuchen Zhang, Jason D. Lee, Michael I. Jordan

The sample complexity and the time complexity of the presented method are polynomial in the input dimension and in $(1/\epsilon,\log(1/\delta), F(k, L))$, where $F(k, L)$ is a function depending on $(k, L)$ and on the activation function, independent of the number of neurons.

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity

no code implementations27 Jul 2015 Jason D. Lee, Qihang Lin, Tengyu Ma, Tianbao Yang

We also prove a lower bound for the number of rounds of communication for a broad class of distributed first-order methods including the proposed algorithms in this paper.

Distributed Optimization

Selective Inference and Learning Mixed Graphical Models

no code implementations30 Jun 2015 Jason D. Lee

We present the Condition-on-Selection method that allows for valid selective inference, and study its application to the lasso, and several other selection algorithms.

Model Selection

Communication-efficient sparse regression: a one-shot approach

no code implementations14 Mar 2015 Jason D. Lee, Yuekai Sun, Qiang Liu, Jonathan E. Taylor

We devise a one-shot approach to distributed sparse regression in the high-dimensional setting.

regression

Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices

1 code implementation NeurIPS 2014 Austin R. Benson, Jason D. Lee, Bartek Rajwa, David F. Gleich

We demonstrate the efficacy of these algorithms on terabyte-sized synthetic matrices and real-world matrices from scientific computing and bioinformatics.

On model selection consistency of penalized M-estimators: a geometric theory

no code implementations NeurIPS 2013 Jason D. Lee, Yuekai Sun, Jonathan E. Taylor

Penalized M-estimators are used in diverse areas of science and engineering to fit high-dimensional models with some low-dimensional structure.

Model Selection

Using Multiple Samples to Learn Mixture Models

no code implementations NeurIPS 2013 Jason D. Lee, Ran Gilad-Bachrach, Rich Caruana

In the mixture models problem it is assumed that there are $K$ distributions $\theta_{1},\ldots,\theta_{K}$ and one gets to observe a sample from a mixture of these distributions with unknown coefficients.

On model selection consistency of regularized M-estimators

no code implementations31 May 2013 Jason D. Lee, Yuekai Sun, Jonathan E. Taylor

Regularized M-estimators are used in diverse areas of science and engineering to fit high-dimensional models with some low-dimensional structure.

Model Selection

Proximal Newton-type methods for minimizing composite functions

1 code implementation7 Jun 2012 Jason D. Lee, Yuekai Sun, Michael A. Saunders

We generalize Newton-type methods for minimizing smooth functions to handle a sum of two convex functions: a smooth function and a nonsmooth function with a simple proximal mapping.
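
As an illustration of the proximal Newton step (with an assumed diagonal quadratic smooth part and an $\ell_1$ nonsmooth part; hypothetical problem data, not the paper's general algorithm), the Hessian-scaled proximal mapping reduces to coordinate-wise soft-thresholding:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_newton_l1(h_diag, b, lam, x0, iters=25):
    """Proximal Newton sketch for g(x) = 0.5 x^T diag(h) x - b^T x plus
    lam * ||x||_1. With a diagonal Hessian, the scaled proximal step is
    coordinate-wise soft-thresholding of the Newton update."""
    x = x0.astype(float).copy()
    for _ in range(iters):
        grad = h_diag * x - b
        x = soft_threshold(x - grad / h_diag, lam / h_diag)
    return x

h = np.array([2.0, 1.0])
b = np.array([3.0, 0.5])
x = prox_newton_l1(h, b, lam=1.0, x0=np.zeros(2))
print(x)  # [1. 0.]: soft(b/h, lam/h) coordinate-wise
```

With a general (non-diagonal) Hessian the scaled proximal subproblem no longer has a closed form, which is where the paper's inexact subproblem solves and convergence analysis come in.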


Learning Mixed Graphical Models

no code implementations22 May 2012 Jason D. Lee, Trevor J. Hastie

We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning.

Practical Large-Scale Optimization for Max-norm Regularization

no code implementations NeurIPS 2010 Jason D. Lee, Ben Recht, Nathan Srebro, Joel Tropp, Ruslan R. Salakhutdinov

The max-norm was proposed as a convex matrix regularizer by Srebro et al. (2004) and was shown to be empirically superior to the trace-norm for collaborative filtering problems.

Clustering Collaborative Filtering
