1 code implementation • 31 May 2024 • Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun
Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training.
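Below is a minimal sketch, not the paper's method, of how such a length-generalization gap can be measured: train on short addition problems and score exact-match accuracy on strictly longer ones. The digit ranges, dataset sizes, and the `model` interface are illustrative assumptions.

```python
import random

def make_addition_example(num_digits):
    """Sample one integer-addition example whose operands have `num_digits` digits."""
    a = random.randint(10 ** (num_digits - 1), 10 ** num_digits - 1)
    b = random.randint(10 ** (num_digits - 1), 10 ** num_digits - 1)
    return f"{a}+{b}=", str(a + b)

# Hypothetical split: train on 1-10 digit operands, evaluate on longer 11-20 digit ones.
train_set = [make_addition_example(random.randint(1, 10)) for _ in range(10_000)]
test_longer = [make_addition_example(random.randint(11, 20)) for _ in range(1_000)]

def exact_match_accuracy(model, examples):
    """`model(prompt)` is any trained seq2seq predictor returning the answer string."""
    return sum(model(prompt) == answer for prompt, answer in examples) / len(examples)
```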
no code implementations • 25 May 2024 • Minhak Song, Kwangjun Ahn, Chulhee Yun
This suggests that the observed alignment between the gradient and the dominant subspace is spurious.
no code implementations • 16 Feb 2024 • Jaewook Lee, Hanseul Cho, Chulhee Yun
The Gradient Descent-Ascent (GDA) algorithm, designed to solve minimax optimization problems, takes the descent and ascent steps either simultaneously (Sim-GDA) or alternately (Alt-GDA).
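A minimal sketch contrasting the two update orders on the toy bilinear objective $f(x, y) = xy$; the step size and horizon are arbitrary illustrative choices, not values from the paper.

```python
def grad_x(x, y):  # df/dx for f(x, y) = x * y
    return y

def grad_y(x, y):  # df/dy for f(x, y) = x * y
    return x

def sim_gda(x, y, lr=0.1, steps=100):
    """Simultaneous GDA: both players update using the *same* iterate (x, y)."""
    for _ in range(steps):
        x, y = x - lr * grad_x(x, y), y + lr * grad_y(x, y)
    return x, y

def alt_gda(x, y, lr=0.1, steps=100):
    """Alternating GDA: the ascent step already sees the updated x."""
    for _ in range(steps):
        x = x - lr * grad_x(x, y)
        y = y + lr * grad_y(x, y)
    return x, y
```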
no code implementations • 25 Nov 2023 • Prin Phunyaphibarn, Junghyun Lee, Bohan Wang, Huishuai Zhang, Chulhee Yun
Although gradient descent with Polyak's momentum is widely used in modern machine and deep learning, a concrete understanding of its effects on the training trajectory remains elusive.
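For reference, a minimal sketch of the heavy-ball update $x_{t+1} = x_t - \eta \nabla f(x_t) + \beta (x_t - x_{t-1})$; the 1-D quadratic objective and the hyperparameters below are illustrative only.

```python
def heavy_ball(grad, x0, lr=0.01, beta=0.9, steps=1000):
    """Gradient descent with Polyak's (heavy-ball) momentum."""
    x_prev, x = x0, x0
    for _ in range(steps):
        x_next = x - lr * grad(x) + beta * (x - x_prev)  # momentum term uses the previous iterate
        x_prev, x = x, x_next
    return x

# Toy example (not from the paper): minimize f(x) = 0.5 * x**2, so grad(x) = x.
x_star = heavy_ball(lambda x: x, x0=5.0)
```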
1 code implementation • NeurIPS 2023 • Junghyun Lee, Hanseul Cho, Se-Young Yun, Chulhee Yun
Fair Principal Component Analysis (PCA) is a problem setting where we aim to perform PCA while making the resulting representation fair in that the projected distributions, conditional on the sensitive attributes, match one another.
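A rough sketch of what "matching projected distributions across sensitive groups" can mean in practice: run ordinary PCA and compare group-conditional means and covariances of the projection. This is only a crude diagnostic under simple assumptions, not the paper's Fair PCA algorithm.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal directions (plain PCA)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def group_gap(Z, groups):
    """Crude fairness gap: distance between group-conditional means and covariances."""
    g0, g1 = np.unique(groups)[:2]          # assumes a binary sensitive attribute
    Z0, Z1 = Z[groups == g0], Z[groups == g1]
    mean_gap = np.linalg.norm(Z0.mean(axis=0) - Z1.mean(axis=0))
    cov_gap = np.linalg.norm(np.cov(Z0.T) - np.cov(Z1.T))
    return mean_gap, cov_gap
```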
1 code implementation • 2 Oct 2023 • Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra
Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics.
no code implementations • 1 Jun 2023 • Junsoo Oh, Chulhee Yun
We investigate how pair-wise data augmentation techniques like Mixup affect the sample complexity of finding optimal decision boundaries in a binary linear classification problem.
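A minimal sketch of pair-wise Mixup in this setting: each synthetic example is a convex combination of two training points and of their labels. The Beta($\alpha$, $\alpha$) mixing distribution and binary 0/1 labels are assumptions for illustration.

```python
import numpy as np

def mixup_batch(X, y, alpha=1.0, seed=0):
    """Pair-wise Mixup: convex combinations of randomly paired examples and labels."""
    rng = np.random.default_rng(seed)
    n = len(X)
    lam = rng.beta(alpha, alpha, size=(n, 1))          # per-pair mixing coefficients
    perm = rng.permutation(n)                          # random partner for each point
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam[:, 0] * y + (1 - lam[:, 0]) * y[perm]  # soft labels in [0, 1]
    return X_mix, y_mix
```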
1 code implementation • 13 Mar 2023 • Jaeyoung Cha, Jaewook Lee, Chulhee Yun
We study convergence lower bounds of without-replacement stochastic gradient descent (SGD) for solving smooth (strongly-)convex finite-sum minimization problems.
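For concreteness, a minimal sketch of the sampling scheme under study (random reshuffling): each epoch visits every component function exactly once in a freshly shuffled order, unlike i.i.d. with-replacement SGD. The gradient-oracle interface is an assumption for illustration.

```python
import numpy as np

def random_reshuffling_sgd(grads, x0, lr, epochs, seed=0):
    """Without-replacement SGD: per-epoch random permutation of the n component gradients.

    `grads` is a list of functions, grads[i](x) = gradient of the i-th component f_i at x.
    """
    rng = np.random.default_rng(seed)
    x = x0
    n = len(grads)
    for _ in range(epochs):
        for i in rng.permutation(n):   # each f_i is used exactly once per epoch
            x = x - lr * grads[i](x)
    return x
```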
no code implementations • 24 Feb 2023 • David X. Wu, Chulhee Yun, Suvrit Sra
We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence.
no code implementations • 12 Oct 2022 • Hanseul Cho, Chulhee Yun
Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems.
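A minimal sketch of a finite-sum SGDA loop, shown here with per-epoch shuffling of the components; the step sizes and gradient-oracle interface are placeholders rather than the paper's exact scheme.

```python
import numpy as np

def shuffled_sgda(grad_x_list, grad_y_list, x, y, lr_x, lr_y, epochs, seed=0):
    """Stochastic GDA over a finite sum, visiting components in a shuffled order each epoch."""
    rng = np.random.default_rng(seed)
    n = len(grad_x_list)
    for _ in range(epochs):
        for i in rng.permutation(n):   # one pass over the components per epoch
            x, y = x - lr_x * grad_x_list[i](x, y), y + lr_y * grad_y_list[i](x, y)
    return x, y
```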
no code implementations • ICLR 2022 • Chulhee Yun, Shashank Rajput, Suvrit Sra
In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods.
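A minimal sketch contrasting the two methods: local SGD (federated averaging) runs several local steps between communications and then averages the workers' iterates, whereas minibatch SGD averages the workers' gradients at every step. The oracle interface and hyperparameters are illustrative assumptions.

```python
import numpy as np

def local_sgd(worker_grads, x0, lr, rounds, local_steps):
    """Local SGD / FedAvg: each worker takes `local_steps` steps, then iterates are averaged."""
    x = x0
    for _ in range(rounds):
        local_iterates = []
        for grad in worker_grads:            # one stochastic gradient oracle per worker
            xi = x
            for _ in range(local_steps):
                xi = xi - lr * grad(xi)
            local_iterates.append(xi)
        x = np.mean(local_iterates, axis=0)  # communication: average the local iterates
    return x

def minibatch_sgd(worker_grads, x0, lr, rounds):
    """Minibatch SGD baseline: average the workers' gradients at every step."""
    x = x0
    for _ in range(rounds):
        x = x - lr * np.mean([grad(x) for grad in worker_grads], axis=0)
    return x
```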
no code implementations • 12 Mar 2021 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality by supplementing it with another inequality that accounts for single-shuffle, a widely used without-replacement sampling scheme that shuffles only once at the beginning and is overlooked in the Recht-Ré conjecture.
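For context, a hedged restatement of the conjecture being extended (from memory of Recht and Ré's formulation, so details may differ): for positive semidefinite matrices $A_1, \dots, A_n$ and any $1 \le m \le n$, the without-replacement average of $m$-fold products is conjectured to be dominated in norm by the with-replacement average,
\[
\left\| \frac{(n-m)!}{n!} \sum_{\substack{j_1, \dots, j_m \\ \text{pairwise distinct}}} A_{j_1} \cdots A_{j_m} \right\|
\;\le\;
\left\| \Big( \frac{1}{n} \sum_{i=1}^{n} A_i \Big)^{m} \right\|.
\]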
no code implementations • NeurIPS 2020 • Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar
We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function.
no code implementations • 26 Oct 2020 • Sejun Park, Jaeho Lee, Chulhee Yun, Jinwoo Shin
It is known that $O(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs.
no code implementations • ICLR 2021 • Chulhee Yun, Shankar Krishnan, Hossein Mobahi
For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network.
no code implementations • ICLR 2021 • Sejun Park, Chulhee Yun, Jaeho Lee, Jinwoo Shin
In this work, we provide the first definitive result in this direction for networks using the ReLU activation function: the minimum width required for the universal approximation of the $L^p$ functions is exactly $\max\{d_x+1, d_y\}$.
no code implementations • NeurIPS 2020 • Kwangjun Ahn, Chulhee Yun, Suvrit Sra
We study without-replacement SGD for solving finite-sum optimization problems.
no code implementations • ICML 2020 • Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
The attention-based Transformer architecture has enabled significant advances in natural language processing.
no code implementations • ICLR 2020 • Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
In this paper, we establish that Transformer models are universal approximators of continuous permutation-equivariant sequence-to-sequence functions with compact support, which is quite surprising given the degree of parameter sharing in these models.
no code implementations • NeurIPS 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor.
no code implementations • NeurIPS 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
We also show that width $\Theta(\sqrt{N})$ is both necessary and sufficient for memorizing $N$ data points, establishing tight bounds on memorization capacity.
no code implementations • ICLR 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
In the benign case, we solve a single equality-constrained QP and prove that projected gradient descent converges to its solution exponentially fast.
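A minimal sketch, not the paper's construction, of projected gradient descent on an equality-constrained QP $\min_x \frac{1}{2} x^\top Q x - b^\top x$ subject to $Ax = c$, using the closed-form projection onto the affine feasible set; the problem data and step size are placeholders.

```python
import numpy as np

def project_affine(x, A, c):
    """Euclidean projection onto {x : A x = c} (A assumed to have full row rank)."""
    correction = A.T @ np.linalg.solve(A @ A.T, A @ x - c)
    return x - correction

def projected_gd_qp(Q, b, A, c, x0, lr=0.01, steps=1000):
    """Projected gradient descent on f(x) = 0.5 x^T Q x - b^T x over {x : A x = c}."""
    x = project_affine(x0, A, c)               # start from a feasible point
    for _ in range(steps):
        x = project_affine(x - lr * (Q @ x - b), A, c)
    return x
```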
no code implementations • ICLR 2019 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust.
no code implementations • ICLR 2018 • Chulhee Yun, Suvrit Sra, Ali Jadbabaie
We study the error landscape of deep linear and nonlinear neural networks with the squared error loss.