Search Results for author: Rachel Ward

Found 38 papers, 5 papers with code

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

no code implementations22 Apr 2024 Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Chen Liang, Weishung Liu, Eric Lin, Zeqi Lin, Piyush Madan, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Xia Song, Masahiro Tanaka, Xin Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Michael Wyatt, Can Xu, Jiahang Xu, Sonali Yadav, Fan Yang, ZiYi Yang, Donghan Yu, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.

Language Modelling

TinyGSM: achieving >80% on GSM8k with small language models

no code implementations14 Dec 2023 Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, Yi Zhang

Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains 34B.

Arithmetic Reasoning GSM8K +2

Convergence of Alternating Gradient Descent for Matrix Factorization

no code implementations NeurIPS 2023 Rachel Ward, Tamara G. Kolda

We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, $T = C (\frac{\sigma_1(\mathbf{A})}{\sigma_r(\mathbf{A})})^2 \log(1/\epsilon)$ iterations of alternating gradient descent suffice to reach an $\epsilon$-optimal factorization $\| \mathbf{A} - \mathbf{X} \mathbf{Y}^{T} \|^2 \leq \epsilon \| \mathbf{A}\|^2$ with high probability starting from an atypical random initialization.
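For readers who want to see the shape of the method, here is a minimal NumPy sketch of alternating gradient descent on the factorization objective $\| \mathbf{A} - \mathbf{X} \mathbf{Y}^{T} \|_F^2$; the step size, iteration count, and small-scale random initialization are illustrative choices, not the paper's prescribed (atypical) initialization or constants.

```python
import numpy as np

def alternating_gd(A, r, step=1e-3, iters=5000, init_scale=1e-3, seed=0):
    """Alternate gradient steps on X and Y for min ||A - X Y^T||_F^2.

    Step size, iteration count, and initialization scale are illustrative;
    the paper's analysis prescribes its own (atypical) random initialization.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = init_scale * rng.standard_normal((m, r))
    Y = init_scale * rng.standard_normal((n, r))
    for _ in range(iters):
        R = X @ Y.T - A           # residual
        X = X - step * (R @ Y)    # gradient step in X with Y frozen
        R = X @ Y.T - A
        Y = Y - step * (R.T @ X)  # gradient step in Y with X frozen
    return X, Y

# toy usage: exact rank-3 target
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 30))
X, Y = alternating_gd(A, r=3)
print(np.linalg.norm(A - X @ Y.T) ** 2 / np.linalg.norm(A) ** 2)
```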

Robust Implicit Regularization via Weight Normalization

no code implementations9 May 2023 Hung-Hsu Chou, Holger Rauhut, Rachel Ward

By analyzing key invariants of the gradient flow and using the Łojasiewicz Theorem, we show that weight normalization also has an implicit bias towards sparse solutions in the diagonal linear model, but that, in contrast to plain gradient flow, weight normalization enables a robust bias that persists even if the weights are initialized at a practically large scale.
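The paper analyzes the gradient flow of the weight-normalized parameterization; the sketch below only illustrates that parameterization, $w = g\, v/\|v\|$, trained by plain gradient descent on a toy regression problem. The model, data, step size, and initialization are illustrative, and this generic sketch does not reproduce the paper's diagonal linear model or its sparsity result.

```python
import numpy as np

def weight_norm_regression(X, y, step=0.05, iters=4000, seed=0):
    """Linear regression with the weight-normalization reparameterization
    w = g * v / ||v||, trained by plain gradient descent on (g, v).

    Step size, iteration count, and initialization are illustrative; the
    paper studies the continuous-time gradient flow of this kind of
    parameterization on a diagonal linear model.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    v = rng.standard_normal(d)   # direction parameters
    g = 1.0                      # scale parameter
    for _ in range(iters):
        vn = np.linalg.norm(v)
        w = g * v / vn
        r = X @ w - y                         # residual
        grad_w = X.T @ r / n                  # gradient of 0.5*mean squared error in w
        g_grad = grad_w @ v / vn              # chain rule: dw/dg = v/||v||
        v_grad = (g / vn) * (grad_w - (grad_w @ v / vn**2) * v)  # tangential part only
        g -= step * g_grad
        v -= step * v_grad
    return g * v / np.linalg.norm(v)

# toy usage: sparse ground truth on an overdetermined problem
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
print(np.round(weight_norm_regression(X, X @ w_true), 2))
```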

Adaptively Weighted Data Augmentation Consistency Regularization for Robust Optimization under Concept Shift

no code implementations4 Oct 2022 Yijun Dong, Yuege Xie, Rachel Ward

At the saddle point of the underlying objective, the weights assign label-dense samples to the supervised loss and label-sparse samples to the unsupervised consistency regularization.

Data Augmentation Image Segmentation +3

On the fast convergence of minibatch heavy ball momentum

no code implementations15 Jun 2022 Raghu Bollapragada, Tyler Chen, Rachel Ward

Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature.
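A minimal sketch of the method in question, minibatch SGD with heavy ball (Polyak) momentum on a least-squares objective; the step size, momentum parameter, and batch size are illustrative defaults rather than the values treated in the paper's analysis.

```python
import numpy as np

def minibatch_heavy_ball(A, b, step=0.1, momentum=0.9, batch=10, epochs=50, seed=0):
    """Minibatch SGD with heavy ball momentum on 0.5*||Ax - b||^2 averaged over rows."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    x_prev = x.copy()
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), n // batch):
            grad = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)  # minibatch gradient
            x_new = x - step * grad + momentum * (x - x_prev)   # heavy ball update
            x_prev, x = x, x_new
    return x

# toy usage: consistent linear system
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20))
x_true = rng.standard_normal(20)
print(np.linalg.norm(minibatch_heavy_ball(A, A @ x_true) - x_true))
```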

How catastrophic can catastrophic forgetting be in linear regression?

no code implementations19 May 2022 Itay Evron, Edward Moroshko, Rachel Ward, Nati Srebro, Daniel Soudry

In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas.

Continual Learning regression

An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models

no code implementations16 May 2022 Nhat Ho, Tongzheng Ren, Sujay Sanghavi, Purnamrita Sarkar, Rachel Ward

Therefore, the total computational complexity of the EGD algorithm is optimal and exponentially cheaper than that of GD for solving parameter estimation in non-regular statistical models, while being comparable to that of GD in regular statistical settings.
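A rough sketch of the idea of an exponentially increasing step size, applied to a toy degenerate loss; the schedule, stopping rule, and constants below are assumptions for illustration, not the paper's exact EGD algorithm.

```python
import numpy as np

def exponential_step_gd(grad, loss, x0, step0=0.1, rho=1.2, max_iter=200):
    """Gradient descent with an exponentially increasing step size
    step_t = step0 * rho**t, stopped once the loss stops decreasing.

    The stopping rule and constants are illustrative stand-ins for the
    paper's analysis, not its exact schedule.
    """
    x = np.asarray(x0, dtype=float)
    best_x, best_loss = x.copy(), loss(x)
    for t in range(max_iter):
        x = x - step0 * rho**t * grad(x)
        l = loss(x)
        if not np.isfinite(l) or l > best_loss:
            break                        # the growing step has started to overshoot
        best_x, best_loss = x.copy(), l
    return best_x

# toy usage: a "flat" loss f(x) = x**4, which constant-step GD minimizes only slowly
f = lambda x: np.sum(x**4)
g = lambda x: 4 * x**3
print(exponential_step_gd(g, f, x0=np.array([2.0])))
```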

Concentration of Random Feature Matrices in High-Dimensions

no code implementations14 Apr 2022 Zhijun Chen, Hayden Schaeffer, Rachel Ward

The spectra of random feature matrices provide essential information on the conditioning of the linear system used in random feature regression problems and are thus connected to the consistency and generalization of random feature models.

Vocal Bursts Intensity Prediction

Side Effects of Learning from Low-dimensional Data Embedded in a Euclidean Space

no code implementations1 Mar 2022 Juncai He, Richard Tsai, Rachel Ward

In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input.

Sample Efficiency of Data Augmentation Consistency Regularization

no code implementations24 Feb 2022 Shuo Yang, Yijun Dong, Rachel Ward, Inderjit S. Dhillon, Sujay Sanghavi, Qi Lei

Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data.

Data Augmentation Generalization Bounds

The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

no code implementations11 Feb 2022 Matthew Faw, Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari, Sanjay Shakkottai, Rachel Ward

We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives.
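For reference, AdaGrad-Norm maintains a single step size driven by the running sum of squared stochastic gradient norms; a minimal sketch follows, with eta and b0 as illustrative constants (the point of the analysis is that convergence holds without tuning them to the unknown smoothness or noise level).

```python
import numpy as np

def adagrad_norm(grad, x0, b0=0.1, eta=1.0, iters=1000):
    """AdaGrad-Norm: x_{t+1} = x_t - (eta / b_t) * g_t,
    with b_t^2 = b_0^2 + sum of ||g_s||^2 observed so far."""
    x = np.asarray(x0, dtype=float)
    b2 = b0**2
    for _ in range(iters):
        g = grad(x)
        b2 += np.dot(g, g)               # accumulate squared gradient norms
        x = x - (eta / np.sqrt(b2)) * g  # one global, shrinking step size
    return x

# toy usage: noisy gradients of the quadratic 0.5*||x||^2
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
print(np.linalg.norm(adagrad_norm(noisy_grad, np.ones(5))))
```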

SHRIMP: Sparser Random Feature Models via Iterative Magnitude Pruning

1 code implementation7 Dec 2021 Yuege Xie, Bobby Shi, Hayden Schaeffer, Rachel Ward

Inspired by the success of the iterative magnitude pruning technique in finding lottery tickets of neural networks, we propose a new method -- Sparser Random Feature Models via IMP (ShRIMP) -- to efficiently fit high-dimensional data with inherent low-dimensional structure in the form of sparse variable dependencies.

Additive models Computational Efficiency +1
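The SHRIMP entry above combines random feature regression with iterative magnitude pruning; the sketch below is a rough, simplified version of that loop (fit, prune the smallest-magnitude coefficients, refit), with the feature map, pruning fraction, ridge parameter, and round count chosen purely for illustration.

```python
import numpy as np

def shrimp_style_fit(X, y, n_features=300, prune_frac=0.5, rounds=4, lam=1e-3, seed=0):
    """Random feature ridge regression with iterative magnitude pruning.

    A simplified sketch in the spirit of SHRIMP; the feature distribution and
    all hyperparameters here are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_features))        # random feature weights
    b = rng.uniform(0, 2 * np.pi, n_features)
    features = lambda Z: np.cos(Z @ W + b)          # random Fourier-style features
    active = np.arange(n_features)
    for _ in range(rounds):
        Phi = features(X)[:, active]
        # ridge solve for the coefficients on the currently active features
        c = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(active)), Phi.T @ y)
        keep = np.argsort(np.abs(c))[int(prune_frac * len(active)):]
        active = active[np.sort(keep)]              # drop small-magnitude features
    Phi = features(X)[:, active]
    c = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(active)), Phi.T @ y)
    return active, c

# toy usage: target depends on only two of ten coordinates
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 10))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2
active, c = shrimp_style_fit(X, y)
print(len(active))
```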

Theoretical Analysis of Consistency Regularization with Limited Augmented Data

no code implementations29 Sep 2021 Shuo Yang, Yijun Dong, Rachel Ward, Inderjit S Dhillon, Sujay Sanghavi, Qi Lei

Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data.

Data Augmentation Generalization Bounds +1

Learning to Forecast Dynamical Systems from Streaming Data

1 code implementation20 Sep 2021 Dimitris Giannakis, Amelia Henriksen, Joel A. Tropp, Rachel Ward

This algorithm dramatically reduces the costs of training and prediction without sacrificing forecasting skill.

regression Time Series +1

AdaLoss: A computationally-efficient and provably convergent adaptive gradient method

no code implementations17 Sep 2021 Xiaoxia Wu, Yuege Xie, Simon Du, Rachel Ward

We propose a computationally-friendly adaptive learning rate schedule, "AdaLoss", which directly uses the information of the loss function to adjust the stepsize in gradient descent methods.

regression
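The AdaLoss abstract above says the step size is adjusted using information from the loss itself; the sketch below is one illustrative reading of that idea, accumulating observed loss values into the step-size denominator in place of AdaGrad-Norm's squared gradient norms. The exact accumulation rule and constants are assumptions, not the paper's specification.

```python
import numpy as np

def adaloss(grad_and_loss, x0, b0=0.1, eta=1.0, iters=1000):
    """An AdaLoss-style schedule: grow the step-size denominator with observed
    loss values rather than squared gradient norms (illustrative reading)."""
    x = np.asarray(x0, dtype=float)
    b2 = b0**2
    for _ in range(iters):
        g, l = grad_and_loss(x)
        b2 += l                           # step size shrinks as loss accumulates
        x = x - (eta / np.sqrt(b2)) * g
    return x

# toy usage: least squares 0.5*mean((Ax - y)^2)
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
x_star = rng.standard_normal(10)
y = A @ x_star
gl = lambda x: (A.T @ (A @ x - y) / 100, 0.5 * np.mean((A @ x - y) ** 2))
print(np.linalg.norm(adaloss(gl, np.zeros(10)) - x_star))
```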

Bootstrapping the error of Oja's algorithm

no code implementations NeurIPS 2021 Robert Lunde, Purnamrita Sarkar, Rachel Ward

We consider the problem of quantifying uncertainty for the estimation error of the leading eigenvector from Oja's algorithm for streaming principal component analysis, where the data are generated IID from some unknown distribution.

Generalization Bounds for Sparse Random Feature Expansions

2 code implementations4 Mar 2021 Abolfazl Hashemi, Hayden Schaeffer, Robert Shi, Ufuk Topcu, Giang Tran, Rachel Ward

In particular, we provide generalization bounds for functions in a certain class (that is dense in a reproducing kernel Hilbert space) depending on the number of samples and the distribution of features.

BIG-bench Machine Learning Compressive Sensing +1

Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates

no code implementations6 Feb 2021 De Huang, Jonathan Niles-Weed, Rachel Ward

We analyze Oja's algorithm for streaming $k$-PCA and prove that it achieves performance nearly matching that of an optimal offline algorithm.
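A minimal sketch of Oja's algorithm with rank-$k$ updates for streaming $k$-PCA; the constant step size and re-orthonormalization via QR are illustrative implementation choices rather than the schedule analyzed in the paper.

```python
import numpy as np

def oja_k_pca(stream, d, k, eta=0.01):
    """Oja's algorithm for streaming k-PCA: Q <- QR(Q + eta * x x^T Q),
    applied one sample at a time."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal start
    for x in stream:
        Q = Q + eta * np.outer(x, x @ Q)              # stochastic power-iteration step
        Q, _ = np.linalg.qr(Q)                        # re-orthonormalize
    return Q

# toy usage: data with a dominant 2-dimensional subspace
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((20, 2)))
stream = (U @ (3 * rng.standard_normal(2)) + 0.1 * rng.standard_normal(20)
          for _ in range(5000))
Q = oja_k_pca(stream, d=20, k=2)
print(np.linalg.norm(U.T @ Q))  # close to sqrt(2) when the subspace is recovered
```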

Overparameterization and generalization error: weighted trigonometric interpolation

no code implementations15 Jun 2020 Yuege Xie, Hung-Hsu Chou, Holger Rauhut, Rachel Ward

Motivated by surprisingly good generalization properties of learned deep neural networks in overparameterized scenarios and by the related double descent phenomenon, this paper analyzes the relation between smoothness and low generalization error in an overparameterized linear learning problem.

Linear Convergence of Adaptive Stochastic Gradient Descent

no code implementations28 Aug 2019 Yuege Xie, Xiaoxia Wu, Rachel Ward

We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak-Łojasiewicz (PL) inequality.

Bias of Homotopic Gradient Descent for the Hinge Loss

no code implementations26 Jul 2019 Denali Molitor, Deanna Needell, Rachel Ward

Gradient descent is a simple and widely used optimization method for machine learning.

BIG-bench Machine Learning

AdaOja: Adaptive Learning Rates for Streaming PCA

1 code implementation28 May 2019 Amelia Henriksen, Rachel Ward

We also show that AdaOja performs comparably to state-of-the-art algorithms (History PCA and Streaming Power Method) in the same streaming PCA setting.

Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network

no code implementations19 Feb 2019 Xiaoxia Wu, Simon S. Du, Rachel Ward

Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks.

Recovery guarantees for polynomial approximation from dependent data with outliers

no code implementations25 Nov 2018 Lam Si Tung Ho, Hayden Schaeffer, Giang Tran, Rachel Ward

In this work, we study the problem of learning nonlinear functions from corrupted and dependent data.

AdaGrad stepsizes: Sharp convergence over nonconvex landscapes

1 code implementation5 Jun 2018 Rachel Ward, Xiaoxia Wu, Leon Bottou

Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune the stepsize schedule.

Stochastic Optimization

WNGrad: Learn the Learning Rate in Gradient Descent

no code implementations7 Mar 2018 Xiaoxia Wu, Rachel Ward, Léon Bottou

Adjusting the learning rate schedule in stochastic gradient methods is an important unresolved problem which requires tuning in practice.

A polynomial-time relaxation of the Gromov-Hausdorff distance

no code implementations17 Oct 2016 Soledad Villar, Afonso S. Bandeira, Andrew J. Blumberg, Rachel Ward

The Gromov-Hausdorff distance provides a metric on the set of isometry classes of compact metric spaces.

Clustering subgaussian mixtures by semidefinite programming

no code implementations22 Feb 2016 Dustin G. Mixon, Soledad Villar, Rachel Ward

We introduce a model-free relax-and-round algorithm for k-means clustering based on a semidefinite relaxation due to Peng and Wei.

Clustering

The local convexity of solving systems of quadratic equations

no code implementations25 Jun 2015 Chris D. White, Sujay Sanghavi, Rachel Ward

This paper considers the recovery of a rank $r$ positive semidefinite matrix $X X^T\in\mathbb{R}^{n\times n}$ from $m$ scalar measurements of the form $y_i := a_i^T X X^T a_i$ (i.e., quadratic measurements of $X$).

Quantum State Tomography

Relax, no need to round: integrality of clustering formulations

no code implementations18 Aug 2014 Pranjal Awasthi, Afonso S. Bandeira, Moses Charikar, Ravishankar Krishnaswamy, Soledad Villar, Rachel Ward

Under the same distributional model, the $k$-means LP relaxation fails to recover such clusters at separation as large as $\Delta = 4$.

Clustering

One-bit compressive sensing with norm estimation

no code implementations28 Apr 2014 Karin Knudson, Rayan Saab, Rachel Ward

Consider the recovery of an unknown signal $x$ from quantized linear measurements.

Compressive Sensing
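As a point of reference for the one-bit setting, the sketch below implements a simple baseline that estimates only the direction of the signal by backprojecting the sign measurements and hard-thresholding; the paper's actual contribution, recovering the norm as well, relies on modified measurement schemes not reproduced here.

```python
import numpy as np

def one_bit_direction_estimate(A, y, s):
    """Simple one-bit compressive sensing baseline: correlate the sign
    measurements with the columns of A, keep the s largest entries.

    Sign measurements alone carry no scale information, so only the
    direction of x is estimated here.
    """
    z = A.T @ y / len(y)                  # backprojection of the sign measurements
    x_hat = np.zeros_like(z)
    support = np.argsort(np.abs(z))[-s:]  # hard-threshold to the s largest entries
    x_hat[support] = z[support]
    return x_hat / np.linalg.norm(x_hat)

# toy usage: sparse unit-norm signal, y_i = sign(<a_i, x>)
rng = np.random.default_rng(0)
d, m, s = 200, 1000, 5
x = np.zeros(d)
x[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
x /= np.linalg.norm(x)
A = rng.standard_normal((m, d))
y = np.sign(A @ x)
print(np.dot(one_bit_direction_estimate(A, y, s), x))  # correlation close to 1
```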

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

no code implementations NeurIPS 2014 Deanna Needell, Nathan Srebro, Rachel Ward

Furthermore, we show how reweighting the sampling distribution (i.e., importance sampling) is necessary in order to further improve convergence, and we obtain a linear dependence on the average smoothness, dominating previous results.
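A minimal sketch of SGD for least squares with rows sampled proportionally to their squared norms (the randomized Kaczmarz weighting) and the gradient reweighted by the inverse sampling probability so it stays unbiased; the constant step size is an illustrative choice.

```python
import numpy as np

def sgd_importance_sampling(A, b, iters=5000, seed=0):
    """SGD on (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 with rows sampled
    proportionally to ||a_i||^2 and gradients reweighted to stay unbiased."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    row_sq = np.sum(A**2, axis=1)
    p = row_sq / row_sq.sum()                  # importance sampling distribution
    L_bar = row_sq.mean()                      # average per-row smoothness
    x = np.zeros(d)
    for _ in range(iters):
        i = rng.choice(n, p=p)
        g = (A[i] @ x - b[i]) * A[i]           # gradient of 0.5*(a_i^T x - b_i)^2
        x = x - (1.0 / (2 * L_bar)) * g / (n * p[i])  # unbiased reweighted step
    return x

# toy usage: consistent system with very uneven row norms
rng = np.random.default_rng(1)
A = rng.standard_normal((300, 10)) * rng.uniform(0.1, 10, (300, 1))
x_true = rng.standard_normal(10)
print(np.linalg.norm(sgd_importance_sampling(A, A @ x_true) - x_true))
```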

Recovery guarantees for exemplar-based clustering

no code implementations12 Sep 2013 Abhinav Nellore, Rachel Ward

For a certain class of distributions, we prove that the linear programming relaxation of $k$-medoids clustering (a variant of $k$-means clustering where means are replaced by exemplars from within the dataset) distinguishes points drawn from nonoverlapping balls with high probability once the number of points drawn and the separation distance between any two balls are sufficiently large.

Clustering

Completing Any Low-rank Matrix, Provably

no code implementations12 Jun 2013 Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, Rachel Ward

Matrix completion, i.e., the exact and provable recovery of a low-rank matrix from a small subset of its elements, is currently only known to be possible if the matrix satisfies a restrictive structural constraint, known as incoherence, on its row and column spaces.

Matrix Completion

Near-optimal compressed sensing guarantees for total variation minimization

no code implementations11 Oct 2012 Deanna Needell, Rachel Ward

Consider the problem of reconstructing a multidimensional signal from an underdetermined set of measurements, as in the setting of compressed sensing.

Stable and robust sampling strategies for compressive imaging

no code implementations8 Oct 2012 Felix Krahmer, Rachel Ward

For Fourier measurements and Haar wavelet sparsity, the local coherence can be controlled and bounded explicitly, so for matrices comprised of frequencies sampled from a suitable inverse square power-law density, we can prove the restricted isometry property with near-optimal embedding dimensions.

Compressive Sensing Image Reconstruction
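To illustrate the kind of variable-density sampling discussed above, the sketch below draws one-dimensional Fourier frequencies from an inverse-square power-law density; the normalization and the handling of the zero frequency are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def power_law_frequency_sample(n, m, seed=0):
    """Draw m frequencies in {-n/2, ..., n/2 - 1} from a density proportional
    to 1/max(|k|, 1)^2, i.e. an inverse-square power law capped at k = 0."""
    rng = np.random.default_rng(seed)
    freqs = np.arange(-n // 2, n // 2)
    density = 1.0 / np.maximum(np.abs(freqs), 1) ** 2
    density /= density.sum()
    return rng.choice(freqs, size=m, replace=True, p=density)

# toy usage: low frequencies are sampled far more often than high ones
samples = power_law_frequency_sample(n=256, m=2000)
print(np.mean(np.abs(samples) <= 16))  # most samples concentrate near zero frequency
```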
