Search Results for author: Rachel Ward

Found 38 papers, 5 papers with code

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

no code implementations22 Apr 2024 Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Chen Liang, Weishung Liu, Eric Lin, Zeqi Lin, Piyush Madan, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Xia Song, Masahiro Tanaka, Xin Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Michael Wyatt, Can Xu, Jiahang Xu, Sonali Yadav, Fan Yang, ZiYi Yang, Donghan Yu, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.

Language Modelling

TinyGSM: achieving >80% on GSM8k with small language models

no code implementations14 Dec 2023 Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, Yi Zhang

Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains 34B.

Arithmetic Reasoning GSM8K +2

Convergence of Alternating Gradient Descent for Matrix Factorization

no code implementations NeurIPS 2023 Rachel Ward, Tamara G. Kolda

We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, $T = C (\frac{\sigma_1(\mathbf{A})}{\sigma_r(\mathbf{A})})^2 \log(1/\epsilon)$ iterations of alternating gradient descent suffice to reach an $\epsilon$-optimal factorization $\| \mathbf{A} - \mathbf{X} \mathbf{Y}^{T} \|^2 \leq \epsilon \| \mathbf{A}\|^2$ with high probability starting from an atypical random initialization.
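For readers who want to see the shape of the method, here is a minimal NumPy sketch of alternating gradient descent on the factorization objective $\| \mathbf{A} - \mathbf{X} \mathbf{Y}^{T} \|_F^2$; the step size, iteration count, and small-scale random initialization are illustrative choices, not the paper's prescribed (atypical) initialization or constants.

```python
import numpy as np

def alternating_gd(A, r, step=1e-3, iters=5000, init_scale=1e-3, seed=0):
    """Alternate gradient steps on X and Y for min ||A - X Y^T||_F^2.

    Step size, iteration count, and initialization scale are illustrative;
    the paper's analysis prescribes its own (atypical) random initialization.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = init_scale * rng.standard_normal((m, r))
    Y = init_scale * rng.standard_normal((n, r))
    for _ in range(iters):
        R = X @ Y.T - A           # residual
        X = X - step * (R @ Y)    # gradient step in X with Y frozen
        R = X @ Y.T - A
        Y = Y - step * (R.T @ X)  # gradient step in Y with X frozen
    return X, Y

# toy usage: exact rank-3 target
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 30))
X, Y = alternating_gd(A, r=3)
print(np.linalg.norm(A - X @ Y.T) ** 2 / np.linalg.norm(A) ** 2)
```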

Robust Implicit Regularization via Weight Normalization

no code implementations9 May 2023 Hung-Hsu Chou, Holger Rauhut, Rachel Ward

By analyzing key invariants of the gradient flow and using the Łojasiewicz Theorem, we show that weight normalization also has an implicit bias towards sparse solutions in the diagonal linear model, but that, in contrast to plain gradient flow, weight normalization enables a robust bias that persists even if the weights are initialized at a practically large scale.
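The paper analyzes the gradient flow of the weight-normalized parameterization; the sketch below only illustrates that parameterization, $w = g\, v/\|v\|$, trained by plain gradient descent on a toy regression problem. The model, data, step size, and initialization are illustrative, and this generic sketch does not reproduce the paper's diagonal linear model or its sparsity result.

```python
import numpy as np

def weight_norm_regression(X, y, step=0.05, iters=4000, seed=0):
    """Linear regression with the weight-normalization reparameterization
    w = g * v / ||v||, trained by plain gradient descent on (g, v).

    Step size, iteration count, and initialization are illustrative; the
    paper studies the continuous-time gradient flow of this kind of
    parameterization on a diagonal linear model.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    v = rng.standard_normal(d)   # direction parameters
    g = 1.0                      # scale parameter
    for _ in range(iters):
        vn = np.linalg.norm(v)
        w = g * v / vn
        r = X @ w - y                         # residual
        grad_w = X.T @ r / n                  # gradient of 0.5*mean squared error in w
        g_grad = grad_w @ v / vn              # chain rule: dw/dg = v/||v||
        v_grad = (g / vn) * (grad_w - (grad_w @ v / vn**2) * v)  # tangential part only
        g -= step * g_grad
        v -= step * v_grad
    return g * v / np.linalg.norm(v)

# toy usage: sparse ground truth on an overdetermined problem
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
print(np.round(weight_norm_regression(X, X @ w_true), 2))
```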

Adaptively Weighted Data Augmentation Consistency Regularization for Robust Optimization under Concept Shift

no code implementations4 Oct 2022 Yijun Dong, Yuege Xie, Rachel Ward

At the saddle point of the underlying objective, the weights assign label-dense samples to the supervised loss and label-sparse samples to the unsupervised consistency regularization.

Data Augmentation Image Segmentation +3

On the fast convergence of minibatch heavy ball momentum

no code implementations15 Jun 2022 Raghu Bollapragada, Tyler Chen, Rachel Ward

Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature.
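A minimal sketch of the method in question, minibatch SGD with heavy ball (Polyak) momentum on a least-squares objective; the step size, momentum parameter, and batch size are illustrative defaults rather than the values treated in the paper's analysis.

```python
import numpy as np

def minibatch_heavy_ball(A, b, step=0.1, momentum=0.9, batch=10, epochs=50, seed=0):
    """Minibatch SGD with heavy ball momentum on 0.5*||Ax - b||^2 averaged over rows."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    x_prev = x.copy()
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), n // batch):
            grad = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)  # minibatch gradient
            x_new = x - step * grad + momentum * (x - x_prev)   # heavy ball update
            x_prev, x = x, x_new
    return x

# toy usage: consistent linear system
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20))
x_true = rng.standard_normal(20)
print(np.linalg.norm(minibatch_heavy_ball(A, A @ x_true) - x_true))
```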

How catastrophic can catastrophic forgetting be in linear regression?

no code implementations19 May 2022 Itay Evron, Edward Moroshko, Rachel Ward, Nati Srebro, Daniel Soudry

In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas.

Continual Learning regression

An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models

no code implementations16 May 2022 Nhat Ho, Tongzheng Ren, Sujay Sanghavi, Purnamrita Sarkar, Rachel Ward

Therefore, the total computational complexity of the EGD algorithm is optimal and exponentially cheaper than that of GD for solving parameter estimation in non-regular statistical models, while being comparable to that of GD in regular statistical settings.
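A rough sketch of the idea of an exponentially increasing step size, applied to a toy degenerate loss; the schedule, stopping rule, and constants below are assumptions for illustration, not the paper's exact EGD algorithm.

```python
import numpy as np

def exponential_step_gd(grad, loss, x0, step0=0.1, rho=1.2, max_iter=200):
    """Gradient descent with an exponentially increasing step size
    step_t = step0 * rho**t, stopped once the loss stops decreasing.

    The stopping rule and constants are illustrative stand-ins for the
    paper's analysis, not its exact schedule.
    """
    x = np.asarray(x0, dtype=float)
    best_x, best_loss = x.copy(), loss(x)
    for t in range(max_iter):
        x = x - step0 * rho**t * grad(x)
        l = loss(x)
        if not np.isfinite(l) or l > best_loss:
            break                        # the growing step has started to overshoot
        best_x, best_loss = x.copy(), l
    return best_x

# toy usage: a "flat" loss f(x) = x**4, which constant-step GD minimizes only slowly
f = lambda x: np.sum(x**4)
g = lambda x: 4 * x**3
print(exponential_step_gd(g, f, x0=np.array([2.0])))
```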

Concentration of Random Feature Matrices in High-Dimensions

no code implementations14 Apr 2022 Zhijun Chen, Hayden Schaeffer, Rachel Ward

The spectra of random feature matrices provide essential information on the conditioning of the linear system used in random feature regression problems and are thus connected to the consistency and generalization of random feature models.

Vocal Bursts Intensity Prediction

Side Effects of Learning from Low-dimensional Data Embedded in a Euclidean Space

no code implementations1 Mar 2022 Juncai He, Richard Tsai, Rachel Ward

In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input.

Sample Efficiency of Data Augmentation Consistency Regularization

no code implementations24 Feb 2022 Shuo Yang, Yijun Dong, Rachel Ward, Inderjit S. Dhillon, Sujay Sanghavi, Qi Lei

Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data.

Data Augmentation Generalization Bounds

The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

no code implementations11 Feb 2022 Matthew Faw, Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari, Sanjay Shakkottai, Rachel Ward

We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives.
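For reference, AdaGrad-Norm maintains a single step size driven by the running sum of squared stochastic gradient norms; a minimal sketch follows, with eta and b0 as illustrative constants (the point of the analysis is that convergence holds without tuning them to the unknown smoothness or noise level).

```python
import numpy as np

def adagrad_norm(grad, x0, b0=0.1, eta=1.0, iters=1000):
    """AdaGrad-Norm: x_{t+1} = x_t - (eta / b_t) * g_t,
    with b_t^2 = b_0^2 + sum of ||g_s||^2 observed so far."""
    x = np.asarray(x0, dtype=float)
    b2 = b0**2
    for _ in range(iters):
        g = grad(x)
        b2 += np.dot(g, g)               # accumulate squared gradient norms
        x = x - (eta / np.sqrt(b2)) * g  # one global, shrinking step size
    return x

# toy usage: noisy gradients of the quadratic 0.5*||x||^2
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
print(np.linalg.norm(adagrad_norm(noisy_grad, np.ones(5))))
```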

SHRIMP: Sparser Random Feature Models via Iterative Magnitude Pruning

1 code implementation7 Dec 2021 Yuege Xie, Bobby Shi, Hayden Schaeffer, Rachel Ward

Inspired by the success of the iterative magnitude pruning technique in finding lottery tickets of neural networks, we propose a new method -- Sparser Random Feature Models via IMP (ShRIMP) -- to efficiently fit high-dimensional data with inherent low-dimensional structure in the form of sparse variable dependencies.

Additive models Computational Efficiency +1
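The SHRIMP entry above combines random feature regression with iterative magnitude pruning; the sketch below is a rough, simplified version of that loop (fit, prune the smallest-magnitude coefficients, refit), with the feature map, pruning fraction, ridge parameter, and round count chosen purely for illustration.

```python
import numpy as np

def shrimp_style_fit(X, y, n_features=300, prune_frac=0.5, rounds=4, lam=1e-3, seed=0):
    """Random feature ridge regression with iterative magnitude pruning.

    A simplified sketch in the spirit of SHRIMP; the feature distribution and
    all hyperparameters here are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_features))        # random feature weights
    b = rng.uniform(0, 2 * np.pi, n_features)
    features = lambda Z: np.cos(Z @ W + b)          # random Fourier-style features
    active = np.arange(n_features)
    for _ in range(rounds):
        Phi = features(X)[:, active]
        # ridge solve for the coefficients on the currently active features
        c = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(active)), Phi.T @ y)
        keep = np.argsort(np.abs(c))[int(prune_frac * len(active)):]
        active = active[np.sort(keep)]              # drop small-magnitude features
    Phi = features(X)[:, active]
    c = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(active)), Phi.T @ y)
    return active, c

# toy usage: target depends on only two of ten coordinates
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 10))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2
active, c = shrimp_style_fit(X, y)
print(len(active))
```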

Theoretical Analysis of Consistency Regularization with Limited Augmented Data

no code implementations29 Sep 2021 Shuo Yang, Yijun Dong, Rachel Ward, Inderjit S Dhillon, Sujay Sanghavi, Qi Lei

Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data.

Data Augmentation Generalization Bounds +1

Learning to Forecast Dynamical Systems from Streaming Data

1 code implementation20 Sep 2021 Dimitris Giannakis, Amelia Henriksen, Joel A. Tropp, Rachel Ward

This algorithm dramatically reduces the costs of training and prediction without sacrificing forecasting skill.

regression Time Series +1

AdaLoss: A computationally-efficient and provably convergent adaptive gradient method

no code implementations17 Sep 2021 Xiaoxia Wu, Yuege Xie, Simon Du, Rachel Ward

We propose a computationally-friendly adaptive learning rate schedule, "AdaLoss", which directly uses the information of the loss function to adjust the stepsize in gradient descent methods.

regression
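The AdaLoss abstract above says the step size is adjusted using information from the loss itself; the sketch below is one illustrative reading of that idea, accumulating observed loss values into the step-size denominator in place of AdaGrad-Norm's squared gradient norms. The exact accumulation rule and constants are assumptions, not the paper's specification.

```python
import numpy as np

def adaloss(grad_and_loss, x0, b0=0.1, eta=1.0, iters=1000):
    """An AdaLoss-style schedule: grow the step-size denominator with observed
    loss values rather than squared gradient norms (illustrative reading)."""
    x = np.asarray(x0, dtype=float)
    b2 = b0**2
    for _ in range(iters):
        g, l = grad_and_loss(x)
        b2 += l                           # step size shrinks as loss accumulates
        x = x - (eta / np.sqrt(b2)) * g
    return x

# toy usage: least squares 0.5*mean((Ax - y)^2)
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
x_star = rng.standard_normal(10)
y = A @ x_star
gl = lambda x: (A.T @ (A @ x - y) / 100, 0.5 * np.mean((A @ x - y) ** 2))
print(np.linalg.norm(adaloss(gl, np.zeros(10)) - x_star))
```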

Bootstrapping the error of Oja's algorithm

no code implementations NeurIPS 2021 Robert Lunde, Purnamrita Sarkar, Rachel Ward

We consider the problem of quantifying uncertainty for the estimation error of the leading eigenvector from Oja's algorithm for streaming principal component analysis, where the data are generated IID from some unknown distribution.

Generalization Bounds for Sparse Random Feature Expansions

2 code implementations4 Mar 2021 Abolfazl Hashemi, Hayden Schaeffer, Robert Shi, Ufuk Topcu, Giang Tran, Rachel Ward

In particular, we provide generalization bounds for functions in a certain class (that is dense in a reproducing kernel Hilbert space) depending on the number of samples and the distribution of features.

BIG-bench Machine Learning Compressive Sensing +1

Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates

no code implementations6 Feb 2021 De Huang, Jonathan Niles-Weed, Rachel Ward

We analyze Oja's algorithm for streaming $k$-PCA and prove that it achieves performance nearly matching that of an optimal offline algorithm.
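A minimal sketch of Oja's algorithm with rank-$k$ updates for streaming $k$-PCA; the constant step size and re-orthonormalization via QR are illustrative implementation choices rather than the schedule analyzed in the paper.

```python
import numpy as np

def oja_k_pca(stream, d, k, eta=0.01):
    """Oja's algorithm for streaming k-PCA: Q <- QR(Q + eta * x x^T Q),
    applied one sample at a time."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal start
    for x in stream:
        Q = Q + eta * np.outer(x, x @ Q)              # stochastic power-iteration step
        Q, _ = np.linalg.qr(Q)                        # re-orthonormalize
    return Q

# toy usage: data with a dominant 2-dimensional subspace
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((20, 2)))
stream = (U @ (3 * rng.standard_normal(2)) + 0.1 * rng.standard_normal(20)
          for _ in range(5000))
Q = oja_k_pca(stream, d=20, k=2)
print(np.linalg.norm(U.T @ Q))  # close to sqrt(2) when the subspace is recovered
```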

Overparameterization and generalization error: weighted trigonometric interpolation

no code implementations15 Jun 2020 Yuege Xie, Hung-Hsu Chou, Holger Rauhut, Rachel Ward

Motivated by surprisingly good generalization properties of learned deep neural networks in overparameterized scenarios and by the related double descent phenomenon, this paper analyzes the relation between smoothness and low generalization error in an overparameterized linear learning problem.

Linear Convergence of Adaptive Stochastic Gradient Descent

no code implementations28 Aug 2019 Yuege Xie, Xiaoxia Wu, Rachel Ward

We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak-Łojasiewicz (PL) inequality.

Bias of Homotopic Gradient Descent for the Hinge Loss

no code implementations26 Jul 2019 Denali Molitor, Deanna Needell, Rachel Ward

Gradient descent is a simple and widely used optimization method for machine learning.

BIG-bench Machine Learning

AdaOja: Adaptive Learning Rates for Streaming PCA

1 code implementation28 May 2019 Amelia Henriksen, Rachel Ward

We also show that AdaOja performs comparably to state-of-the-art algorithms (History PCA and Streaming Power Method) in the same streaming PCA setting.

Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network

no code implementations19 Feb 2019 Xiaoxia Wu, Simon S. Du, Rachel Ward

Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks.

Recovery guarantees for polynomial approximation from dependent data with outliers

no code implementations25 Nov 2018 Lam Si Tung Ho, Hayden Schaeffer, Giang Tran, Rachel Ward

In this work, we study the problem of learning nonlinear functions from corrupted and dependent data.

AdaGrad stepsizes: Sharp convergence over nonconvex landscapes

1 code implementation5 Jun 2018 Rachel Ward, Xiaoxia Wu, Leon Bottou

Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune the stepsize schedule.

Stochastic Optimization

WNGrad: Learn the Learning Rate in Gradient Descent

no code implementations7 Mar 2018 Xiaoxia Wu, Rachel Ward, Léon Bottou

Adjusting the learning rate schedule in stochastic gradient methods is an important unresolved problem which requires tuning in practice.

A polynomial-time relaxation of the Gromov-Hausdorff distance

no code implementations17 Oct 2016 Soledad Villar, Afonso S. Bandeira, Andrew J. Blumberg, Rachel Ward

The Gromov-Hausdorff distance provides a metric on the set of isometry classes of compact metric spaces.

Clustering subgaussian mixtures by semidefinite programming

no code implementations22 Feb 2016 Dustin G. Mixon, Soledad Villar, Rachel Ward

We introduce a model-free relax-and-round algorithm for k-means clustering based on a semidefinite relaxation due to Peng and Wei.

Clustering

The local convexity of solving systems of quadratic equations

no code implementations25 Jun 2015 Chris D. White, Sujay Sanghavi, Rachel Ward

This paper considers the recovery of a rank $r$ positive semidefinite matrix $X X^T\in\mathbb{R}^{n\times n}$ from $m$ scalar measurements of the form $y_i := a_i^T X X^T a_i$ (i.e., quadratic measurements of $X$).

Quantum State Tomography

Relax, no need to round: integrality of clustering formulations

no code implementations18 Aug 2014 Pranjal Awasthi, Afonso S. Bandeira, Moses Charikar, Ravishankar Krishnaswamy, Soledad Villar, Rachel Ward

Under the same distributional model, the $k$-means LP relaxation fails to recover such clusters at separation as large as $\Delta = 4$.

Clustering

One-bit compressive sensing with norm estimation

no code implementations28 Apr 2014 Karin Knudson, Rayan Saab, Rachel Ward

Consider the recovery of an unknown signal $x$ from quantized linear measurements.

Compressive Sensing
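As a point of reference for the one-bit setting, the sketch below implements a simple baseline that estimates only the direction of the signal by backprojecting the sign measurements and hard-thresholding; the paper's actual contribution, recovering the norm as well, relies on modified measurement schemes not reproduced here.

```python
import numpy as np

def one_bit_direction_estimate(A, y, s):
    """Simple one-bit compressive sensing baseline: correlate the sign
    measurements with the columns of A, keep the s largest entries.

    Sign measurements alone carry no scale information, so only the
    direction of x is estimated here.
    """
    z = A.T @ y / len(y)                  # backprojection of the sign measurements
    x_hat = np.zeros_like(z)
    support = np.argsort(np.abs(z))[-s:]  # hard-threshold to the s largest entries
    x_hat[support] = z[support]
    return x_hat / np.linalg.norm(x_hat)

# toy usage: sparse unit-norm signal, y_i = sign(<a_i, x>)
rng = np.random.default_rng(0)
d, m, s = 200, 1000, 5
x = np.zeros(d)
x[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
x /= np.linalg.norm(x)
A = rng.standard_normal((m, d))
y = np.sign(A @ x)
print(np.dot(one_bit_direction_estimate(A, y, s), x))  # correlation close to 1
```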

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

no code implementations NeurIPS 2014 Deanna Needell, Nathan Srebro, Rachel Ward

Furthermore, we show how reweighting the sampling distribution (i.e., importance sampling) is necessary in order to further improve convergence, and we obtain a linear dependence on the average smoothness, dominating previous results.
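A minimal sketch of SGD for least squares with rows sampled proportionally to their squared norms (the randomized Kaczmarz weighting) and the gradient reweighted by the inverse sampling probability so it stays unbiased; the constant step size is an illustrative choice.

```python
import numpy as np

def sgd_importance_sampling(A, b, iters=5000, seed=0):
    """SGD on (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 with rows sampled
    proportionally to ||a_i||^2 and gradients reweighted to stay unbiased."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    row_sq = np.sum(A**2, axis=1)
    p = row_sq / row_sq.sum()                  # importance sampling distribution
    L_bar = row_sq.mean()                      # average per-row smoothness
    x = np.zeros(d)
    for _ in range(iters):
        i = rng.choice(n, p=p)
        g = (A[i] @ x - b[i]) * A[i]           # gradient of 0.5*(a_i^T x - b_i)^2
        x = x - (1.0 / (2 * L_bar)) * g / (n * p[i])  # unbiased reweighted step
    return x

# toy usage: consistent system with very uneven row norms
rng = np.random.default_rng(1)
A = rng.standard_normal((300, 10)) * rng.uniform(0.1, 10, (300, 1))
x_true = rng.standard_normal(10)
print(np.linalg.norm(sgd_importance_sampling(A, A @ x_true) - x_true))
```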

Recovery guarantees for exemplar-based clustering

no code implementations12 Sep 2013 Abhinav Nellore, Rachel Ward

For a certain class of distributions, we prove that the linear programming relaxation of $k$-medoids clustering (a variant of $k$-means clustering where means are replaced by exemplars from within the dataset) distinguishes points drawn from nonoverlapping balls with high probability once the number of points drawn and the separation distance between any two balls are sufficiently large.

Clustering

Completing Any Low-rank Matrix, Provably

no code implementations12 Jun 2013 Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, Rachel Ward

Matrix completion, i.e., the exact and provable recovery of a low-rank matrix from a small subset of its elements, is currently only known to be possible if the matrix satisfies a restrictive structural constraint, known as incoherence, on its row and column spaces.

Matrix Completion

Near-optimal compressed sensing guarantees for total variation minimization

no code implementations11 Oct 2012 Deanna Needell, Rachel Ward

Consider the problem of reconstructing a multidimensional signal from an underdetermined set of measurements, as in the setting of compressed sensing.

Stable and robust sampling strategies for compressive imaging

no code implementations8 Oct 2012 Felix Krahmer, Rachel Ward

For Fourier measurements and Haar wavelet sparsity, the local coherence can be controlled and bounded explicitly, so for matrices comprised of frequencies sampled from a suitable inverse square power-law density, we can prove the restricted isometry property with near-optimal embedding dimensions.

Compressive Sensing Image Reconstruction
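To illustrate the kind of variable-density sampling discussed above, the sketch below draws one-dimensional Fourier frequencies from an inverse-square power-law density; the normalization and the handling of the zero frequency are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def power_law_frequency_sample(n, m, seed=0):
    """Draw m frequencies in {-n/2, ..., n/2 - 1} from a density proportional
    to 1/max(|k|, 1)^2, i.e. an inverse-square power law capped at k = 0."""
    rng = np.random.default_rng(seed)
    freqs = np.arange(-n // 2, n // 2)
    density = 1.0 / np.maximum(np.abs(freqs), 1) ** 2
    density /= density.sum()
    return rng.choice(freqs, size=m, replace=True, p=density)

# toy usage: low frequencies are sampled far more often than high ones
samples = power_law_frequency_sample(n=256, m=2000)
print(np.mean(np.abs(samples) <= 16))  # most samples concentrate near zero frequency
```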
