no code implementations • 11 Apr 2024 • Yunxiang Li, Rui Yuan, Chen Fan, Mark Schmidt, Samuel Horváth, Robert M. Gower, Martin Takáč
Policy gradient is a widely utilized and foundational algorithm in the field of reinforcement learning (RL).
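As a refresher on the basic method, here is a minimal REINFORCE-style policy gradient sketch; the toy bandit environment and the step size are illustrative placeholders, not taken from the paper:

```python
import numpy as np

# Minimal REINFORCE on a 3-armed bandit with a softmax policy.
# The environment, rewards, and step size are illustrative choices.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # expected reward of each arm
theta = np.zeros(3)                       # policy parameters (logits)
step_size = 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for t in range(2000):
    probs = softmax(theta)
    action = rng.choice(3, p=probs)
    reward = rng.normal(true_means[action], 0.1)
    # REINFORCE estimate: grad of log pi(action) times the observed reward.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += step_size * reward * grad_log_pi

print("learned action probabilities:", softmax(theta))
```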
no code implementations • 3 Apr 2024 • Aaron Mishkin, Mert Pilanci, Mark Schmidt
This improvement is comparable to a square-root of the condition number in the worst case and addresses criticism that guarantees for stochastic acceleration could be worse than those for SGD.
no code implementations • 29 Feb 2024 • Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficulties in the optimization dynamics.
no code implementations • 3 Jul 2023 • Amrutha Varshini Ramesh, Aaron Mishkin, Mark Schmidt, Yihan Zhou, Jonathan Wilder Lavington, Jennifer She
We show that bound- and summation-constrained steepest descent in the L1-norm guarantees more progress per iteration than previous rules and can be computed in only $O(n \log n)$ time.
1 code implementation • 27 Apr 2023 • Frederik Kunstner, Jacques Chen, Jonathan Wilder Lavington, Mark Schmidt
This suggests that Adam outperforms SGD because it uses a more robust gradient estimate.
no code implementations • 2 Apr 2023 • Chen Fan, Christos Thrampoulidis, Mark Schmidt
Modern machine learning models are often over-parameterized and as a result they can interpolate the training data.
1 code implementation • 20 Feb 2023 • Wu Lin, Valentin Duruisseaux, Melvin Leok, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt
Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations.
1 code implementation • 6 Feb 2023 • Jonathan Wilder Lavington, Sharan Vaswani, Reza Babanezhad, Mark Schmidt, Nicolas Le Roux
Our target optimization framework uses the (expensive) gradient computation to construct surrogate functions in a \emph{target space} (e.g., the logits output by a linear model for classification) that can be minimized efficiently.
1 code implementation • 29 Jul 2022 • Jonathan Wilder Lavington, Sharan Vaswani, Mark Schmidt
Specifically, if the class of policies is sufficiently expressive to contain the expert policy, we prove that DAgger achieves constant regret.
no code implementations • 22 Jul 2021 • Wu Lin, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt
In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces.
no code implementations • 18 Feb 2021 • Benjamin Dubois-Taine, Sharan Vaswani, Reza Babanezhad, Mark Schmidt, Simon Lacoste-Julien
Variance reduction (VR) methods for finite-sum minimization typically require the knowledge of problem-dependent constants that are often unknown and difficult to estimate.
no code implementations • 15 Feb 2021 • Wu Lin, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt
Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations.
no code implementations • 11 Jan 2021 • Sedigheh Zolaktaf, Frits Dannenberg, Mark Schmidt, Anne Condon, Erik Winfree
We then compare the performance of pathway elaboration with the stochastic simulation algorithm (SSA) for MFPT estimation on 237 of the reactions for which SSA is feasible.
1 code implementation • 31 Dec 2020 • Andrew Warrington, J. Wilder Lavington, Adam Ścibior, Mark Schmidt, Frank Wood
Policies for partially observed Markov decision processes can be efficiently learned by imitating policies for the corresponding fully observed Markov decision processes.
no code implementations • 2 Nov 2020 • Frederik Kunstner, Raunak Kumar, Mark Schmidt
In this work we first show that for the common setting of exponential family distributions, viewing EM as a mirror descent algorithm leads to convergence rates in Kullback-Leibler (KL) divergence.
no code implementations • NeurIPS 2020 • Yihan Zhou, Victor S. Portella, Mark Schmidt, Nicholas J. A. Harvey
We extend the known regret bounds for classical OCO algorithms to the relative setting.
no code implementations • 2 Oct 2020 • Robert M. Gower, Mark Schmidt, Francis Bach, Peter Richtarik
Stochastic optimization lies at the heart of machine learning, and its cornerstone is stochastic gradient descent (SGD), a method introduced over 60 years ago.
no code implementations • 28 Sep 2020 • Sharan Vaswani, Issam H. Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien
Under an interpolation assumption, we prove that AMSGrad with a constant step-size and momentum can converge to the minimizer at the faster $O(1/T)$ rate for smooth, convex functions.
1 code implementation • 11 Jun 2020 • Sharan Vaswani, Issam Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien
In this setting, we prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate.
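A rough statement of the setting, in notation of my choosing rather than quoted from the paper: interpolation requires every component loss to be stationary at the overall minimizer, and under this assumption the stated rate holds for smooth convex objectives.

```latex
\[
f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad
\text{interpolation: } \nabla f_i(x^\star) = 0 \ \ \forall i
\ \text{ at } x^\star \in \arg\min_x f(x),
\qquad
f(\bar{x}_T) - f(x^\star) = O(1/T).
\]
```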
1 code implementation • ICML 2020 • Wu Lin, Mark Schmidt, Mohammad Emtiyaz Khan
The Bayesian learning rule is a natural-gradient variational inference method, which not only contains many existing learning algorithms as special cases but also enables the design of new algorithms.
1 code implementation • 29 Oct 2019 • Wu Lin, Mohammad Emtiyaz Khan, Mark Schmidt
Our generalization enables us to establish a connection between Stein's lemma and the reparameterization trick to derive gradients of expectations of a large class of functions under weak assumptions.
1 code implementation • 11 Oct 2019 • Si Yi Meng, Sharan Vaswani, Issam Laradji, Mark Schmidt, Simon Lacoste-Julien
Under this condition, we show that the regularized subsampled Newton method (R-SSN) achieves global linear convergence with an adaptive step-size and a constant batch-size.
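A minimal sketch of one regularized subsampled-Newton iteration on logistic regression; the damping, batch size, and step size below are illustrative placeholders rather than the adaptive choices analyzed in the paper:

```python
import numpy as np

# Regularized subsampled Newton: full-batch gradient, Hessian from a subsample,
# damped Newton direction. Constants here are illustrative, not from the paper.
rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.5 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)
damping, batch_size, step_size = 1e-3, 128, 1.0

for k in range(50):
    # Gradient of the average logistic loss over the full dataset.
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / n
    # Hessian estimated on a random subsample only.
    idx = rng.choice(n, size=batch_size, replace=False)
    Xs = X[idx]
    ps = sigmoid(Xs @ w)
    H = Xs.T @ (Xs * (ps * (1 - ps))[:, None]) / batch_size
    # Damped Newton direction: solve (H + damping * I) direction = grad.
    direction = np.linalg.solve(H + damping * np.eye(d), grad)
    w -= step_size * direction

p = sigmoid(X @ w)
loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
print("final training loss:", float(loss))
```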
1 code implementation • 8 Jul 2019 • Frederik Hauser, Mark Schmidt, Michael Menth
If the user is permitted to use the RAC on a managed host, launching the RAC is authorized and access to protected network resources may be given, e.g., to internal networks, servers, or the Internet.
Networking and Internet Architecture · Cryptography and Security
1 code implementation • 2 Jul 2019 • Issam H. Laradji, David Vazquez, Mark Schmidt
A major obstacle in instance segmentation is that existing methods often need many per-pixel labels in order to be effective.
Ranked #7 on Image-level Supervised Instance Segmentation on PASCAL VOC 2012 val (using extra training data)
Image-level Supervised Instance Segmentation · Semantic Segmentation
no code implementations • 14 Jun 2019 • Issam H. Laradji, Negar Rostamzadeh, Pedro O. Pinheiro, David Vazquez, Mark Schmidt
Instance segmentation methods often require costly per-pixel labels.
1 code implementation • 7 Jun 2019 • Wu Lin, Mohammad Emtiyaz Khan, Mark Schmidt
Natural-gradient methods enable fast and simple algorithms for variational inference, but due to computational difficulties, their use is mostly limited to \emph{minimal} exponential-family (EF) approximations.
1 code implementation • NeurIPS 2019 • Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, Simon Lacoste-Julien
To improve the proposed methods' practical performance, we give heuristics to use larger step-sizes and acceleration.
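The core proposed method is SGD with a stochastic Armijo line-search evaluated on the sampled mini-batch. A minimal sketch on least squares follows; the backtracking constants are common defaults, not necessarily those from the paper's experiments:

```python
import numpy as np

# SGD with a stochastic Armijo line-search: on each mini-batch, backtrack the
# step size until sufficient decrease holds on that same mini-batch.
rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

def batch_loss_grad(w, idx):
    r = X[idx] @ w - y[idx]
    return 0.5 * np.mean(r ** 2), X[idx].T @ r / len(idx)

w = np.zeros(d)
eta_max, c, beta = 10.0, 0.5, 0.7   # initial step, sufficient-decrease constant, backtracking factor

for t in range(200):
    idx = rng.choice(n, size=32, replace=False)
    loss, grad = batch_loss_grad(w, idx)
    eta = eta_max
    # Backtrack until the Armijo condition holds on the sampled mini-batch.
    while batch_loss_grad(w - eta * grad, idx)[0] > loss - c * eta * (grad @ grad):
        eta *= beta
    w -= eta * grad

print("distance to w_true:", np.linalg.norm(w - w_true))
```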
1 code implementation • 16 May 2019 • Issam H. Laradji, Mark Schmidt, Vladimir Pavlovic, Minyoung Kim
The key advantage is that the combination of GP and DRF leads to a tractable model that can both handle a variable-sized input as well as learn deep long-range dependency structures of the data.
no code implementations • 20 Mar 2019 • Mehrdad Ghadiri, Mark Schmidt
In this paper, we consider this problem as an optimization problem that seeks to maximize the sum of a sum-sum diversity function and a non-negative monotone submodular function.
2 code implementations • NeurIPS 2018 • Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, Mohammad Emtiyaz Khan
Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution.
no code implementations • 16 Oct 2018 • Sharan Vaswani, Francis Bach, Mark Schmidt
Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.
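The condition in question is the strong growth condition, which can be written (in notation of my choosing) as:

```latex
\[
\mathbb{E}_i\!\left[\|\nabla f_i(x)\|^2\right] \;\le\; \rho \,\|\nabla f(x)\|^2
\qquad \text{for all } x,
\]
```

which in particular implies $\nabla f_i(x^\star) = 0$ for every $i$ at any stationary point $x^\star$ of $f$, i.e. the model interpolates the data.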
no code implementations • 10 Oct 2018 • Mohamed Osama Ahmed, Sharan Vaswani, Mark Schmidt
Indeed, in a particular setting, we prove that using the Lipschitz information yields the same or a better bound on the regret compared to using Bayesian optimization on its own.
3 code implementations • 13 Sep 2018 • Alireza Shafaei, Mark Schmidt, James J. Little
What makes this problem different from a typical supervised learning setting is that the distribution of outliers used in training may not be the same as the distribution of outliers encountered in the application.
3 code implementations • ECCV 2018 • Issam H. Laradji, Negar Rostamzadeh, Pedro O. Pinheiro, David Vazquez, Mark Schmidt
However, we propose a detection-based method that does not need to estimate the size and shape of the objects and that outperforms regression-based methods.
Ranked #1 on Object Counting on Pascal VOC 2007 count-test
no code implementations • 24 May 2018 • Sharan Vaswani, Branislav Kveton, Zheng Wen, Anup Rao, Mark Schmidt, Yasin Abbasi-Yadkori
We investigate the use of bootstrapping in the bandit setting.
1 code implementation • 23 Dec 2017 • Julie Nutini, Issam Laradji, Mark Schmidt
Block coordinate descent (BCD) methods are widely used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, amenability to parallelization, and ability to exploit problem structure.
Optimization and Control · 90C06
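For concreteness, a minimal randomized block coordinate descent sketch on ridge regression; the block size, per-block step sizes, and problem are illustrative choices, not taken from the paper:

```python
import numpy as np

# Randomized block coordinate descent on ridge regression: each iteration
# updates one block of coordinates using the gradient restricted to that block.
rng = np.random.default_rng(0)
n, d, block_size, lam = 400, 60, 10, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

w = np.zeros(d)
blocks = [np.arange(i, i + block_size) for i in range(0, d, block_size)]
# A safe per-block step size: inverse of a Lipschitz bound for that block.
block_L = [np.linalg.norm(X[:, b], 2) ** 2 / n + lam for b in blocks]

for t in range(500):
    j = rng.integers(len(blocks))
    b = blocks[j]
    grad_b = X[:, b].T @ (X @ w - y) / n + lam * w[b]   # gradient on block b only
    w[b] -= grad_b / block_L[j]

print("objective:", 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w))
```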
3 code implementations • ICLR 2018 • Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, Frank Wood
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice.
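The method is hypergradient descent: the learning rate is itself updated online using the inner product of consecutive gradients. A minimal sketch of the SGD variant on a toy quadratic (the problem and the constants are illustrative, not from the paper's experiments):

```python
import numpy as np

# Hypergradient-descent SGD on a toy quadratic: the learning rate alpha is
# adapted using the dot product of the current and previous gradients.
rng = np.random.default_rng(0)
d = 50
A = rng.normal(size=(d, d))
A = A.T @ A / d + np.eye(d)          # positive-definite quadratic
b = rng.normal(size=d)

def grad(x):
    return A @ x - b

x = np.zeros(d)
alpha, beta = 1e-3, 1e-5             # initial learning rate and hyper step size
g_prev = grad(x)

for t in range(5000):
    g = grad(x)
    alpha += beta * (g @ g_prev)     # hypergradient update of the learning rate
    x -= alpha * g
    g_prev = g

print("final alpha:", alpha, "gradient norm:", np.linalg.norm(grad(x)))
```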
no code implementations • 7 Mar 2017 • Sharan Vaswani, Mark Schmidt, Laks. V. S. Lakshmanan
The gang of bandits (GOB) model \cite{cesa2013gang} is a recent contextual bandits framework that shares information between a set of bandit problems, related by a known (possibly noisy) graph.
no code implementations • ICML 2017 • Sharan Vaswani, Branislav Kveton, Zheng Wen, Mohammad Ghavamzadeh, Laks Lakshmanan, Mark Schmidt
We consider influence maximization (IM) in social networks, which is the problem of maximizing the number of users that become aware of a product by selecting a set of "seed" users to expose the product to.
6 code implementations • 13 Dec 2016 • Tian Qi Chen, Mark Schmidt
This results in a procedure for artistic style transfer that is efficient but also allows arbitrary content and style images.
no code implementations • 16 Aug 2016 • Hamed Karimi, Julie Nutini, Mark Schmidt
In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent.
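The condition, now usually called the Polyak–Łojasiewicz inequality, states that for some $\mu > 0$:

```latex
\[
\tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^\star\bigr)
\qquad \text{for all } x,
\]
```

and together with $L$-smoothness it yields the linear rate $f(x_k) - f^\star \le (1 - \mu/L)^k\,(f(x_0) - f^\star)$ for gradient descent with step size $1/L$.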
no code implementations • 5 Aug 2016 • Alireza Shafaei, James J. Little, Mark Schmidt
We present experiments assessing the effectiveness on real-world data of systems trained on synthetic RGB images that are extracted from a video game.
no code implementations • NeurIPS 2015 • Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, Scott Sallinen
We present and analyze several strategies for improving the performance of stochastic variance-reduced gradient (SVRG) methods.
no code implementations • 5 Nov 2015 • Reza Babanezhad, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, Scott Sallinen
We present and analyze several strategies for improving the performance of stochastic variance-reduced gradient (SVRG) methods.
no code implementations • 31 Oct 2015 • Mohammad Emtiyaz Khan, Reza Babanezhad, Wu Lin, Mark Schmidt, Masashi Sugiyama
We also give a convergence-rate analysis of our method and many other previous methods which exploit the geometry of the space.
no code implementations • 1 Jun 2015 • Julie Nutini, Mark Schmidt, Issam H. Laradji, Michael Friedlander, Hoyt Koepke
There has been significant recent work on the theory and application of randomized coordinate descent algorithms, beginning with the work of Nesterov [SIAM J. Optim., 2012].
no code implementations • 16 Apr 2015 • Mark Schmidt, Reza Babanezhad, Mohamed Osama Ahmed, Aaron Defazio, Ann Clifton, Anoop Sarkar
We apply stochastic average gradient (SAG) algorithms for training conditional random fields (CRFs).
no code implementations • 27 Feb 2015 • Sharan Vaswani, Laks. V. S. Lakshmanan, Mark Schmidt
We consider \emph{influence maximization}: the problem of maximizing the number of people who become aware of a product by finding the `best' set of `seed' users to expose the product to.
no code implementations • 6 Feb 2015 • Guang-Tong Zhou, Sung Ju Hwang, Mark Schmidt, Leonid Sigal, Greg Mori
We present a hierarchical maximum-margin clustering method for unsupervised data analysis.
no code implementations • 4 Nov 2014 • Volkan Cevher, Stephen Becker, Mark Schmidt
This article reviews recent advances in convex optimization algorithms for Big Data, which aim to reduce the computational, storage, and communications bottlenecks.
2 code implementations • 10 Sep 2013 • Mark Schmidt, Nicolas Le Roux, Francis Bach
Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations.
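A minimal stochastic average gradient (SAG) sketch on least squares; the step size and zero initialization of the gradient table are simple practical choices, not the constants from the analysis:

```python
import numpy as np

# SAG: keep the most recent gradient of every example and take a step along
# the average of the stored gradients, updating the average in O(d) per step.
rng = np.random.default_rng(0)
n, d = 300, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

w = np.zeros(d)
stored = np.zeros((n, d))      # table of per-example gradients
avg = np.zeros(d)              # running average of the table
step_size = 0.1 / np.max(np.sum(X ** 2, axis=1))

for t in range(20 * n):
    i = rng.integers(n)
    g_new = (X[i] @ w - y[i]) * X[i]        # gradient of example i at current w
    avg += (g_new - stored[i]) / n          # update the stored average
    stored[i] = g_new
    w -= step_size * avg

print("training loss:", 0.5 * np.mean((X @ w - y) ** 2))
```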
no code implementations • NeurIPS 2012 • Nicolas L. Roux, Mark Schmidt, Francis R. Bach
We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex.
no code implementations • NeurIPS 2011 • Mark Schmidt, Nicolas L. Roux, Francis R. Bach
We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the second term.
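For reference, the inexact proximal-gradient iteration studied here can be written (notation mine) as:

```latex
\[
x_{k+1} \;=\; \operatorname{prox}_{\alpha g}\!\bigl(x_k - \alpha\,(\nabla f(x_k) + e_k)\bigr),
\qquad
\operatorname{prox}_{\alpha g}(z) \;=\; \arg\min_{x}\;\tfrac{1}{2}\|x - z\|^2 + \alpha\, g(x),
\]
```

where $e_k$ is the error in the gradient of the smooth term $f$, and the proximity operator of the non-smooth term $g$ may itself be computed only approximately.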
no code implementations • NeurIPS 2008 • Peter Carbonetto, Mark Schmidt, Nando D. Freitas
The stochastic approximation method is behind the solution to many important, actively-studied problems in machine learning.