Search Results for author: Konstantin Mishchenko

Found 30 papers, 12 papers with code

When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement

1 code implementation • 11 Oct 2023 • Aaron Defazio, Ashok Cutkosky, Harsh Mehta, Konstantin Mishchenko

To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task.

Scheduling

Paper
Code

Adaptive Proximal Gradient Method for Convex Optimization

1 code implementation • 4 Aug 2023 • Yura Malitsky, Konstantin Mishchenko

In this paper, we explore two fundamental first-order algorithms in convex optimization, namely, gradient descent (GD) and proximal gradient method (ProxGD).

Paper
Code

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

1 code implementation • 9 Jun 2023 • Konstantin Mishchenko, Aaron Defazio

We consider the problem of estimating the learning rate in adaptive methods, such as AdaGrad and Adam.

256

Paper
Code

Partially Personalized Federated Learning: Breaking the Curse of Data Heterogeneity

no code implementations • 29 May 2023 • Konstantin Mishchenko, Rustem Islamov, Eduard Gorbunov, Samuel Horváth

We present a partially personalized formulation of Federated Learning (FL) that strikes a balance between the flexibility of personalization and cooperativeness of global training.

Personalized Federated Learning

Paper
Add Code

DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method

1 code implementation • NeurIPS 2023 • Ahmed Khaled, Konstantin Mishchenko, Chi Jin

This paper proposes a new easy-to-implement parameter-free gradient-based optimizer: DoWG (Distance over Weighted Gradients).

Paper
Code

Two Losses Are Better Than One: Faster Optimization Using a Cheaper Proxy

no code implementations • 7 Feb 2023 • Blake Woodworth, Konstantin Mishchenko, Francis Bach

We present an algorithm for minimizing an objective with hard-to-compute gradients by using a related, easier-to-access function as a proxy.

Paper
Add Code

Learning-Rate-Free Learning by D-Adaptation

1 code implementation • 18 Jan 2023 • Aaron Defazio, Konstantin Mishchenko

D-Adaptation is an approach to automatically setting the learning rate which asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions, with no back-tracking or line searches, and no additional function value or gradient evaluations per step.

485

Paper
Code

Convergence of First-Order Algorithms for Meta-Learning with Moreau Envelopes

no code implementations • 17 Jan 2023 • Konstantin Mishchenko, Slavomír Hanzely, Peter Richtárik

As a special case, our theory allows us to show the convergence of First-Order Model-Agnostic Meta-Learning (FO-MAML) to the vicinity of a solution of Moreau objective.

Meta-Learning Personalized Federated Learning

Paper
Add Code

Super-Universal Regularized Newton Method

no code implementations • 11 Aug 2022 • Nikita Doikov, Konstantin Mishchenko, Yurii Nesterov

We analyze the performance of a variant of Newton method with quadratic regularization for solving composite convex minimization problems.

Paper
Add Code

Adaptive Learning Rates for Faster Stochastic Gradient Methods

no code implementations • 10 Aug 2022 • Samuel Horváth, Konstantin Mishchenko, Peter Richtárik

In this work, we propose new adaptive step size strategies that improve several stochastic gradient methods.

Stochastic Optimization

Paper
Add Code

Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays

1 code implementation • 15 Jun 2022 • Konstantin Mishchenko, Francis Bach, Mathieu Even, Blake Woodworth

The existing analysis of asynchronous stochastic gradient descent (SGD) degrades dramatically when any delay is large, giving the impression that performance depends primarily on the delay.

Paper
Code

ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!

no code implementations • 18 Feb 2022 • Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, Peter Richtárik

The canonical approach to solving such problems is via the proximal gradient descent (ProxGD) algorithm, which is based on the evaluation of the gradient of $f$ and the prox operator of $\psi$ in each iteration.

Federated Learning

Paper
Add Code

Server-Side Stepsizes and Sampling Without Replacement Provably Help in Federated Optimization

no code implementations • 26 Jan 2022 • Grigory Malinovsky, Konstantin Mishchenko, Peter Richtárik

Together, our results on the advantage of large and small server-side stepsizes give a formal justification for the practice of adaptive server-side optimization in federated learning.

Federated Learning

Paper
Add Code

Regularized Newton Method with Global $O(1/k^2)$ Convergence

2 code implementations • 3 Dec 2021 • Konstantin Mishchenko

We present a Newton-type method that converges fast from any initialization and for arbitrary convex objectives with Lipschitz Hessians.

Paper
Code

IntSGD: Adaptive Floatless Compression of Stochastic Gradients

1 code implementation • ICLR 2022 • Konstantin Mishchenko, Bokun Wang, Dmitry Kovalev, Peter Richtárik

We propose a family of adaptive integer compression operators for distributed Stochastic Gradient Descent (SGD) that do not communicate a single float.

Paper
Code

Proximal and Federated Random Reshuffling

1 code implementation • NeurIPS 2021 • Konstantin Mishchenko, Ahmed Khaled, Peter Richtárik

Random Reshuffling (RR), also known as Stochastic Gradient Descent (SGD) without replacement, is a popular and theoretically grounded method for finite-sum minimization.

Paper
Code

Random Reshuffling: Simple Analysis with Vast Improvements

1 code implementation • NeurIPS 2020 • Konstantin Mishchenko, Ahmed Khaled, Peter Richtárik

from $\kappa$ to $\sqrt{\kappa}$) and, in addition, show that RR has a different type of variance.

Paper
Code

Dualize, Split, Randomize: Toward Fast Nonsmooth Optimization Algorithms

no code implementations • 3 Apr 2020 • Adil Salim, Laurent Condat, Konstantin Mishchenko, Peter Richtárik

We consider minimizing the sum of three convex functions, where the first one F is smooth, the second one is nonsmooth and proximable and the third one is the composition of a nonsmooth proximable function with a linear operator L. This template problem has many applications, for instance, in image processing and machine learning.

Paper
Add Code

Stochastic Newton and Cubic Newton Methods with Simple Local Linear-Quadratic Rates

1 code implementation • 3 Dec 2019 • Dmitry Kovalev, Konstantin Mishchenko, Peter Richtárik

We present two new remarkably simple stochastic second-order methods for minimizing the average of a very large number of sufficiently smooth and strongly convex functions.

Second-order methods

Paper
Code

Adaptive Gradient Descent without Descent

1 code implementation • ICML 2020 • Yura Malitsky, Konstantin Mishchenko

We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don't increase the stepsize too fast and 2) don't overstep the local curvature.

Paper
Code

Sinkhorn Algorithm as a Special Case of Stochastic Mirror Descent

no code implementations • 16 Sep 2019 • Konstantin Mishchenko

We present a new perspective on the celebrated Sinkhorn algorithm by showing that is a special case of incremental/stochastic mirror descent.

Paper
Add Code

Tighter Theory for Local SGD on Identical and Heterogeneous Data

no code implementations • 10 Sep 2019 • Ahmed Khaled, Konstantin Mishchenko, Peter Richtárik

We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous.

Paper
Add Code

First Analysis of Local GD on Heterogeneous Data

no code implementations • 10 Sep 2019 • Ahmed Khaled, Konstantin Mishchenko, Peter Richtárik

We provide the first convergence analysis of local gradient descent for minimizing the average of smooth and convex but otherwise arbitrary functions.

Federated Learning

Paper
Add Code

A Self-supervised Approach to Hierarchical Forecasting with Applications to Groupwise Synthetic Controls

no code implementations • 25 Jun 2019 • Konstantin Mishchenko, Mallory Montgomery, Federico Vaggi

When forecasting time series with a hierarchical structure, the existing state of the art is to forecast each time series independently, and, in a post-treatment step, to reconcile the time series in a way that respects the hierarchy (Hyndman et al., 2011; Wickramasuriya et al., 2018).

counterfactual Time Series +1

Paper
Add Code

Revisiting Stochastic Extragradient

no code implementations • 27 May 2019 • Konstantin Mishchenko, Dmitry Kovalev, Egor Shulgin, Peter Richtárik, Yura Malitsky

We fix a fundamental issue in the stochastic extragradient method by providing a new sampling strategy that is motivated by approximating implicit updates.

Paper
Add Code

99% of Distributed Optimization is a Waste of Time: The Issue and How to Fix it

no code implementations • 27 Jan 2019 • Konstantin Mishchenko, Filip Hanzely, Peter Richtárik

We propose a fix based on a new update-sparsification method we develop in this work, which we suggest be used on top of existing methods.

Distributed Optimization

Paper
Add Code

Distributed Learning with Compressed Gradient Differences

no code implementations • 26 Jan 2019 • Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, Peter Richtárik

Our analysis of block-quantization and differences between $\ell_2$ and $\ell_{\infty}$ quantization closes the gaps in theory and practice.

Distributed Computing Quantization

Paper
Add Code

SEGA: Variance Reduction via Gradient Sketching

no code implementations • NeurIPS 2018 • Filip Hanzely, Konstantin Mishchenko, Peter Richtarik

In each iteration, SEGA updates the current estimate of the gradient through a sketch-and-project operation using the information provided by the latest sketch, and this is subsequently used to compute an unbiased estimate of the true gradient through a random relaxation procedure.

Paper
Add Code

A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning

no code implementations • ICML 2018 • Konstantin Mishchenko, Franck Iutzeler, Jérôme Malick, Massih-Reza Amini

One of the main challenges is then to deal with heterogeneous machines and unreliable communications.

Paper
Add Code

A Distributed Flexible Delay-tolerant Proximal Gradient Algorithm

no code implementations • 25 Jun 2018 • Konstantin Mishchenko, Franck Iutzeler, Jérôme Malick

We develop and analyze an asynchronous algorithm for distributed convex optimization when the objective writes a sum of smooth functions, local to each worker, and a non-smooth function.

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.