Search Results for author: Matus Telgarsky

Found 33 papers, 4 papers with code

Spectrum Extraction and Clipping for Implicitly Linear Layers

1 code implementation • 25 Feb 2024 • Ali Ebrahimpour Boroojeny, Matus Telgarsky, Hari Sundaram

We show the effectiveness of automatic differentiation in efficiently and correctly computing and controlling the spectrum of implicitly linear operators, a rich family of layer types including all standard convolutional and dense layers.

Adversarial Robustness

Paper
Code

Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

no code implementations • 24 Feb 2024 • Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky, Bin Yu

We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates.

General Classification

Paper
Add Code

Transformers, parallel computation, and logarithmic depth

1 code implementation • 14 Feb 2024 • Clayton Sanford, Daniel Hsu, Matus Telgarsky

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation.

Paper
Code

On Achieving Optimal Adversarial Test Error

no code implementations • 13 Jun 2023 • Justin D. Li, Matus Telgarsky

We first elucidate various fundamental properties of optimal adversarial predictors: the structure of optimal adversarial convex predictors in terms of optimal adversarial zero-one predictors, bounds relating the adversarial convex loss to the adversarial zero-one loss, and the fact that continuous predictors can get arbitrarily close to the optimal adversarial error for both convex and zero-one losses.

Paper
Add Code

Feature selection with gradient descent on two-layer networks in low-rotation regimes

no code implementations • 4 Aug 2022 • Matus Telgarsky

This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), and making use of margins as the core analytic technique.

feature selection

Paper
Add Code

Convex Analysis at Infinity: An Introduction to Astral Space

no code implementations • 6 May 2022 • Miroslav Dudík, Robert E. Schapire, Matus Telgarsky

Not all convex functions on $\mathbb{R}^n$ have finite minimizers; some can only be minimized by a sequence as it heads to infinity.

Paper
Add Code

Stochastic linear optimization never overfits with quadratically-bounded losses on general data

no code implementations • 14 Feb 2022 • Matus Telgarsky

This work provides test error bounds for iterative fixed point methods on linear predictors -- specifically, stochastic and batch mirror descent (MD), and stochastic temporal difference learning (TD) -- with two core contributions: (a) a single proof technique which gives high probability guarantees despite the absence of projections, regularization, or any equivalents, even when optima have large or infinite norm, for quadratically-bounded losses (e. g., providing unified treatment of squared and logistic losses); (b) locally-adapted rates which depend not on global problem structure (such as condition numbers and maximum margins), but rather on properties of low norm predictors which may suffer some small excess test error.

Paper
Add Code

Actor-critic is implicitly biased towards high entropy optimal policies

no code implementations • ICLR 2022 • Yuzheng Hu, Ziwei Ji, Matus Telgarsky

We show that the simplest actor-critic method -- a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration -- does not merely find an optimal policy, but moreover prefers high entropy optimal policies.

Vocal Bursts Intensity Prediction

Paper
Add Code

Fast Margin Maximization via Dual Acceleration

no code implementations • 1 Jul 2021 • Ziwei Ji, Nathan Srebro, Matus Telgarsky

We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e. g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$.

Paper
Add Code

Early-stopped neural networks are consistent

no code implementations • NeurIPS 2021 • Ziwei Ji, Justin D. Li, Matus Telgarsky

This work studies the behavior of shallow ReLU networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero.

Binary Classification

Paper
Add Code

Generalization bounds via distillation

no code implementations • ICLR 2021 • Daniel Hsu, Ziwei Ji, Matus Telgarsky, Lan Wang

This paper theoretically investigates the following empirical phenomenon: given a high-complexity network with poor generalization bounds, one can distill it into a network with nearly identical predictions but low complexity and vastly smaller generalization bounds.

Data Augmentation Generalization Bounds

Paper
Add Code

Gradient descent follows the regularization path for general losses

no code implementations • 19 Jun 2020 • Ziwei Ji, Miroslav Dudík, Robert E. Schapire, Matus Telgarsky

Recent work across many machine learning disciplines has highlighted that standard descent methods, even without explicit regularization, do not merely minimize the training error, but also exhibit an implicit bias.

Paper
Add Code

Directional convergence and alignment in deep learning

no code implementations • NeurIPS 2020 • Ziwei Ji, Matus Telgarsky

In this paper, we show that although the minimizers of cross-entropy and related classification losses are off at infinity, network weights learned by gradient flow converge in direction, with an immediate corollary that network predictions, training errors, and the margin distribution also converge.

General Classification

Paper
Add Code

Neural tangent kernels, transportation mappings, and universal approximation

no code implementations • ICLR 2020 • Ziwei Ji, Matus Telgarsky, Ruicheng Xian

This paper establishes rates of universal approximation for the shallow neural tangent kernel (NTK): network weights are only allowed microscopic changes from random initialization, which entails that activations are mostly unchanged, and the network is nearly equivalent to its linearization.

Paper
Add Code

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

no code implementations • ICLR 2020 • Ziwei Ji, Matus Telgarsky

Recent theoretical work has guaranteed that overparameterized networks trained by gradient descent achieve arbitrarily low training error, and sometimes even low test error.

Paper
Add Code

Approximation power of random neural networks

no code implementations • 18 Jun 2019 • Bolton Bailey, Ziwei Ji, Matus Telgarsky, Ruicheng Xian

This paper investigates the approximation power of three types of random neural networks: (a) infinite width networks, with weights following an arbitrary distribution; (b) finite width networks obtained by subsampling the preceding infinite width networks; (c) finite width networks obtained by starting with standard Gaussian initialization, and then adding a vanishingly small correction to the weights.

Paper
Add Code

Characterizing the implicit bias via a primal-dual analysis

no code implementations • 11 Jun 2019 • Ziwei Ji, Matus Telgarsky

On the other hand, with a properly chosen but aggressive step size schedule, we prove $O(1/t)$ rates for both $\ell_2$ margin maximization and implicit bias, whereas prior work (including all first-order methods for the general hard-margin linear SVM problem) proved $\widetilde{O}(1/\sqrt{t})$ margin rates, or $O(1/t)$ margin rates to a suboptimal margin, with an implied (slower) bias rate.

Paper
Add Code

A gradual, semi-discrete approach to generative network training via explicit Wasserstein minimization

no code implementations • 8 Jun 2019 • Yu-cheng Chen, Matus Telgarsky, Chao Zhang, Bolton Bailey, Daniel Hsu, Jian Peng

This paper provides a simple procedure to fit generative networks to target distributions, with the goal of a small Wasserstein distance (or other optimal transport costs).

Paper
Add Code

Size-Noise Tradeoffs in Generative Networks

no code implementations • NeurIPS 2018 • Bolton Bailey, Matus Telgarsky

This paper investigates the ability of generative networks to convert their input noise distributions into other distributions.

Paper
Add Code

Gradient descent aligns the layers of deep linear networks

no code implementations • ICLR 2019 • Ziwei Ji, Matus Telgarsky

This paper establishes risk convergence and asymptotic weight matrix alignment --- a form of implicit regularization --- of gradient flow and gradient descent when applied to deep linear networks on linearly separable data.

Paper
Add Code

Risk and parameter convergence of logistic regression

no code implementations • 20 Mar 2018 • Ziwei Ji, Matus Telgarsky

Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data.

regression

Paper
Add Code

Social welfare and profit maximization from revealed preferences

no code implementations • 6 Nov 2017 • Ziwei Ji, Ruta Mehta, Matus Telgarsky

Consider the seller's problem of finding optimal prices for her $n$ (divisible) goods when faced with a set of $m$ consumers, given that she can only observe their purchased bundles at posted prices, i. e., revealed preferences.

Paper
Add Code

Spectrally-normalized margin bounds for neural networks

1 code implementation • NeurIPS 2017 • Peter Bartlett, Dylan J. Foster, Matus Telgarsky

This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized "spectral complexity": their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor.

Paper
Code

Neural networks and rational functions

no code implementations • ICML 2017 • Matus Telgarsky

Neural networks and rational functions efficiently approximate each other.

Paper
Add Code

Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

no code implementations • 13 Feb 2017 • Maxim Raginsky, Alexander Rakhlin, Matus Telgarsky

Stochastic Gradient Langevin Dynamics (SGLD) is a popular variant of Stochastic Gradient Descent, where properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration.

Paper
Add Code

Greedy bi-criteria approximations for $k$-medians and $k$-means

no code implementations • 21 Jul 2016 • Daniel Hsu, Matus Telgarsky

This paper investigates the following natural greedy procedure for clustering in the bi-criterion setting: iteratively grow a set of centers, in each round adding the center from a candidate set that maximally decreases clustering cost.

Clustering

Paper
Add Code

Benefits of depth in neural networks

no code implementations • 14 Feb 2016 • Matus Telgarsky

For any positive integer $k$, there exist neural networks with $\Theta(k^3)$ layers, $\Theta(1)$ nodes per layer, and $\Theta(1)$ distinct parameters which can not be approximated by networks with $\mathcal{O}(k)$ layers unless they are exponentially large --- they must possess $\Omega(2^k)$ nodes.

Paper
Add Code

Representation Benefits of Deep Feedforward Networks

no code implementations • 27 Sep 2015 • Matus Telgarsky

This note provides a family of classification problems, indexed by a positive integer $k$, where all shallow networks with fewer than exponentially (in $k$) many nodes exhibit error at least $1/6$, whereas a deep network with 2 nodes in each of $2k$ layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated $k$ times.

2k General Classification

Paper
Add Code

Convex Risk Minimization and Conditional Probability Estimation

no code implementations • 15 Jun 2015 • Matus Telgarsky, Miroslav Dudík, Robert Schapire

This paper proves, in very general settings, that convex risk minimization is a procedure to select a unique conditional probability model determined by the classification problem.

General Classification

Paper
Add Code

Scalable Nonlinear Learning with Adaptive Polynomial Expansions

no code implementations • 2 Oct 2014 • Alekh Agarwal, Alina Beygelzimer, Daniel Hsu, John Langford, Matus Telgarsky

Can we effectively learn a nonlinear representation in time comparable to linear learning?

Computational Efficiency

Paper
Add Code

Moment-based Uniform Deviation Bounds for $k$-means and Friends

1 code implementation • 8 Nov 2013 • Matus Telgarsky, Sanjoy Dasgupta

Suppose $k$ centers are fit to $m$ points by heuristically minimizing the $k$-means cost; what is the corresponding fit over the source distribution?

Clustering

Paper
Code

Boosting with the Logistic Loss is Consistent

no code implementations • 13 May 2013 • Matus Telgarsky

This manuscript provides optimization guarantees, generalization bounds, and statistical consistency results for AdaBoost variants which replace the exponential loss with the logistic and similar losses (specifically, twice differentiable convex losses which are Lipschitz and tend to zero on one side).

Generalization Bounds

Paper
Add Code

Tensor decompositions for learning latent variable models

no code implementations • 29 Oct 2012 • Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, Matus Telgarsky

This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order).

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.