no code implementations • ICML 2020 • Jenny Hamer, Mehryar Mohri, Ananda Theertha Suresh
We provide communication-efficient ensemble algorithms for federated learning, where per-round communication cost is independent of the size of the ensemble.
no code implementations • 14 Apr 2024 • Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton
Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation.
no code implementations • 2 Apr 2024 • Joy Qiping Yang, Salman Salamatian, Ziteng Sun, Ananda Theertha Suresh, Ahmad Beirami
The goal of language model alignment is to alter $p$ to a new distribution $\phi$ that results in a higher expected reward while keeping $\phi$ close to $p$. A popular alignment method is KL-constrained reinforcement learning (RL), which chooses a distribution $\phi_\Delta$ that maximizes $E_{\phi_{\Delta}} r(y)$ subject to a relative entropy constraint $KL(\phi_\Delta || p) \leq \Delta$. Another simple alignment method is best-of-$N$, where $N$ samples are drawn from $p$ and the one with the highest reward is selected.
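The best-of-$N$ method in the line above is simple enough to sketch directly. The following Python snippet is a minimal illustration; the base distribution `p`, the reward values, and the choice $N=4$ are arbitrary stand-ins, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy base distribution p over 4 outcomes and a reward for each outcome.
# Both are illustrative stand-ins, not from the paper.
p = np.array([0.4, 0.3, 0.2, 0.1])
reward = np.array([0.0, 1.0, 2.0, 3.0])

def best_of_n(n, rng):
    """Draw n samples from p and return the one with the highest reward."""
    samples = rng.choice(len(p), size=n, p=p)
    return samples[np.argmax(reward[samples])]

# Expected reward under p vs. under the best-of-4 policy (Monte Carlo).
base = float(reward @ p)
draws = [reward[best_of_n(4, rng)] for _ in range(20000)]
aligned = float(np.mean(draws))
assert aligned > base  # keeping the max-reward sample helps in expectation
```

Each individual draw still comes from $p$; only the selection step skews the resulting distribution toward high-reward outcomes, which is why best-of-$N$ stays close to the base model.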
no code implementations • 15 Mar 2024 • Ziteng Sun, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh
To the best of our knowledge, our work is the first to establish improvement over speculative decoding through a better draft verification algorithm.
no code implementations • 12 Mar 2024 • Jae Hun Ro, Srinadh Bhojanapalli, Zheng Xu, Yanxiang Zhang, Ananda Theertha Suresh
Cross-device federated learning (FL) is a technique that trains a model on data distributed across typically millions of edge devices without data leaving the devices.
no code implementations • 3 Jan 2024 • Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D'Amour, Jacob Eisenstein, Chirag Nagpal, Ananda Theertha Suresh
A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the base policy is equal to $\log (n) - (n-1)/n.$ We disprove the validity of this claim, and show that it is an upper bound on the actual KL divergence.
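For a discrete toy distribution the best-of-$n$ policy has a closed form (the law of the maximum of $n$ i.i.d. draws, once outcomes are indexed by increasing reward), so the upper-bound claim can be checked numerically. The distribution below is an arbitrary illustrative choice:

```python
import numpy as np

# Toy base distribution; outcomes are indexed in order of strictly
# increasing reward, so best-of-n keeps the highest index drawn.
p = np.array([0.4, 0.3, 0.2, 0.1])
F = np.cumsum(p)

def best_of_n_policy(n):
    """Exact distribution of the best-of-n sample: P(max of n draws = i)."""
    Fn = F ** n
    return np.diff(np.concatenate(([0.0], Fn)))

for n in range(1, 21):
    pi = best_of_n_policy(n)
    kl = float(np.sum(pi * np.log(pi / p)))
    bound = np.log(n) - (n - 1) / n
    # The expression from the literature upper-bounds the true KL divergence.
    assert kl <= bound + 1e-12
```

Already at $n=2$ the toy example gives $KL \approx 0.157$ against the expression's value $\log 2 - 1/2 \approx 0.193$, so the gap between the claimed identity and the actual divergence is visible even in small cases.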
no code implementations • 11 Dec 2023 • Alex Kulesza, Ananda Theertha Suresh, Yuyan Wang
We propose a new algorithm and show that it is min-max optimal, achieving the best possible constant in the leading term of the mean squared error for all $\epsilon$, and that this constant is the same as the optimal algorithm under the swap model.
no code implementations • 6 Dec 2023 • Lucas Monteiro Paes, Ananda Theertha Suresh, Alex Beutel, Flavio P. Calmon, Ahmad Beirami
Here, the sample complexity for estimating the worst-case performance gap across groups (e.g., the largest difference in error rates) increases exponentially with the number of group-denoting sensitive attributes.
no code implementations • NeurIPS 2023 • Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu
We show that the optimal draft selection algorithm (transport plan) can be computed via linear programming, whose best-known runtime is exponential in $k$.
no code implementations • 19 Jul 2023 • Ziteng Sun, Ananda Theertha Suresh, Aditya Krishna Menon
Training machine learning models with differential privacy (DP) has received increasing interest in recent years.
no code implementations • 10 Jul 2023 • Xuechen Zhang, Mingchen Li, Xiangyu Chang, Jiasi Chen, Amit K. Roy-Chowdhury, Ananda Theertha Suresh, Samet Oymak
These insights on scale and modularity motivate a new federated learning approach we call "You Only Load Once" (FedYolo): clients load a full PTF model once, and all subsequent updates are carried out through communication-efficient modules with limited catastrophic forgetting, with each task assigned to its own module.
no code implementations • 1 Mar 2023 • Travis Dick, Alex Kulesza, Ziteng Sun, Ananda Theertha Suresh
We propose a new definition of instance optimality for differentially private estimation algorithms.
no code implementations • 14 Feb 2023 • Clément L. Canonne, Ziteng Sun, Ananda Theertha Suresh
We study the problem of discrete distribution estimation in KL divergence and provide concentration bounds for the Laplace estimator.
no code implementations • 12 Aug 2022 • Raef Bassily, Mehryar Mohri, Ananda Theertha Suresh
A key problem in a variety of applications is that of domain adaptation from a public source domain, for which a relatively large amount of labeled data with no privacy constraints is at one's disposal, to a private target domain, for which a private sample is available with very few or no labeled data.
no code implementations • 7 Jun 2022 • YuHan Liu, Ananda Theertha Suresh, Wennan Zhu, Peter Kairouz, Marco Gruteser
In this scenario, the amount of noise injected into the histogram to obtain differential privacy is proportional to the maximum user contribution, which can be inflated by a few outliers.
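A standard way to control this is to cap each user's contribution before adding noise, so the sensitivity (and hence the Laplace noise scale) depends on the cap rather than on the largest contributor. The sketch below is a generic clipped-histogram baseline under assumed parameters, not the estimator proposed in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10          # histogram buckets
epsilon = 1.0
clip = 5        # per-user contribution bound (an illustrative choice)

# Synthetic per-user item lists: most users contribute a few items,
# two outliers contribute hundreds.
users = [list(rng.integers(0, k, size=3)) for _ in range(1000)]
users += [list(rng.integers(0, k, size=500)) for _ in range(2)]

def dp_histogram(users, clip, epsilon, rng):
    hist = np.zeros(k)
    for items in users:
        # Simple truncation strategy: keep only the first `clip` items.
        for item in items[:clip]:
            hist[item] += 1
    # L1 sensitivity of the capped histogram is `clip`, so the Laplace
    # noise scale is clip / epsilon rather than max-contribution / epsilon.
    return hist + rng.laplace(scale=clip / epsilon, size=k)

noisy = dp_histogram(users, clip, epsilon, rng)
```

Without the cap, the two 500-item outliers would force a noise scale of $500/\epsilon$; with it, the scale is $5/\epsilon$, at the cost of a small bias from the dropped items.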
no code implementations • 21 Apr 2022 • Raef Bassily, Mehryar Mohri, Ananda Theertha Suresh
For the family of linear hypotheses, we give a pure DP learning algorithm that benefits from relative deviation margin guarantees, as well as an efficient DP learning algorithm with margin guarantees.
no code implementations • FL4NLP (ACL) 2022 • Jae Hun Ro, Theresa Breiner, Lara McConnaughey, Mingqing Chen, Ananda Theertha Suresh, Shankar Kumar, Rajiv Mathews
Most studies in cross-device federated learning focus on small models, due to the server-client communication and on-device computation bottlenecks.
no code implementations • 9 Mar 2022 • Ananda Theertha Suresh, Ziteng Sun, Jae Hun Ro, Felix Yu
We show that applying the proposed protocol as a subroutine in distributed optimization algorithms leads to better convergence rates.
no code implementations • 7 Mar 2022 • Wei-Ning Chen, Christopher A. Choquette-Choo, Peter Kairouz, Ananda Theertha Suresh
We consider the problem of training a $d$ dimensional model with distributed differential privacy (DP) where secure aggregation (SecAgg) is used to ensure that the server only sees the noisy sum of $n$ model updates in every training round.
no code implementations • NeurIPS 2021 • Corinna Cortes, Mehryar Mohri, Dmitry Storcheus, Ananda Theertha Suresh
We study the problem of learning accurate ensemble predictors, in particular boosting, in the presence of multiple source domains.
no code implementations • NeurIPS 2021 • Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian U. Stich, Ananda Theertha Suresh
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients, which gives rise to the client drift phenomenon.
no code implementations • 9 Nov 2021 • Jayadev Acharya, Ayush Jain, Gautam Kamath, Ananda Theertha Suresh, Huanyu Zhang
We study the problem of robustly estimating the parameter $p$ of an Erdős–Rényi random graph on $n$ nodes, where a $\gamma$ fraction of nodes may be adversarially corrupted.
no code implementations • 28 Oct 2021 • Wittawat Jitkrittum, Michal Lukasik, Ananda Theertha Suresh, Felix Yu, Gang Wang
In this paper, we study training and inference of neural networks under the MPC setup.
1 code implementation • 4 Aug 2021 • Jae Hun Ro, Ananda Theertha Suresh, Ke Wu
Federated learning is a machine learning technique that enables training across decentralized data.
2 code implementations • 14 Jul 2021 • Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Aguera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horvath, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, Satyen Kale, Sai Praneeth Karimireddy, Jakub Konecny, Sanmi Koyejo, Tian Li, Luyang Liu, Mehryar Mohri, Hang Qi, Sashank J. Reddi, Peter Richtarik, Karan Singhal, Virginia Smith, Mahdi Soltanolkotabi, Weikang Song, Ananda Theertha Suresh, Sebastian U. Stich, Ameet Talwalkar, Hongyi Wang, Blake Woodworth, Shanshan Wu, Felix X. Yu, Honglin Yuan, Manzil Zaheer, Mi Zhang, Tong Zhang, Chunxiang Zheng, Chen Zhu, Wennan Zhu
Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection.
no code implementations • ICLR 2022 • Pranjal Awasthi, Abhimanyu Das, Rajat Sen, Ananda Theertha Suresh
We also demonstrate empirically that our method instantiated with a well-designed general purpose mixture likelihood family can obtain superior performance for a variety of tasks across time-series forecasting and regression datasets with different data distributions.
no code implementations • 11 May 2021 • Antonious M. Girgis, Deepesh Data, Suhas Diggavi, Ananda Theertha Suresh, Peter Kairouz
The central question studied in this paper is Renyi Differential Privacy (RDP) guarantees for general discrete local mechanisms in the shuffle privacy model.
no code implementations • 6 Apr 2021 • Jae Ro, Mingqing Chen, Rajiv Mathews, Mehryar Mohri, Ananda Theertha Suresh
We propose a communication-efficient distributed algorithm called Agnostic Federated Averaging (or AgnosticFedAvg) to minimize the domain-agnostic objective proposed in Mohri et al. (2019), which is amenable to other private mechanisms such as secure aggregation.
no code implementations • NeurIPS 2021 • Ayush Sekhari, Jayadev Acharya, Gautam Kamath, Ananda Theertha Suresh
We study the problem of unlearning datapoints from a learnt model.
no code implementations • NeurIPS 2021 • Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, Ananda Theertha Suresh
We show that for high-dimensional mean estimation, empirical risk minimization with smooth losses, stochastic convex optimization, and learning hypothesis classes with finite metric entropy, the privacy cost decreases as $O(1/\sqrt{m})$ as users provide more samples.
1 code implementation • 24 Nov 2020 • Prathamesh Mayekar, Shubham Jha, Ananda Theertha Suresh, Himanshu Tyagi
We propose Wyner-Ziv estimators, which are communication- and computation-efficient and near-optimal when an upper bound on the distance between the side information and the data is known.
no code implementations • 3 Nov 2020 • Ananda Theertha Suresh
We propose a simple robust hypothesis test that has the same sample complexity as that of the optimal Neyman-Pearson test up to constants, but is robust to distribution perturbations under Hellinger distance.
no code implementations • 25 Aug 2020 • Corinna Cortes, Mehryar Mohri, Ananda Theertha Suresh, Ningshan Zhang
We present a new discriminative technique for the multiple-source adaptation (MSA) problem.
no code implementations • 17 Aug 2020 • Antonious M. Girgis, Deepesh Data, Suhas Diggavi, Peter Kairouz, Ananda Theertha Suresh
We consider a distributed empirical risk minimization (ERM) optimization problem with communication efficiency and privacy requirements, motivated by the federated learning (FL) framework.
1 code implementation • 8 Aug 2020 • Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients, which gives rise to the client drift phenomenon.
no code implementations • NeurIPS 2020 • Yuhan Liu, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, Michael Riley
If each user has $m$ samples, we show that straightforward applications of Laplace or Gaussian mechanisms require the number of users to be $\mathcal{O}(k/(m\alpha^2) + k/\epsilon\alpha)$ to achieve an $\ell_1$ distance of $\alpha$ between the true and estimated distributions, with the privacy-induced penalty $k/\epsilon\alpha$ independent of the number of samples per user $m$.
no code implementations • 19 Jul 2020 • Yishay Mansour, Mehryar Mohri, Jae Ro, Ananda Theertha Suresh, Ke Wu
We present a theoretical and algorithmic study of the multiple-source domain adaptation problem in the common scenario where the learner has access only to a limited amount of labeled target data, but has at its disposal a large amount of labeled data from multiple source domains.
no code implementations • 26 Jun 2020 • Corinna Cortes, Mehryar Mohri, Ananda Theertha Suresh
We present a series of new and more favorable margin-based learning guarantees that depend on the empirical margin loss of a predictor.
1 code implementation • 25 Feb 2020 • Yishay Mansour, Mehryar Mohri, Jae Ro, Ananda Theertha Suresh
The standard objective in machine learning is to train a single model for all users.
8 code implementations • 10 Dec 2019 • Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, Sen Zhao
FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches.
no code implementations • 18 Nov 2019 • Ziteng Sun, Peter Kairouz, Ananda Theertha Suresh, H. Brendan McMahan
This paper focuses on backdoor attacks in the federated learning setting, where the goal of the adversary is to reduce the performance of the model on targeted tasks while maintaining good performance on the main task.
7 code implementations • ICML 2020 • Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh
We obtain tight convergence rates for FedAvg and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence.
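Client drift is easy to reproduce on a two-client toy problem with quadratic losses of different curvatures: with several local steps per round, FedAvg converges to a fixed point that is not the global optimum. Everything below (curvatures, step sizes, round counts) is an illustrative choice, not the paper's setup:

```python
import numpy as np

# Two clients with quadratic losses f_i(x) = 0.5 * lam_i * (x - a_i)^2.
# The optimum of the average objective is the curvature-weighted mean of a_i.
lam = np.array([1.0, 0.1])
a = np.array([0.0, 10.0])
opt = float(lam @ a / lam.sum())   # = 10/11 ≈ 0.909

eta, K, rounds = 0.1, 10, 300

# FedAvg: each client runs K local gradient steps, the server averages.
x = 0.0
for _ in range(rounds):
    locals_ = []
    for l, ai in zip(lam, a):
        y = x
        for _ in range(K):
            y -= eta * l * (y - ai)   # local gradient step on f_i
        locals_.append(y)
    x = float(np.mean(locals_))

# Centralized gradient descent on the average objective, for reference.
z = 0.0
for _ in range(rounds * K):
    z -= eta * float(lam @ (z - a)) / 2

assert abs(z - opt) < 1e-3   # centralized training finds the optimum
assert abs(x - opt) > 0.3    # FedAvg drifts to a different fixed point
```

Here FedAvg settles near $x \approx 1.28$ while the true optimum is $10/11 \approx 0.91$; SCAFFOLD's control variates are designed to remove exactly this bias.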
no code implementations • CONLL 2019 • Mingqing Chen, Ananda Theertha Suresh, Rajiv Mathews, Adeline Wong, Cyril Allauzen, Françoise Beaufays, Michael Riley
The n-gram language models trained with federated learning are compared to n-grams trained with traditional server-based algorithms using A/B tests on tens of millions of users of a virtual keyboard.
no code implementations • NeurIPS 2019 • Ananda Theertha Suresh
For a dataset of label-count pairs, an anonymized histogram is the multiset of counts.
1 code implementation • 20 Aug 2019 • Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, Sanjiv Kumar
Motivated by this, differentially private stochastic gradient descent (SGD) algorithms for training machine learning models have been proposed.
no code implementations • 8 Aug 2019 • Jayadev Acharya, Ananda Theertha Suresh
A primary concern of excessive reuse of test datasets in machine learning is that it can lead to overfitting.
no code implementations • NeurIPS 2019 • Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, Sanjiv Kumar
For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the loss gradient based on these classes, known as the sampled softmax method.
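A minimal version of the idea: per example, score only the true class plus a small uniformly sampled set of negatives, and take the cross-entropy gradient over that subset. The sketch below uses plain uniform sampling and omits the $-\log q$ logit correction applied in practice for unbiasedness; all sizes and the learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, m = 1000, 16, 50   # m = sampled negatives per example

W = rng.standard_normal((num_classes, dim)) * 0.01
X = rng.standard_normal((256, dim))
y = rng.integers(0, num_classes, size=256)

def sampled_softmax_step(W, X, y, lr=0.5):
    """One SGD pass using a sampled-softmax gradient estimate."""
    total = 0.0
    for x, t in zip(X, y):
        neg = rng.choice(num_classes, size=m, replace=False)
        neg = neg[neg != t]                  # drop the true class if sampled
        cls = np.concatenate(([t], neg))     # true class at index 0
        logits = W[cls] @ x
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        total += -np.log(probs[0])
        grad = probs.copy()
        grad[0] -= 1.0                       # d(cross-entropy)/d(logits)
        W[cls] -= lr * np.outer(grad, x) / len(X)
    return total / len(X)

loss0 = sampled_softmax_step(W, X, y)
for _ in range(20):
    loss = sampled_softmax_step(W, X, y)
assert loss < loss0   # training on the sampled loss still makes progress
```

Each step touches only $m+1$ rows of $W$ instead of all 1000, which is the source of the speedup over the full softmax.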
no code implementations • CL (ACL) 2021 • Ananda Theertha Suresh, Brian Roark, Michael Riley, Vlad Schogol
Weighted finite automata (WFA) are often used to represent probabilistic models, such as $n$-gram language models, since they are efficient for recognition tasks in time and space.
6 code implementations • 1 Feb 2019 • Mehryar Mohri, Gary Sivek, Ananda Theertha Suresh
A key learning scenario in large-scale applications is that of federated learning, where a centralized model is trained based on data originating from a large number of clients.
no code implementations • 20 Nov 2018 • Ehsan Variani, Ananda Theertha Suresh, Mitchel Weintraub
Most of the parameters in large vocabulary models are used in the embedding layer to map categorical features to vectors and in the softmax layer for classification weights.
no code implementations • NeurIPS 2018 • Naman Agarwal, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, H. Brendan McMahan
Distributed stochastic gradient descent is an important subroutine in distributed learning.
no code implementations • NeurIPS 2017 • Xiang Wu, Ruiqi Guo, Ananda Theertha Suresh, Sanjiv Kumar, Daniel N. Holtmann-Rice, David Simcha, Felix Yu
We propose a multiscale quantization approach for fast similarity search on large, high-dimensional datasets.
no code implementations • 15 Nov 2017 • Shankar Kumar, Michael Nirschl, Daniel Holtmann-Rice, Hank Liao, Ananda Theertha Suresh, Felix Yu
Recurrent neural network (RNN) language models (LMs) and Long Short-Term Memory (LSTM) LMs, a variant of RNN LMs, have been shown to outperform traditional N-gram LMs on speech recognition tasks.
1 code implementation • NeurIPS 2017 • Rajat Sen, Ananda Theertha Suresh, Karthikeyan Shanmugam, Alexandros G. Dimakis, Sanjay Shakkottai
We consider the problem of non-parametric Conditional Independence testing (CI testing) for continuous random variables.
no code implementations • ICML 2017 • Jayadev Acharya, Hirakendu Das, Alon Orlitsky, Ananda Theertha Suresh
Symmetric distribution properties such as support size, support coverage, entropy, and proximity to uniformity, arise in many applications.
no code implementations • ICML 2017 • Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, Ananda Theertha Suresh
We consider $(\epsilon,\delta)$-PAC maximum-selection and ranking for general probabilistic models whose comparison probabilities satisfy strong stochastic transitivity and stochastic triangle inequality.
no code implementations • 18 Feb 2017 • Yury Polyanskiy, Ananda Theertha Suresh, Yihong Wu
For noisy population recovery, the sharp sample complexity turns out to be more sensitive to dimension and scales as $\exp(\Theta(d^{1/3} \log^{2/3}(1/\delta)))$ except for the trivial cases of $\epsilon=0, 1/2$ or $1$.
no code implementations • 9 Nov 2016 • Jayadev Acharya, Hirakendu Das, Alon Orlitsky, Ananda Theertha Suresh
The advent of data science has spurred interest in estimating properties of distributions over large alphabets.
no code implementations • ICML 2017 • Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, H. Brendan McMahan
Motivated by the need for distributed learning and optimization algorithms with low communication cost, we study communication efficient algorithms for distributed mean estimation.
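One classical baseline in this setting is stochastic binary quantization: each client sends its coordinate-wise minimum, maximum, and one bit per coordinate, and the server's average remains unbiased. A minimal sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 100                        # clients, dimension
X = rng.uniform(0.0, 1.0, size=(n, d))

def quantize_1bit(x, rng):
    """Stochastic binary quantization: send x_min, x_max, one bit/coord."""
    lo, hi = x.min(), x.max()
    p = (x - lo) / (hi - lo)          # probability of rounding up
    bits = rng.random(d) < p
    return lo, hi, bits

def dequantize(lo, hi, bits):
    return np.where(bits, hi, lo)

# The server averages the dequantized vectors; the estimate is unbiased
# since E[dequantize(quantize(x))] = x coordinate-wise.
est = np.mean([dequantize(*quantize_1bit(x, rng)) for x in X], axis=0)
true = X.mean(axis=0)
assert np.abs(est - true).mean() < 0.1
```

The per-client communication is $d$ bits plus two floats, versus $32d$ bits for raw vectors; the variance can be reduced further with random rotations or variable-length coding.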
no code implementations • NeurIPS 2016 • Felix X. Yu, Ananda Theertha Suresh, Krzysztof Choromanski, Daniel Holtmann-Rice, Sanjiv Kumar
We present an intriguing discovery related to Random Fourier Features: in Gaussian kernel approximation, replacing the random Gaussian matrix by a properly scaled random orthogonal matrix significantly decreases kernel approximation error.
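The construction is direct to sketch: draw a Gaussian matrix, orthogonalize it with a QR decomposition, and rescale the rows so their norms are chi-distributed like Gaussian rows. Below, both the plain random Fourier feature map and its orthogonal variant approximate the Gaussian kernel; the dimensions and tolerance are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 8, 512   # input dimension, number of features (D a multiple of d)

def rff_weights(d, D, rng):
    """Plain random Fourier features: i.i.d. Gaussian rows."""
    return rng.standard_normal((D, d))

def orf_weights(d, D, rng):
    """Orthogonal variant: QR-orthogonalized blocks with chi-scaled rows."""
    blocks = []
    for _ in range(D // d):
        G = rng.standard_normal((d, d))
        Q, _ = np.linalg.qr(G)
        # Rescale rows so their norms match those of Gaussian rows.
        S = np.linalg.norm(rng.standard_normal((d, d)), axis=1)
        blocks.append(S[:, None] * Q)
    return np.vstack(blocks)

def features(X, W, b):
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

b = rng.uniform(0, 2 * np.pi, size=D)
X = rng.standard_normal((40, d)) * 0.5
K_true = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))

for maker in (rff_weights, orf_weights):
    Z = features(X, maker(d, D, rng), b)
    err = np.abs(Z @ Z.T - K_true).mean()
    assert err < 0.1   # both variants approximate the Gaussian kernel
```

With a fixed feature budget $D$, the orthogonal rows decorrelate the per-feature errors, which is where the reduction in kernel approximation error comes from.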
no code implementations • ICLR 2018 • Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, Dave Bacon
We consider learning algorithms for this setting where on each round, each client independently computes an update to the current model based on its local data, and communicates this update to a central server, where the client-side updates are aggregated to compute a new global model.
no code implementations • NeurIPS 2015 • Alon Orlitsky, Ananda Theertha Suresh
Second, they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the exact distribution but, like all natural estimators, restricted to assigning the same probability to all symbols appearing the same number of times. Specifically, for distributions over $k$ symbols and $n$ samples, we show that for both comparisons, a simple variant of the Good-Turing estimator is always within KL divergence of $(3+o(1))/n^{1/3}$ of the best estimator, and that a more involved estimator is within $\tilde{\mathcal{O}}(\min(k/n, 1/\sqrt n))$.
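The core Good-Turing idea can be sketched in a few lines via the missing-mass estimate $\varphi_1/n$ (the fraction of samples that are singletons), a building block of the estimators above; the uniform source and sample size below are illustrative, and this is not the paper's more involved estimator:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
k, n = 2000, 2000
sample = rng.integers(0, k, size=n)   # n draws from a uniform source over k symbols

counts = Counter(sample)
phi1 = sum(1 for c in counts.values() if c == 1)   # symbols seen exactly once

# Good-Turing estimate of the missing mass (total probability of
# symbols never observed) is phi_1 / n.
gt_missing = phi1 / n

# True missing mass, computable here because the source is known uniform.
true_missing = (k - len(counts)) / k

assert abs(gt_missing - true_missing) < 0.05
```

For a uniform source with $k = n$, both the estimate and the true missing mass concentrate near $e^{-1} \approx 0.368$.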
no code implementations • 23 Nov 2015 • Alon Orlitsky, Ananda Theertha Suresh, Yihong Wu
We derive a class of estimators that provably predict $U$ not just for constant $t>1$, but all the way up to $t$ proportional to $\log n$.
no code implementations • 16 Apr 2015 • Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, Venkatadheeraj Pichapathi, Ananda Theertha Suresh
There has been considerable recent interest in distribution-tests whose run-time and sample requirements are sublinear in the domain-size $k$.
no code implementations • 27 Mar 2015 • Alon Orlitsky, Ananda Theertha Suresh
We also provide an estimator that runs in linear time and incurs competitive regret of $\tilde{\mathcal{O}}(\min(k/n, 1/\sqrt n))$, and show that for natural estimators this competitive regret is inevitable.
no code implementations • 7 Jan 2015 • Aditya Bhaskara, Ananda Theertha Suresh, Morteza Zadimoghaddam
For learning a mixture of $k$ axis-aligned Gaussians in $d$ dimensions, we give an algorithm that outputs a mixture of $O(k/\epsilon^3)$ Gaussians that is $\epsilon$-close in statistical distance to the true distribution, without any separation assumptions.
no code implementations • 2 Aug 2014 • Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, Himanshu Tyagi
It was recently shown that estimating the Shannon entropy $H({\rm p})$ of a discrete $k$-symbol distribution ${\rm p}$ requires $\Theta(k/\log k)$ samples, a number that grows near-linearly in the support size.
no code implementations • 29 May 2014 • Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, Ananda Theertha Suresh
The Poisson-sampling technique eliminates dependencies among symbol appearances in a random sequence.
no code implementations • NeurIPS 2014 • Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, Ananda Theertha Suresh
For mixtures of any $k$ $d$-dimensional spherical Gaussians, we derive an intuitive spectral-estimator that uses $\mathcal{O}_k\bigl(\frac{d\log^2d}{\epsilon^4}\bigr)$ samples and runs in time $\mathcal{O}_{k,\epsilon}(d^3\log^5 d)$, both significantly lower than previously known.