no code implementations • 12 Mar 2024 • Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu
We observe a high degree of redundancy across attention heads in which tokens they attend to.
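As a rough illustration of this observation (not the paper's analysis code), the sketch below measures pairwise top-k token overlap between attention heads for a single query position; the overlap metric and the `topk` value are illustrative assumptions.

```python
import numpy as np

def head_redundancy(attn, topk=8):
    """Quantify how much attention heads agree on which tokens they attend to.

    attn: (num_heads, seq_len) attention weights for one query position.
    Returns an (H, H) matrix of top-k token overlap between every pair of heads.
    (Illustrative similarity measure; the paper may use a different one.)
    """
    H = attn.shape[0]
    top = [set(np.argsort(-attn[h])[:topk]) for h in range(H)]
    overlap = np.zeros((H, H))
    for i in range(H):
        for j in range(H):
            overlap[i, j] = len(top[i] & top[j]) / topk
    return overlap
```

A value close to 1.0 for a pair of heads means they concentrate their attention on essentially the same tokens, which is the redundancy the paper exploits.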
1 code implementation • 2 Feb 2024 • Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman
Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality.
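For context, speculative decoding has a small draft model propose several tokens that the large target model then verifies in a single forward pass. The sketch below is a minimal greedy-verification version with hypothetical `draft_logits`/`target_logits` interfaces; it is a simplified illustration, not the paper's implementation.

```python
import numpy as np

def speculative_decode_greedy(target_logits, draft_logits, prompt, max_new, k=4):
    """Greedy speculative decoding sketch (hypothetical model interfaces).

    target_logits(tokens) -> (len(tokens), vocab) logits for every position.
    draft_logits(tokens)  -> same interface for a small draft model.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            nxt = int(np.argmax(draft_logits(ctx)[-1]))
            proposal.append(nxt)
            ctx.append(nxt)
        # 2) Target model scores prompt + proposal in ONE forward pass.
        scored = target_logits(tokens + proposal)
        # 3) Accept the longest prefix where the target's greedy choice agrees.
        n_accept = 0
        for i, tok in enumerate(proposal):
            pos = len(tokens) - 1 + i          # logits that predict proposal[i]
            if int(np.argmax(scored[pos])) == tok:
                n_accept += 1
            else:
                break
        tokens.extend(proposal[:n_accept])
        # 4) Always take one token from the target so decoding makes progress.
        tokens.append(int(np.argmax(scored[len(tokens) - 1])))
    return tokens[:len(prompt) + max_new]
```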
no code implementations • 30 Oct 2023 • Minghao Yan, Hongyi Wang, Shivaram Venkataraman
As neural networks (NNs) are deployed across diverse sectors, their energy demand grows correspondingly.
no code implementations • 6 Jan 2023 • Song Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman
Finally, we provide insights for future development of model parallelism compression algorithms.
no code implementations • 24 Feb 2022 • Saurabh Agarwal, Chengpo Yan, Ziyi Zhang, Shivaram Venkataraman
Based on these insights, we develop Bagpipe, a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with the computation.
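A minimal sketch of the overlap idea follows, assuming a hypothetical `fetch_remote(ids)` RPC for remote embedding lookups; Bagpipe's actual caching and prefetching policies are more involved.

```python
import threading
from queue import Queue

class EmbeddingPrefetcher:
    """Overlap remote embedding lookups with training compute (illustrative sketch).

    fetch_remote(ids) -> {id: vector} is assumed to be the expensive remote call.
    """
    def __init__(self, fetch_remote, lookahead=2):
        self.fetch_remote = fetch_remote
        self.cache = {}                      # id -> embedding vector (unbounded here)
        self.ready = Queue(maxsize=lookahead)

    def start(self, batches):
        def worker():
            for batch in batches:
                missing = [i for i in batch if i not in self.cache]
                if missing:
                    self.cache.update(self.fetch_remote(missing))
                self.ready.put(batch)        # embeddings for this batch are now local
            self.ready.put(None)
        threading.Thread(target=worker, daemon=True).start()

    def __iter__(self):
        while (batch := self.ready.get()) is not None:
            yield batch, [self.cache[i] for i in batch]
```

While the training loop consumes one batch from the iterator, the background thread is already fetching embeddings for the next `lookahead` batches, hiding the remote access latency behind computation.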
1 code implementation • 4 Feb 2022 • Roger Waleffe, Jason Mohoney, Theodoros Rekatsinas, Shivaram Venkataraman
We study training of Graph Neural Networks (GNNs) for large-scale graphs.
1 code implementation • 20 Nov 2021 • Adarsh Kumar, Kausik Subramanian, Shivaram Venkataraman, Aditya Akella
This simultaneously reduces network bandwidth usage, compute utilization, and memory footprint while preserving model quality.
3 code implementations • 4 Jul 2021 • J. Gregory Pauloski, Qi Huang, Lei Huang, Shivaram Venkataraman, Kyle Chard, Ian Foster, Zhao Zhang
Kronecker-factored Approximate Curvature (K-FAC) has recently been shown to converge faster in deep neural network (DNN) training than stochastic gradient descent (SGD); however, K-FAC's larger memory footprint hinders its applicability to large models.
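As background on the memory issue, K-FAC approximates the curvature of a fully connected layer as a Kronecker product of two small factors, so each layer stores and inverts two modest matrices instead of a full Fisher block. Below is a minimal numpy sketch of the per-layer preconditioning step (damping value and interfaces are illustrative, not this paper's implementation).

```python
import numpy as np

def kfac_precondition(grad_W, acts, grads_out, damping=1e-3):
    """K-FAC preconditioning sketch for one fully connected layer.

    grad_W:    (d_out, d_in) gradient of the loss w.r.t. the layer weights
    acts:      (batch, d_in) layer inputs a
    grads_out: (batch, d_out) gradients w.r.t. the layer pre-activations g
    The Fisher block is approximated as A (x) G with A = E[a a^T], G = E[g g^T],
    so F^-1 grad_W ~= G^-1 grad_W A^-1, requiring only two small inverses.
    """
    n = acts.shape[0]
    A = acts.T @ acts / n + damping * np.eye(acts.shape[1])
    G = grads_out.T @ grads_out / n + damping * np.eye(grads_out.shape[1])
    return np.linalg.solve(G, grad_W) @ np.linalg.inv(A)
```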
1 code implementation • 28 Feb 2021 • Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, Dimitris Papailiopoulos
A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training.
1 code implementation • 2 Feb 2021 • YuHan Liu, Saurabh Agarwal, Shivaram Venkataraman
With the rapid adoption of machine learning (ML), a number of domains now fine-tune models that were pre-trained on a large corpus of data.
1 code implementation • 20 Jan 2021 • Jason Mohoney, Roger Waleffe, Yiheng Xu, Theodoros Rekatsinas, Shivaram Venkataraman
We propose a new framework for computing the embeddings of large-scale graphs on a single machine.
no code implementations • 18 Jan 2021 • Arjun Balasubramanian, Adarsh Kumar, YuHan Liu, Han Cao, Shivaram Venkataraman, Aditya Akella
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency DNN inference.
3 code implementations • 29 Oct 2020 • Saurabh Agarwal, Hongyi Wang, Kangwook Lee, Shivaram Venkataraman, Dimitris Papailiopoulos
These techniques usually require choosing a static compression ratio, forcing users to balance the trade-off between model accuracy and per-iteration speedup.
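For reference, a static-ratio top-k sparsifier looks like the sketch below; the accompanying rule that switches between two ratios is a toy illustration of adapting compression over training, not the paper's actual criterion, and all constants are made-up values.

```python
import numpy as np

def topk_compress(grad, ratio):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
    flat = grad.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape)

class AdaptiveRatio:
    """Toy adaptation rule: compress lightly when gradients change fast,
    aggressively when they are stable (illustrative thresholds only)."""
    def __init__(self, low=0.01, high=0.25, threshold=0.2):
        self.low, self.high, self.threshold = low, high, threshold
        self.prev_norm = None

    def ratio(self, grad):
        norm = float(np.linalg.norm(grad))
        rel_change = abs(norm - self.prev_norm) / self.prev_norm if self.prev_norm else 1.0
        self.prev_norm = norm
        return self.high if rel_change > self.threshold else self.low
```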
no code implementations • 7 Feb 2020 • Adarsh Kumar, Arjun Balasubramanian, Shivaram Venkataraman, Aditya Akella
In this work, we observe that caching intermediate layer outputs can help us avoid running all the layers of a DNN for a sizeable fraction of inference requests.
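A minimal sketch of the caching idea, assuming a hypothetical split of the network into `early_layers` and `late_layers` and a coarse quantized activation key; the learned caches in the paper are more sophisticated than this exact-match lookup.

```python
import numpy as np

def quantize_key(x, decimals=1):
    """Coarse key for an intermediate activation; maps similar inputs together."""
    return np.round(x, decimals).tobytes()

def cached_inference(x, early_layers, late_layers, cache):
    """Run the early layers, then skip the rest of the DNN on a cache hit."""
    h = early_layers(x)                 # always pay for the first few layers
    key = quantize_key(h)
    if key in cache:
        return cache[key]               # hit: remaining layers are skipped
    y = late_layers(h)                  # miss: run the full network
    cache[key] = y
    return y
```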
no code implementations • 11 Oct 2019 • Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil Devanur, Ion Stoica
Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale.
no code implementations • 2 May 2019 • Jack Kosaian, K. V. Rashmi, Shivaram Venkataraman
To scale to high query rates, prediction serving systems run on many machines in cluster settings and are thus prone to slowdowns and failures that inflate tail latency and violate strict latency targets.
no code implementations • 29 Mar 2019 • Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael. I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek Murray, Kunle Olukotun, Dimitris Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Siddhartha Sen, Virginia Smith, Alex Smola, Dawn Song, Evan Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar
Machine learning (ML) techniques are enjoying rapidly increasing adoption.
1 code implementation • 17 Jan 2019 • Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang
With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across many of their products.
3 code implementations • 4 Jun 2018 • Jack Kosaian, K. V. Rashmi, Shivaram Venkataraman
To the best of our knowledge, this work proposes the first learning-based approach for designing codes, and also presents the first coding-theoretic solution that can provide resilience for any non-linear (differentiable) computation.
no code implementations • 20 Feb 2017 • Xinghao Pan, Shivaram Venkataraman, Zizheng Tai, Joseph Gonzalez
Distributed optimization algorithms are widely used in many industrial machine learning applications.
no code implementations • 29 Oct 2016 • Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, Benjamin Recht
Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements.
no code implementations • 17 Feb 2016 • Stephen Tu, Rebecca Roelofs, Shivaram Venkataraman, Benjamin Recht
We demonstrate that distributed block coordinate descent can quickly solve kernel regression and classification problems with millions of data points.
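As a single-machine illustration of the underlying solver (the paper's contribution is the distributed version), the sketch below runs block coordinate descent on the kernel ridge regression system (K + lambda*I) alpha = y, solving one small block system at a time; the RBF kernel and hyperparameters are illustrative choices.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.1):
    # Squared Euclidean distances -> RBF kernel matrix.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

def block_cd_krr(X, y, lam=1e-2, block_size=256, epochs=10, gamma=0.1):
    """Block coordinate descent for kernel ridge regression: (K + lam*I) a = y."""
    n = X.shape[0]
    a = np.zeros(n)
    Ka = np.zeros(n)                                   # running product K @ a
    for _ in range(epochs):
        for start in range(0, n, block_size):
            B = slice(start, min(start + block_size, n))
            K_Bn = rbf_kernel(X[B], X, gamma)          # rows of K for this block
            K_BB = K_Bn[:, B]
            r_B = y[B] - Ka[B] - lam * a[B]            # residual restricted to block
            delta = np.linalg.solve(K_BB + lam * np.eye(K_BB.shape[0]), r_B)
            a[B] += delta
            Ka += K_Bn.T @ delta                       # update K @ a incrementally
    return a
```

Each pass only materializes one block of kernel rows at a time, which is what makes the approach amenable to distribution across machines.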
no code implementations • 26 May 2015 • Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks.