no code implementations • 12 Apr 2024 • Juntaek Lim, Youngeun Kwon, Ranggi Hwang, Kiwan Maeng, G. Edward Suh, Minsoo Rhu
Differential privacy (DP) is widely employed in industry as a practical standard for privacy protection.
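For context, the Laplace mechanism is the textbook way DP is applied to a numeric query; the sketch below is a generic illustration of that mechanism, not code from the paper (function and parameter names are my own):

```python
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Laplace mechanism: release true_count plus Laplace(0, sensitivity/epsilon)
    noise, making the released value epsilon-differentially private."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
```

Smaller epsilon means larger noise scale and stronger privacy; the noisy count is unbiased, so averages over many releases converge to the true value.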
no code implementations • 27 Nov 2023 • Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, YongDeok Kim, Minsoo Rhu
As large language models (LLMs) become widespread across application domains, a critical challenge facing the AI community is how to train these large AI models cost-effectively.
no code implementations • 23 Feb 2023 • Yujeong Choi, John Kim, Minsoo Rhu
While providing low latency is a fundamental requirement in deploying recommendation services, achieving high resource utilization is also crucial for cost-effectively maintaining the datacenter.
1 code implementation • 26 Jan 2023 • Maximilian Lam, Jeff Johnson, Wenjie Xiong, Kiwan Maeng, Udit Gupta, Yang Li, Liangzhen Lai, Ilias Leontiadis, Minsoo Rhu, Hsien-Hsin S. Lee, Vijay Janapa Reddi, Gu-Yeon Wei, David Brooks, G. Edward Suh
Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to $100,000$ queries per second -- a $>100\times$ throughput improvement over a CPU-based baseline -- while maintaining model accuracy.
no code implementations • 26 Aug 2022 • Beomsik Park, Ranggi Hwang, Dongho Yoon, Yoonhyuk Choi, Minsoo Rhu
The widespread deployment of machine learning (ML) is raising serious concerns on protecting the privacy of users who contributed to the collection of training data.
no code implementations • 10 May 2022 • Youngeun Kwon, Minsoo Rhu
Prior work proposed caching frequently accessed embeddings inside GPU memory as a means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such a cache design.
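The caching idea described above can be sketched as an LRU cache over embedding rows. This toy model (all names hypothetical) only illustrates the hit/miss behavior such a design exhibits, with plain dicts standing in for GPU and CPU memory:

```python
from collections import OrderedDict

class EmbeddingCache:
    """Illustrative LRU cache for embedding rows: hits are served from the
    'GPU-resident' cache, misses fall through to the 'CPU-resident' table."""

    def __init__(self, backing_table, capacity):
        self.backing = backing_table      # dict: row id -> embedding vector (CPU side)
        self.capacity = capacity          # number of rows that fit in GPU memory
        self.cache = OrderedDict()        # GPU-resident rows, LRU order
        self.hits = self.misses = 0

    def lookup(self, idx):
        if idx in self.cache:
            self.cache.move_to_end(idx)   # mark as most recently used
            self.hits += 1
            return self.cache[idx]
        self.misses += 1
        vec = self.backing[idx]           # models the traffic to CPU memory
        self.cache[idx] = vec
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least-recently-used row
        return vec
```

Every miss models one CPU-memory transfer, so the miss counter is exactly the traffic such a cache is meant to filter down.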
no code implementations • 10 May 2022 • Yunjae Lee, Jinha Chung, Minsoo Rhu
Our work demonstrates that an ISP-based large-scale GNN training system can achieve both high capacity storage and high performance, opening up opportunities for ML practitioners to train large GNN datasets without being hampered by the physical limitations of main memory size.
no code implementations • 1 Mar 2022 • Ranggi Hwang, Minhoo Kang, Jiwon Lee, Dongyun Kam, Youngjoo Lee, Minsoo Rhu
Graph convolutional neural networks (GCNs) have emerged as a key technology in various application domains where the input data is relational.
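A single graph-convolution layer, in the common Kipf and Welling formulation, can be written in a few lines of NumPy; this is a generic sketch of the operation GCNs are built from, not code from the paper:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W),
    where A is the adjacency matrix, H the node features, W the weights."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # degrees of the self-looped graph
    D_inv_sqrt = np.diag(deg ** -0.5)         # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
```

The normalized aggregation is the sparse, irregular part of the computation; the feature transform `H @ W` is a dense matrix multiply, which is why GCN accelerators treat the two phases differently.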
no code implementations • 27 Feb 2022 • Yunseong Kim, Yujeong Choi, Minsoo Rhu
However, maximizing server utilization and system throughput is also crucial for ML service providers as it helps lower the total-cost-of-ownership.
no code implementations • 25 Oct 2020 • Youngeun Kwon, Yunjae Lee, Minsoo Rhu
Personalized recommendations are one of the most widely deployed machine learning (ML) workloads serviced from cloud datacenters.
no code implementations • 25 Oct 2020 • Yujeong Choi, Yunseong Kim, Minsoo Rhu
In cloud ML inference systems, batching is an essential technique for increasing throughput, which helps optimize total-cost-of-ownership.
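A minimal sketch of the batching technique the abstract refers to: gather queued requests into one batch before invoking the model, waiting briefly for the first request and then taking whatever else has already arrived. The function name and the exact timeout policy are illustrative assumptions, not the paper's design:

```python
from queue import Queue, Empty

def collect_batch(request_queue, max_batch, timeout_s):
    """Drain up to max_batch requests from the queue, blocking at most
    timeout_s for the first one; later requests are taken only if already
    queued, so a lone request is never delayed by the full batch size."""
    batch = []
    try:
        batch.append(request_queue.get(timeout=timeout_s))
    except Empty:
        return batch                      # nothing arrived in time
    while len(batch) < max_batch:
        try:
            batch.append(request_queue.get_nowait())
        except Empty:
            break                         # queue drained; run a partial batch
    return batch
```

The tension this exposes is exactly the one batched inference systems manage: larger batches raise throughput but make the first request in the batch wait longer.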
no code implementations • 12 May 2020 • Ranggi Hwang, Taehun Kim, Youngeun Kwon, Minsoo Rhu
Personalized recommendations are the backbone machine learning (ML) algorithm powering several important application domains (e.g., ads, e-commerce) serviced from cloud datacenters.
no code implementations • 15 Nov 2019 • Bongjoon Hyun, Youngeun Kwon, Yujeong Choi, John Kim, Minsoo Rhu
To satisfy the compute and memory demands of deep neural networks, neural processing units (NPUs) are widely used to accelerate deep learning algorithms.
1 code implementation • 6 Sep 2019 • Yujeong Choi, Minsoo Rhu
To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests.
no code implementations • 8 Aug 2019 • Youngeun Kwon, Yunjae Lee, Minsoo Rhu
Recent studies from several hyperscalers pinpoint embedding layers as the most memory-intensive deep learning (DL) algorithm deployed in today's datacenters.
no code implementations • 18 Feb 2019 • Youngeun Kwon, Minsoo Rhu
As the models and datasets used to train deep learning (DL) models scale, system architects face new challenges, one of which is the memory capacity bottleneck: the limited physical memory inside the accelerator device constrains the algorithms that can be studied.
no code implementations • 1 Jun 2018 • Maohua Zhu, Jason Clemons, Jeff Pool, Minsoo Rhu, Stephen W. Keckler, Yuan Xie
Further, we can enforce structured sparsity in the gate gradients to make the LSTM backward pass up to 45% faster than the state-of-the-art dense approach and 168% faster than the state-of-the-art sparsifying method on modern GPUs.
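One way to read "structured sparsity in the gate gradients" is zeroing out entire rows of the gate-gradient matrix whose magnitude falls below a threshold, so that kernels can skip whole rows rather than scattered elements. The sketch below is a simplified NumPy illustration of that idea, not the paper's actual method:

```python
import numpy as np

def sparsify_gate_gradients(dG, threshold):
    """Row-structured sparsification (illustrative): zero every row of the
    gate-gradient matrix dG whose largest-magnitude entry is below the
    threshold, and return the per-row keep mask alongside the result."""
    row_max = np.abs(dG).max(axis=1)
    keep = row_max >= threshold           # True for rows worth computing
    return dG * keep[:, None], keep
```

The structure matters because a GPU kernel can cheaply skip contiguous rows, whereas unstructured (element-wise) sparsity leaves the memory access pattern irregular.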
no code implementations • 23 May 2017 • Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, William J. Dally
Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning.
no code implementations • 3 May 2017 • Minsoo Rhu, Mike O'Connor, Niladrish Chatterjee, Jeff Pool, Stephen W. Keckler
Popular deep learning frameworks require users to fine-tune their memory usage so that the training data of a deep neural network (DNN) fits within the GPU physical memory.
4 code implementations • 25 Feb 2016 • Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, Stephen W. Keckler
The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU.
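The paper's core idea (vDNN) is to offload feature maps to host DRAM once the forward pass no longer needs them on the GPU, then prefetch them back just before the backward pass consumes them. Below is a toy model of that bookkeeping only, with plain dicts standing in for device and host memory (all names hypothetical):

```python
class OffloadingAllocator:
    """Toy model of vDNN-style memory management. Feature maps live in
    'gpu' while the forward pass needs them, move to 'host' afterwards,
    and are prefetched back for the backward pass."""

    def __init__(self):
        self.gpu = {}    # layer name -> feature map resident on the GPU
        self.host = {}   # layer name -> feature map offloaded to host DRAM

    def save_output(self, layer, feature_map):
        # Forward pass just produced this layer's output on the GPU.
        self.gpu[layer] = feature_map

    def forward_done(self, layer):
        # The next layer has consumed this output; offload it to host memory.
        self.host[layer] = self.gpu.pop(layer)

    def prefetch_for_backward(self, layer):
        # Bring the feature map back before the backward pass reaches it.
        if layer not in self.gpu:
            self.gpu[layer] = self.host.pop(layer)
        return self.gpu[layer]
```

In the real system the offload and prefetch are overlapped with computation via asynchronous DMA transfers; this sketch only captures which memory each tensor occupies at each point, which is what frees the GPU to hold networks larger than its DRAM.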