Search Results for author: Minsoo Rhu

Found 20 papers, 3 papers with code

vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training

no code implementations27 Nov 2023 Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, YongDeok Kim, Minsoo Rhu

As large language models (LLMs) become widespread in various application domains, a critical challenge facing the AI community is how to train these large AI models in a cost-effective manner.

Language Modelling Large Language Model
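To illustrate what such a simulator estimates, here is a back-of-the-envelope sketch (not vTrain itself) that costs out a training run using the common ~6 x parameters x tokens FLOPs approximation for transformer training; every input value below is an assumption, not a number from the paper.

    # Back-of-the-envelope LLM training cost estimate (NOT vTrain).
    # All inputs are illustrative assumptions.

    def training_cost_usd(params, tokens, num_gpus, peak_flops, mfu,
                          usd_per_gpu_hour):
        total_flops = 6.0 * params * tokens          # forward + backward
        cluster_flops = num_gpus * peak_flops * mfu  # sustained throughput
        hours = total_flops / cluster_flops / 3600.0
        return hours, hours * num_gpus * usd_per_gpu_hour

    hours, cost = training_cost_usd(
        params=7e9, tokens=1e12,           # 7B model, 1T tokens (assumed)
        num_gpus=512, peak_flops=312e12,   # A100 BF16 peak (public spec)
        mfu=0.4,                           # assumed model FLOPs utilization
        usd_per_gpu_hour=2.0)              # assumed price
    print(f"~{hours:,.0f} wall-clock hours, ~${cost:,.0f}")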

Hera: A Heterogeneity-Aware Multi-Tenant Inference Server for Personalized Recommendations

no code implementations23 Feb 2023 Yujeong Choi, John Kim, Minsoo Rhu

While providing low latency is a fundamental requirement in deploying recommendation services, achieving high resource utilization is also crucial in cost-effectively maintaining the datacenter.

GPU-based Private Information Retrieval for On-Device Machine Learning Inference

1 code implementation26 Jan 2023 Maximilian Lam, Jeff Johnson, Wenjie Xiong, Kiwan Maeng, Udit Gupta, Yang Li, Liangzhen Lai, Ilias Leontiadis, Minsoo Rhu, Hsien-Hsin S. Lee, Vijay Janapa Reddi, Gu-Yeon Wei, David Brooks, G. Edward Suh

Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second -- a >100x throughput improvement over a CPU-based baseline -- while maintaining model accuracy.

Information Retrieval Language Modelling +1

DiVa: An Accelerator for Differentially Private Machine Learning

no code implementations26 Aug 2022 Beomsik Park, Ranggi Hwang, Dongho Yoon, Yoonhyuk Choi, Minsoo Rhu

The widespread deployment of machine learning (ML) is raising serious concerns about protecting the privacy of users who contributed to the collection of training data.
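DiVa accelerates differentially private training, whose core per-step computation in DP-SGD is per-example gradient clipping followed by calibrated Gaussian noise. Below is a minimal NumPy sketch of that step; the clip_norm and sigma values are illustrative, not from the paper.

    # Per-example clip-and-noise step at the heart of DP-SGD -- the kind
    # of computation DiVa targets. Hyperparameters are illustrative.
    import numpy as np

    def dp_sgd_step(per_example_grads, clip_norm=1.0, sigma=1.0,
                    rng=np.random.default_rng(0)):
        # Clip each example's gradient to bound its individual influence.
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
        # Sum, then add Gaussian noise calibrated to the clipping bound.
        noisy = clipped.sum(axis=0) + rng.normal(0.0, sigma * clip_norm,
                                                 clipped.shape[1])
        return noisy / len(per_example_grads)

    grads = np.random.default_rng(1).normal(size=(32, 10))  # 32 examples, 10 params
    update = dp_sgd_step(grads)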

Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards

no code implementations10 May 2022 Youngeun Kwon, Minsoo Rhu

Prior work proposed to cache frequently accessed embeddings inside GPU memory as a means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such a cache design.

Recommendation Systems
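The cache design this paper critiques can be sketched generically: hot embedding rows live in a fixed-capacity LRU structure (standing in for GPU memory), and misses fall back to the full host-side table. Names and sizes below are illustrative.

    # Generic sketch of a GPU-side embedding cache: LRU over hot rows,
    # with misses generating "CPU memory" traffic (a plain array here).
    from collections import OrderedDict
    import numpy as np

    class EmbeddingCache:
        def __init__(self, table, capacity=1024):
            self.table = table              # full table in host memory
            self.cache = OrderedDict()      # row_id -> vector, LRU order
            self.capacity = capacity

        def lookup(self, row_id):
            if row_id in self.cache:
                self.cache.move_to_end(row_id)  # refresh LRU position
                return self.cache[row_id]
            vec = self.table[row_id]            # miss: host-memory traffic
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[row_id] = vec
            return vec

    table = np.random.rand(100_000, 64)
    cache = EmbeddingCache(table)
    emb = cache.lookup(42)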

SmartSAGE: Training Large-scale Graph Neural Networks using In-Storage Processing Architectures

no code implementations10 May 2022 Yunjae Lee, Jinha Chung, Minsoo Rhu

Our work demonstrates that an ISP-based large-scale GNN training system can achieve both high-capacity storage and high performance, opening up opportunities for ML practitioners to train GNNs on large datasets without being hampered by the physical limitations of main memory size.
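A toy software analogue of the capacity argument: keep node features on storage (np.memmap standing in for the SSD) and pull only each sampled minibatch's rows into DRAM, rather than loading the whole graph up front. The file name and shapes are made up for illustration.

    # Storage-resident features, DRAM-resident minibatches (toy analogue).
    import numpy as np

    N, D = 100_000, 64
    feats = np.memmap("node_features.bin", dtype=np.float32,
                      mode="w+", shape=(N, D))   # lives on storage

    rng = np.random.default_rng(0)
    batch_ids = rng.integers(0, N, size=1024)    # e.g. sampled neighbors
    batch = np.asarray(feats[batch_ids])         # only these rows hit DRAM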

GROW: A Row-Stationary Sparse-Dense GEMM Accelerator for Memory-Efficient Graph Convolutional Neural Networks

no code implementations1 Mar 2022 Ranggi Hwang, Minhoo Kang, Jiwon Lee, Dongyun Kam, Youngjoo Lee, Minsoo Rhu

Graph convolutional neural networks (GCNs) have emerged as a key technology in various application domains where the input data is relational.
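GROW architects its accelerator around a row-stationary (Gustavson-style) sparse-dense GEMM dataflow: each sparse row of A scales and accumulates rows of the dense B. The sketch below is a plain-Python software analogue over CSR arrays, not the accelerator's pipeline.

    # Row-stationary (Gustavson) sparse-dense GEMM over CSR arrays.
    import numpy as np

    def row_stationary_spmm(indptr, indices, data, B):
        out = np.zeros((len(indptr) - 1, B.shape[1]))
        for i in range(len(indptr) - 1):           # one sparse row at a time
            for k in range(indptr[i], indptr[i + 1]):
                out[i] += data[k] * B[indices[k]]  # accumulate scaled B rows
        return out

    # Tiny example: 2x3 sparse A (CSR) times 3x2 dense B.
    indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
    B = np.arange(6, dtype=float).reshape(3, 2)
    print(row_stationary_spmm(indptr, indices, data, B))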

PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers

no code implementations27 Feb 2022 Yunseong Kim, Yujeong Choi, Minsoo Rhu

However, maximizing server utilization and system throughput is also crucial for ML service providers as it helps lower the total-cost-of-ownership.

Scheduling

Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training

no code implementations25 Oct 2020 Youngeun Kwon, Yunjae Lee, Minsoo Rhu

Personalized recommendations are one of the most widely deployed machine learning (ML) workloads serviced from cloud datacenters.

LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference

no code implementations25 Oct 2020 Yujeong Choi, Yunseong Kim, Minsoo Rhu

In cloud ML inference systems, batching is an essential technique to increase throughput, which helps optimize total-cost-of-ownership.

BIG-bench Machine Learning Scheduling
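A toy sketch of the general SLA-aware batching idea follows: hold requests to grow the batch, but dispatch as soon as the oldest queued request's slack is spent. This simplification batches whole requests, whereas the paper's system works at finer granularity; all parameters are illustrative.

    # Toy SLA-aware batching loop; `serve` is the caller's batched
    # inference function, `requests` a queue of incoming requests.
    import queue, time

    def batching_loop(requests: queue.Queue, serve, max_batch=8, sla_s=0.050):
        batch, t_oldest = [], None
        while True:
            timeout = None if not batch else max(
                0.0, sla_s - (time.time() - t_oldest))
            try:
                batch.append(requests.get(timeout=timeout))
                if len(batch) == 1:
                    t_oldest = time.time()   # clock starts with first request
            except queue.Empty:
                pass                         # slack exhausted; fall through
            if batch and (len(batch) == max_batch
                          or time.time() - t_oldest >= sla_s):
                serve(batch)                 # single batched inference call
                batch, t_oldest = [], None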

Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations

no code implementations12 May 2020 Ranggi Hwang, Taehun Kim, Youngeun Kwon, Minsoo Rhu

Personalized recommendations are the backbone machine learning (ML) algorithm that powers several important application domains (e.g., ads, e-commerce, etc.) serviced from cloud datacenters.

NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units

no code implementations15 Nov 2019 Bongjoon Hyun, Youngeun Kwon, Yujeong Choi, John Kim, Minsoo Rhu

To satisfy the compute and memory demands of deep neural networks, neural processing units (NPUs) are widely utilized for accelerating deep learning algorithms.

Management Translation

PREMA: A Predictive Multi-task Scheduling Algorithm For Preemptible Neural Processing Units

1 code implementation6 Sep 2019 Yujeong Choi, Minsoo Rhu

To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests.

Scheduling
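A hedged sketch of predictive, preemption-friendly task selection in this spirit: among ready DNN tasks, favor those that have waited long relative to their predicted runtime, weighted by priority. The scoring rule below is illustrative, not the paper's token mechanism.

    # Pick the next DNN task to run given predicted runtimes (sketch).
    from dataclasses import dataclass

    @dataclass
    class Task:
        name: str
        priority: int               # larger = more important
        arrival_s: float            # when the task became ready
        predicted_runtime_s: float  # from a latency predictor

    def pick_next(ready, now_s):
        # Normalizing wait by predicted runtime lets short, long-waiting
        # jobs overtake, so low-priority tasks cannot starve indefinitely.
        return max(ready, key=lambda t: t.priority * (now_s - t.arrival_s)
                                         / t.predicted_runtime_s)

    ready = [Task("bert", 3, arrival_s=0.9, predicted_runtime_s=0.020),
             Task("resnet", 1, arrival_s=0.4, predicted_runtime_s=0.004)]
    nxt = pick_next(ready, now_s=1.0)       # -> resnet (short, waited long)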

TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

no code implementations8 Aug 2019 Youngeun Kwon, Yunjae Lee, Minsoo Rhu

Recent studies from several hyperscalers point to embedding layers as the most memory-intensive deep learning (DL) algorithm being deployed in today's datacenters.

Recommendation Systems
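The memory-bound primitive TensorDIMM moves near memory is the embedding gather-and-reduce: fetching sparse rows of a large table and pooling them into one vector. A NumPy stand-in (table shape and IDs are illustrative):

    # Gather embedding rows, then element-wise reduce into one vector.
    import numpy as np

    def embedding_gather_reduce(table, ids):
        return table[ids].sum(axis=0)

    table = np.random.rand(1_000_000, 64).astype(np.float32)  # 1M x 64 table
    pooled = embedding_gather_reduce(table, ids=[3, 17, 42, 99])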

Beyond the Memory Wall: A Case for Memory-centric HPC System for Deep Learning

no code implementations18 Feb 2019 Youngeun Kwon, Minsoo Rhu

As the models and the datasets to train deep learning (DL) models scale, system architects are faced with new challenges, one of which is the memory capacity bottleneck, where the limited physical memory inside the accelerator device constrains the algorithms that can be studied.

Structurally Sparsified Backward Propagation for Faster Long Short-Term Memory Training

no code implementations1 Jun 2018 Maohua Zhu, Jason Clemons, Jeff Pool, Minsoo Rhu, Stephen W. Keckler, Yuan Xie

Further, we can enforce structured sparsity in the gate gradients to make the LSTM backward pass up to 45% faster than the state-of-the-art dense approach and 168% faster than the state-of-the-art sparsifying method on modern GPUs.
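A hedged sketch of inducing structured sparsity in the gate gradients: prune whole columns by magnitude so the surviving non-zeros form hardware-friendly blocks rather than scattered entries. The keep-ratio policy below is illustrative, not the paper's thresholding scheme.

    # Zero out low-energy columns of the LSTM gate-gradient matrix.
    import numpy as np

    def sparsify_gate_grads(dgates, keep_ratio=0.5):
        col_norms = np.linalg.norm(dgates, axis=0)
        k = max(1, int(keep_ratio * dgates.shape[1]))
        keep = np.argsort(col_norms)[-k:]       # highest-magnitude columns
        mask = np.zeros(dgates.shape[1], dtype=bool)
        mask[keep] = True
        return dgates * mask                    # structured (column) sparsity

    dgates = np.random.randn(128, 4 * 256)      # (batch, 4*hidden) gate grads
    sparse = sparsify_gate_grads(dgates)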

Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks

no code implementations3 May 2017 Minsoo Rhu, Mike O'Connor, Niladrish Chatterjee, Jeff Pool, Stephen W. Keckler

Popular deep learning frameworks require users to fine-tune their memory usage so that the training data of a deep neural network (DNN) fits within the GPU physical memory.
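The style of compression such a DMA engine can apply to sparse (post-ReLU) activations is zero-value compression: a presence bitmap plus the packed non-zero values. A minimal NumPy sketch, assuming 1-D activations:

    # Zero-value compression/decompression for sparse activations.
    import numpy as np

    def zvc_compress(act):
        mask = act != 0
        return np.packbits(mask), act[mask]     # bitmap + packed nonzeros

    def zvc_decompress(bits, nonzeros, n):
        mask = np.unpackbits(bits)[:n].astype(bool)
        out = np.zeros(n, dtype=nonzeros.dtype)
        out[mask] = nonzeros
        return out

    act = np.maximum(np.random.randn(1024).astype(np.float32), 0)  # ReLU out
    bits, nz = zvc_compress(act)
    assert np.array_equal(zvc_decompress(bits, nz, act.size), act)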

vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design

4 code implementations25 Feb 2016 Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, Stephen W. Keckler

The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU.

BIG-bench Machine Learning Efficient Neural Network
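vDNN's core idea, sketched minimally below with a PyTorch tensor API: spill each layer's feature maps to host memory during the forward pass and fetch them back before that layer's backward pass. The real system overlaps these copies with compute on separate CUDA streams; this sketch shows only the data movement.

    # Offload/prefetch of per-layer feature maps (data movement only).
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    stash = {}

    def offload(layer_id, acts):
        # Forward: spill feature maps so the GPU copy can be freed.
        stash[layer_id] = acts.to("cpu", non_blocking=True)

    def prefetch(layer_id):
        # Backward: bring the feature maps back before they are needed.
        return stash.pop(layer_id).to(device, non_blocking=True)

    x = torch.randn(64, 256, 32, 32, device=device)
    offload("conv3", x)
    y = prefetch("conv3")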
