1 code implementation • 8 May 2024 • Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, Alexey Tumanov
Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput.
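The combination of experimental profiling and predictive modeling described above can be sketched roughly as follows. This is an illustrative toy, not Vidur's actual code: the operator names, profiled numbers, and the choice of a simple linear model over token count are all assumptions made for the example.

```python
# Illustrative sketch (not Vidur's implementation): fit a simple linear
# model (runtime = a * tokens + b) to profiled per-operator runtimes,
# then sum the predictions to estimate end-to-end request latency.

def fit_linear(profile):
    """Least-squares fit of runtime = a * tokens + b from (tokens, runtime) points."""
    n = len(profile)
    sx = sum(t for t, _ in profile)
    sy = sum(r for _, r in profile)
    sxx = sum(t * t for t, _ in profile)
    sxy = sum(t * r for t, r in profile)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def estimate_request_latency(op_models, num_tokens):
    """Sum predicted per-operator runtimes for one forward pass."""
    return sum(a * num_tokens + b for a, b in op_models.values())

# Hypothetical profiled (tokens, runtime_ms) measurements for two operators.
profiled = {
    "attention": [(128, 1.0), (256, 1.9), (512, 3.8)],
    "mlp":       [(128, 0.8), (256, 1.6), (512, 3.2)],
}
models = {op: fit_linear(points) for op, points in profiled.items()}
latency_ms = estimate_request_latency(models, num_tokens=384)
throughput_tok_s = 384 / (latency_ms / 1000.0)
```

A real simulator would model many more factors (batching, parallelism strategy, memory pressure), but the core loop is the same: profile operators once, predict their runtimes for unseen configurations, and aggregate into end-to-end metrics.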
no code implementations • 7 May 2024 • Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
Thus, vAttention relieves attention kernel developers of the need to explicitly support paging, and avoids re-implementing memory management in the serving framework.
no code implementations • 4 Mar 2024 • Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee
However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency.
no code implementations • 31 Aug 2023 • Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee
SARATHI employs chunked-prefills, which split a prefill request into equal-sized chunks, and decode-maximal batching, which constructs each batch from a single prefill chunk and fills the remaining slots with decodes.
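The batch-construction policy described above can be sketched in a few lines. This is a simplified illustration of the idea, not SARATHI's code: the function name, the token-budget formulation, and the example numbers are assumptions made for the sketch.

```python
# Illustrative sketch (not SARATHI's implementation): build one
# "decode-maximal" batch -- a single fixed-size prefill chunk plus as
# many decode requests as fit in the remaining token budget.

def make_batch(prefill_tokens_left, decode_queue, chunk_size, token_budget):
    """Return (prefill_chunk, decodes) for the next iteration.

    prefill_tokens_left: tokens of the current prefill not yet processed
    decode_queue: request ids currently in the decode phase
    chunk_size: fixed number of prefill tokens processed per iteration
    token_budget: max tokens per iteration (each decode costs one token)
    """
    prefill_chunk = min(prefill_tokens_left, chunk_size)
    slots_for_decodes = max(token_budget - prefill_chunk, 0)
    decodes = decode_queue[:slots_for_decodes]
    return prefill_chunk, decodes

# A 1000-token prompt split into 256-token chunks, batched with decodes.
chunk, decodes = make_batch(
    prefill_tokens_left=1000,
    decode_queue=["r1", "r2", "r3", "r4"],
    chunk_size=256,
    token_budget=260,
)
```

Because every iteration carries at most one prefill chunk, decode requests piggyback on the prefill's compute without waiting behind a full-prompt prefill, which is what lets the scheduler trade off throughput against decode latency.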
no code implementations • 12 Oct 2021 • Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay Chidambaram
Unfortunately, these schedulers do not consider the impact of a job's sensitivity to allocation of CPU, memory, and storage resources.
1 code implementation • ICLR 2021 • Aashaka Shah, Chao-yuan Wu, Jayashree Mohan, Vijay Chidambaram, Philipp Krähenbühl
Deep learning is slowly, but steadily, hitting a memory bottleneck.
no code implementations • 14 Jul 2020 • Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, Vijay Chidambaram
We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation, on servers that are part of a large production cluster at Microsoft.
2 code implementations • 23 Sep 2019 • Se Kwon Lee, Jayashree Mohan, Sanidhya Kashyap, Taesoo Kim, Vijay Chidambaram
We present Recipe, a principled approach for converting concurrent DRAM indexes into crash-consistent indexes for persistent memory (PM).
Distributed, Parallel, and Cluster Computing • Databases • Data Structures and Algorithms