1 code implementation • 8 May 2024 • Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, Alexey Tumanov
Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput.
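The combination of experimental profiling and predictive modeling described above can be sketched roughly as follows. This is an illustrative toy, not Vidur's actual code: the operator names, profiled numbers, and the choice of a simple linear model over token count are all assumptions made for the example.

```python
# Illustrative sketch (not Vidur's implementation): fit a simple linear
# model (runtime = a * tokens + b) to profiled per-operator runtimes,
# then sum the predictions to estimate end-to-end request latency.

def fit_linear(profile):
    """Least-squares fit of runtime = a * tokens + b from (tokens, runtime) points."""
    n = len(profile)
    sx = sum(t for t, _ in profile)
    sy = sum(r for _, r in profile)
    sxx = sum(t * t for t, _ in profile)
    sxy = sum(t * r for t, r in profile)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def estimate_request_latency(op_models, num_tokens):
    """Sum predicted per-operator runtimes for one forward pass."""
    return sum(a * num_tokens + b for a, b in op_models.values())

# Hypothetical profiled (tokens, runtime_ms) measurements for two operators.
profiled = {
    "attention": [(128, 1.0), (256, 1.9), (512, 3.8)],
    "mlp":       [(128, 0.8), (256, 1.6), (512, 3.2)],
}
models = {op: fit_linear(points) for op, points in profiled.items()}
latency_ms = estimate_request_latency(models, num_tokens=384)
throughput_tok_s = 384 / (latency_ms / 1000.0)
```

A real simulator would model many more factors (batching, parallelism strategy, memory pressure), but the core loop is the same: profile operators once, predict their runtimes for unseen configurations, and aggregate into end-to-end metrics.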
no code implementations • 7 May 2024 • Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
Thus, vAttention relieves attention kernel developers of the need to explicitly support paging, and avoids re-implementing memory management in the serving framework.
no code implementations • 4 Mar 2024 • Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee
However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency.
no code implementations • 31 Aug 2023 • Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee
SARATHI employs chunked-prefills, which split a prefill request into equal-sized chunks, and decode-maximal batching, which constructs each batch from a single prefill chunk and fills the remaining slots with decodes.
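The batch-construction policy described above can be sketched in a few lines. This is a simplified illustration of the idea, not SARATHI's code: the function name, the token-budget formulation, and the example numbers are assumptions made for the sketch.

```python
# Illustrative sketch (not SARATHI's implementation): build one
# "decode-maximal" batch -- a single fixed-size prefill chunk plus as
# many decode requests as fit in the remaining token budget.

def make_batch(prefill_tokens_left, decode_queue, chunk_size, token_budget):
    """Return (prefill_chunk, decodes) for the next iteration.

    prefill_tokens_left: tokens of the current prefill not yet processed
    decode_queue: request ids currently in the decode phase
    chunk_size: fixed number of prefill tokens processed per iteration
    token_budget: max tokens per iteration (each decode costs one token)
    """
    prefill_chunk = min(prefill_tokens_left, chunk_size)
    slots_for_decodes = max(token_budget - prefill_chunk, 0)
    decodes = decode_queue[:slots_for_decodes]
    return prefill_chunk, decodes

# A 1000-token prompt split into 256-token chunks, batched with decodes.
chunk, decodes = make_batch(
    prefill_tokens_left=1000,
    decode_queue=["r1", "r2", "r3", "r4"],
    chunk_size=256,
    token_budget=260,
)
```

Because every iteration carries at most one prefill chunk, decode requests piggyback on the prefill's compute without waiting behind a full-prompt prefill, which is what lets the scheduler trade off throughput against decode latency.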
no code implementations • 12 Oct 2021 • Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay Chidambaram
Unfortunately, these schedulers do not consider the impact of a job's sensitivity to allocation of CPU, memory, and storage resources.
1 code implementation • ICLR 2021 • Aashaka Shah, Chao-yuan Wu, Jayashree Mohan, Vijay Chidambaram, Philipp Krähenbühl
Deep learning is slowly, but steadily, hitting a memory bottleneck.
no code implementations • 14 Jul 2020 • Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, Vijay Chidambaram
We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation, on servers that are part of a large production cluster at Microsoft.
2 code implementations • 23 Sep 2019 • Se Kwon Lee, Jayashree Mohan, Sanidhya Kashyap, Taesoo Kim, Vijay Chidambaram
We present Recipe, a principled approach for converting concurrent DRAM indexes into crash-consistent indexes for persistent memory (PM).
Distributed, Parallel, and Cluster Computing • Databases • Data Structures and Algorithms