Search Results for author: Varun Yerram

Found 1 paper, 0 papers with code

HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference

no code implementations • 14 Feb 2024 • Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli

Autoregressive decoding with generative Large Language Models (LLMs) on accelerators (GPUs/TPUs) is often memory-bound, with most of the time spent transferring model parameters from high-bandwidth memory (HBM) to cache.
