Search Results for author: Siddhant Ray

Found 3 papers, 2 papers with code

Chatterbox: Robust Transport for LLM Token Streaming under Unstable Network

no code implementations • 23 Jan 2024 • Hanchen Li, YuHan Liu, Yihua Cheng, Siddhant Ray, Kuntai Du, Junchen Jiang

To render each generated token in real time, the LLM server generates response tokens one by one and streams each token (or small group of tokens) over the network to the user as soon as it is generated, a process we refer to as LLM token streaming.
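The streaming pattern described above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical names (`generate_tokens`, `stream_response`), not the Chatterbox implementation: the server pushes each token to the user immediately rather than waiting for the full response.

```python
# Minimal sketch of LLM token streaming (hypothetical names, not the
# paper's actual system): the server emits tokens one by one and the
# client can render each as soon as it arrives.

def generate_tokens(prompt):
    # Stand-in for an LLM decoding loop: yield one token at a time.
    for token in ["Hello", ",", " world", "!"]:
        yield token

def stream_response(prompt, send):
    # Push each token (or small group of tokens) to the user right
    # after it is generated, instead of buffering the whole response.
    for token in generate_tokens(prompt):
        send(token)

received = []
stream_response("hi", received.append)
print("".join(received))  # → Hello, world!
```

Under an unstable network, the challenge the paper targets is that any delayed or lost `send` stalls rendering on the client side, which is why a robust transport layer matters for this workload.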

Chatbot

CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving

1 code implementation • 11 Oct 2023 • YuHan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, YuYang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang

Compared to recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x, while having negligible impact on LLM response quality as measured by accuracy or perplexity.

Language Modelling • Quantization

A new hope for network model generalization

1 code implementation • 12 Jul 2022 • Alexander Dietmüller, Siddhant Ray, Romain Jacob, Laurent Vanbever

Hence, for every new task, we design new models and train them on model-specific datasets that closely mimic the deployment environments.
