Search Results for author: Adnan Hoque

Found 2 papers, 0 papers with code

TP-Aware Dequantization

no code implementations · 15 Jan 2024 · Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti

In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs).

Quantization

Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition

no code implementations · 5 Jan 2024 · Adnan Hoque, Less Wright, Chih-Chieh Yang, Mudhakar Srivatsa, Raghu Ganti

Our implementation shows speedups for the skinny matrix-matrix multiplications typical of foundation model inference workloads.
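As a rough illustration of the SplitK idea named in the title (a minimal NumPy sketch, not the paper's Triton kernel): the K reduction dimension of the matmul is split into chunks, each chunk yields a partial product, and the partials are summed. In an actual Triton kernel each chunk would be handled by a separate workgroup accumulating into the output with atomic adds, which improves occupancy when M is small (a "skinny" matmul, as in token-by-token decoding).

```python
import numpy as np

def splitk_matmul(A, B, split_k=4):
    """Sketch of SplitK work decomposition for C = A @ B.

    The K (reduction) dimension is split into `split_k` chunks; each
    chunk produces a partial product, and the partials are summed.
    In a real kernel each chunk maps to its own workgroup, recovering
    parallelism when M is too small to fill the GPU on its own.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and K % split_k == 0
    chunk = K // split_k
    C = np.zeros((M, N), dtype=np.float32)
    for s in range(split_k):
        ks = slice(s * chunk, (s + 1) * chunk)
        C += A[:, ks] @ B[ks, :]  # partial product for this K-chunk
    return C

# Skinny shape typical of decode-time inference: M=1, large K and N
A = np.random.rand(1, 4096).astype(np.float32)
B = np.random.rand(4096, 4096).astype(np.float32)
assert np.allclose(splitk_matmul(A, B), A @ B, rtol=1e-3)
```

The shapes and `split_k` value here are illustrative assumptions; the paper targets W4A16 (4-bit weights, 16-bit activations), which this float32 sketch does not model.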
