Search Results for author: Saleh Ashkboos

Found 10 papers, 6 papers with code

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

1 code implementation • 30 Mar 2024 • Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman

We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits.

Quantization

136 stars


SliceGPT: Compress Large Language Models by Deleting Rows and Columns

1 code implementation • 26 Jan 2024 • Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources.

289 stars


QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

1 code implementation • 13 Oct 2023 • Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy.

Computational Efficiency, Quantization

154 stars


SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

1 code implementation • 5 Jun 2023 • Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities.

Language Modelling, Large Language Model, +1

512 stars


STen: Productive and Efficient Sparsity in PyTorch

no code implementations • 15 Apr 2023 • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Saleh Ashkboos, Torsten Hoefler

As deep learning models grow, sparsity is becoming an increasingly critical component of deep neural networks, enabling improved performance and reduced storage.


GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

11 code implementations • 31 Oct 2022 • Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

In this paper, we address this challenge and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient.

Language Modelling, Model Compression, +1

18,538 stars


ENS-10: A Dataset For Post-Processing Ensemble Weather Forecasts

1 code implementation • 29 Jun 2022 • Saleh Ashkboos, Langwen Huang, Nikoli Dryden, Tal Ben-Nun, Peter Dueben, Lukas Gianinazzi, Luca Kummer, Torsten Hoefler

We propose the ENS-10 prediction correction task for improving the forecast quality at a 48-hour lead time through ensemble post-processing.

Weather Forecasting


Motif Prediction with Graph Neural Networks

no code implementations • 26 May 2021 • Maciej Besta, Raphael Grob, Cesare Miglioli, Nicola Bernold, Grzegorz Kwasniewski, Gabriel Gjini, Raghavendra Kanakagiri, Saleh Ashkboos, Lukas Gianinazzi, Nikoli Dryden, Torsten Hoefler

We also successfully apply our architecture for predicting more arbitrary clusters and communities, illustrating its potential for graph mining beyond motif analysis.

Graph Mining, Link Prediction


New Bounds For Distributed Mean Estimation and Variance Reduction

no code implementations • ICLR 2021 • Peter Davies, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, Dan Alistarh

We provide a method of quantization which allows distributed mean estimation to be performed with solution quality dependent only on the distance between inputs, not on input norm, and show an analogous result for distributed variance reduction.

Distributed Optimization, Quantization


SparCML: High-Performance Sparse Communication for Machine Learning

no code implementations • 22 Feb 2018 • Cedric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, Torsten Hoefler

This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads.

BIG-bench Machine Learning, Blocking, +1

