Search Results for author: Haihao Shen

Found 11 papers, 10 papers with code

Effective Quantization for Diffusion Models on CPUs

1 code implementation • 2 Nov 2023 • Hanwen Chang, Haihao Shen, Yiyang Cai, Xinyu Ye, Zhenzhong Xu, Wenhua Cheng, Kaokao Lv, Weiwei Zhang, Yintong Lu, Heng Guo

Diffusion models have gained popularity for generating images from textual descriptions.

Quantization

1,946

Paper
Code

Efficient LLM Inference on CPUs

2 code implementations • 1 Nov 2023 • Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng

Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks.

Llama Quantization

1,946

Paper
Code

TEQ: Trainable Equivalent Transformation for Quantization of LLMs

1 code implementation • 17 Oct 2023 • Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen

As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computationalast layer demands of these modern architectures while maintaining the accuracy.

Quantization

1,965

Paper
Code

Efficient Post-training Quantization with FP8 Formats

2 code implementations • 26 Sep 2023 • Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang

Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy.

Image Classification Language Modelling +3

1,965

Paper
Code

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

1 code implementation • 11 Sep 2023 • Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv

As the number of bits decreases, the quantization grid broadens, thus emphasizing the importance of up and down rounding.

Quantization

1,965

Paper
Code

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

1 code implementation • 28 Jun 2023 • Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang, Guy Boudoukh, Moshe Wasserblat

We apply our sparse accelerator on widely-used Transformer-based language models including Bert-Mini, DistilBERT, Bert-Base, and BERT-Large.

Model Compression

1,946

Paper
Code

QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

2 code implementations • 31 Oct 2022 • Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen

Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to any other efficient approaches per any computational budget on the SQuAD1. 1 dataset (up to x8. 8 speedup with <1% accuracy loss).

Computational Efficiency Knowledge Distillation +2

1,965

Paper
Code

Fast DistilBERT on CPUs

1 code implementation • 27 Oct 2022 • Haihao Shen, Ofir Zafrir, Bo Dong, Hengyu Meng, Xinyu Ye, Zhe Wang, Yi Ding, Hanwen Chang, Guy Boudoukh, Moshe Wasserblat

In this work, we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators.

Knowledge Distillation Model Compression +2

1,946

Paper
Code

Prune Once for All: Sparse Pre-Trained Language Models

2 code implementations • 10 Nov 2021 • Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat

We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss.

Ranked #2 on Natural Language Inference on MultiNLI Dev