How To Train Your (Compressed) Large Language Model

With the increase in the size of large language models (LLMs), we need compression methods that can reduce the model size while preserving the generality and zero-shot promptability of the model. This goal is more ambitious than the typical compression setup, which reduces the model's size at the expense of specializing it to a specific end-task. To study this, we develop a task-agnostic compression pipeline with a large-scale evaluation comprising language modeling perplexity and 12 zero-shot end-tasks. Our results show that simple layer-wise pruning followed by continued language model pretraining matches or outperforms three existing state-of-the-art baselines while being 1.5x more computationally efficient. However, unlike typical task-specialized compression, our best-compressed model significantly underperforms a similar-sized model trained from scratch. We posit the half-sized pretrained model as an upper bound for task-agnostic compression and call for future work to bridge this gap under a reasonable token budget. Our findings highlight the inadequacy of existing compression methods for LLMs and establish a requirement for new methods that preserve a model's generality and zero-shot promptability under compression. We release our code and evaluation setup to facilitate reproducibility and help iterate on method design.
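The abstract's pipeline of "layer-wise pruning followed by continued pretraining" can be illustrated with a minimal sketch. The code below is not the paper's implementation: it assumes a GPT-2-style checkpoint from Hugging Face `transformers`, and the keep-every-other-layer heuristic and the toy training step are placeholders for whatever layer-selection criterion and pretraining corpus the paper actually uses.

```python
# Minimal sketch: structured layer pruning of a GPT-2-style causal LM,
# followed by a placeholder continued-pretraining step. The keep-every-other-
# layer heuristic is illustrative only, not the paper's selection criterion.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any checkpoint whose decoder blocks live in model.transformer.h
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# --- Layer-wise pruning: keep a subset of transformer blocks ---------------
blocks = model.transformer.h                      # nn.ModuleList of decoder blocks
keep = list(range(0, len(blocks), 2))             # e.g. keep every other layer
model.transformer.h = nn.ModuleList(blocks[i] for i in keep)
model.config.n_layer = len(keep)                  # keep the config consistent

print(f"Pruned from {len(blocks)} to {len(keep)} layers "
      f"({sum(p.numel() for p in model.parameters()) / 1e6:.1f}M params)")

# --- Continued pretraining on the language-modeling objective --------------
# In practice this loop would run over a large pretraining corpus under a
# fixed token budget; a single toy step shows the shape of the update.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = tokenizer("Language modeling text goes here.", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])  # causal LM loss
outputs.loss.backward()
optimizer.step()
```

After pruning, the compressed model would be evaluated task-agnostically (perplexity plus zero-shot prompting) rather than fine-tuned on a single end-task, which is the distinction the abstract draws from typical task-specialized compression.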
