TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Machine Translation	IWSLT2015 Thai-English	Seq-KD + Seq-Inter + Word-KD	BLEU score	14.2	# 1
Machine Translation	WMT2014 English-German	Seq-KD + Seq-Inter + Word-KD	BLEU score	18.5	# 85
Machine Translation	WMT2014 English-German	Seq-KD + Seq-Inter + Word-KD	Hardware Burden	None	# 1
Machine Translation	WMT2014 English-German	Seq-KD + Seq-Inter + Word-KD	Operations per network pass	None	# 1


Sequence-Level Knowledge Distillation

EMNLP 2016 · Yoon Kim, Alexander M. Rush

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015), which have proven successful for reducing the size of neural models in other domains, to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.
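
The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It assumes word-level distillation means matching the student's per-word distribution to the teacher's, mixed with the usual negative log-likelihood on reference words, and that the simplest sequence-level variant retrains the student on the teacher's own decoded outputs. The function names, the `alpha` mixing weight, and the `teacher_decode` callable are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def word_kd_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """Word-level distillation loss for one sentence (illustrative only).

    student_logits, teacher_logits: (seq_len, vocab) float arrays
    targets: (seq_len,) integer array of reference word ids
    alpha: assumed mixing weight; alpha=1 uses only the teacher signal
    """
    log_p = log_softmax(student_logits)          # student log-probabilities
    q = np.exp(log_softmax(teacher_logits))      # teacher probabilities
    kd = -(q * log_p).sum(axis=-1).mean()        # cross-entropy against the teacher
    nll = -log_p[np.arange(len(targets)), targets].mean()  # standard NLL
    return alpha * kd + (1.0 - alpha) * nll

def seq_kd_corpus(sources, teacher_decode):
    # Sequence-level sketch: pair each source sentence with the teacher's
    # decoded translation (e.g. from beam search) and train the student on
    # these pairs. teacher_decode is a hypothetical callable.
    return [(src, teacher_decode(src)) for src in sources]
```

With `alpha=0` the loss reduces to ordinary NLL training, and with a one-hot teacher distribution the distillation term coincides with NLL on the teacher's chosen words.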


Code


harvardnlp/seq2seq-attn official

1,244

harvardnlp/nmt-android official

facebookresearch/stopes

238

ictnlp/Seq-NAT

xuanlinli17/autoregressive_inference


Tasks


Knowledge Distillation

Machine Translation

NMT

Translation

Datasets

WMT 2014

Results from the Paper


Ranked #1 on Machine Translation on IWSLT2015 Thai-English


Methods


Knowledge Distillation
