TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Knowledge Distillation	ImageNet	SRD (T: ResNet-34 S:ResNet-18)	Top-1 accuracy %	71.87	# 27
Knowledge Distillation	ImageNet	SRD (T: ResNet-34 S:ResNet-18)	CRD training setting	✓	# 1
Knowledge Distillation	ImageNet	SRD (T:RegNety 160 S:DeiT-S)	Top-1 accuracy %	82.1	# 4
Knowledge Distillation	ImageNet	SRD (T:RegNety 160 S:DeiT-S)	model size	22M	# 8
Knowledge Distillation	ImageNet	SRD (T:RegNety 160 S:DeiT-S)	CRD training setting	✘	# 1
Knowledge Distillation	ImageNet	SRD (T:RegNety 160 S:DeiT-Ti)	Top-1 accuracy %	77.2	# 10
Knowledge Distillation	ImageNet	SRD (T:RegNety 160 S:DeiT-Ti)	model size	6M	# 11
Knowledge Distillation	ImageNet	SRD (T:RegNety 160 S:DeiT-Ti)	CRD training setting	✘	# 1

Understanding the Role of the Projector in Knowledge Distillation

20 Mar 2023 · Roy Miles, Krystian Mikolajczyk

In this paper, we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. In doing so, we verify three important design decisions, namely the normalisation, soft maximum function, and projection layers, as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data-efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet. Code and models are publicly available.
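
The three ingredients named in the abstract (a projector on the student features, normalisation of the representations, and a soft maximum over the matching error) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the exact form of each component, so the sketch assumes a bias-free linear projector, per-dimension batch standardisation without learned parameters, and log-sum-exp as the soft maximum. All function names are hypothetical, and plain Python is used only to keep the arithmetic visible; a real training loop would rely on a framework with automatic differentiation.

```python
import math


def project(features, weights):
    """Assumed linear projector: maps student features (N x d_s) to the teacher width d_t."""
    d_t = len(weights[0])
    return [
        [sum(x[k] * weights[k][j] for k in range(len(x))) for j in range(d_t)]
        for x in features
    ]


def batch_normalise(features, eps=1e-5):
    """Standardise each feature dimension across the batch (no learned scale or shift)."""
    n = len(features)
    dims = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in features) / n for j in range(dims)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in features
    ]


def soft_maximum(values):
    """Log-sum-exp, a smooth upper bound on max(values), computed stably."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def distillation_loss(student_features, teacher_features, projector_weights):
    """Soft maximum of the absolute errors between normalised student and teacher features."""
    z_s = batch_normalise(project(student_features, projector_weights))
    z_t = batch_normalise(teacher_features)
    errors = [
        abs(a - b)
        for row_s, row_t in zip(z_s, z_t)
        for a, b in zip(row_s, row_t)
    ]
    return soft_maximum(errors)


if __name__ == "__main__":
    # Toy batch: 3 examples, student width 2, teacher width 3 (illustrative values only).
    student = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
    teacher = [[1.0, 0.0, -1.0], [0.2, 0.9, 0.4], [-0.7, 0.3, 1.1]]
    projector = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.5]]
    print(distillation_loss(student, teacher, projector))
```

In this sketch the loss is dominated by the largest per-element error, but every element still contributes, which is the general behaviour of a soft maximum. Because of the log-sum-exp form assumed here, a perfect match gives the logarithm of the number of compared elements, not zero.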

Code

roymiles/simple-recipe-distillation official

yoshitomo-matsubara/torchdistill

Hazqeel09/ellzaf_ml

roymiles/vkd

Tasks

Image Classification

Knowledge Distillation

Metric Learning

Object Detection

Datasets

ImageNet

MS COCO

CIFAR-100

ImageNet-1K

Results from the Paper

Ranked #4 on Knowledge Distillation on ImageNet

Methods

Knowledge Distillation
