Knowledge Distillation from A Stronger Teacher

21 May 2022  ·  Tao Huang, Shan You, Fei Wang, Chen Qian, Chang Xu ·

Unlike existing knowledge distillation methods, which focus on baseline settings where the teacher models and training strategies are not as strong and competitive as state-of-the-art approaches, this paper presents a method dubbed DIST to distill better from a stronger teacher. We empirically find that the discrepancy between the predictions of the student and those of a stronger teacher tends to be considerably more severe. As a result, the exact match of predictions enforced by KL divergence disturbs training and makes existing methods perform poorly. In this paper, we show that simply preserving the relations between the predictions of the teacher and the student suffices, and we propose a correlation-based loss to explicitly capture the intrinsic inter-class relations from the teacher. Moreover, considering that different instances have different semantic similarities to each class, we also extend this relational match to the intra-class level. Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures, model sizes, and training strategies, and consistently achieves state-of-the-art performance on image classification, object detection, and semantic segmentation tasks. Code is available at: https://github.com/hunto/DIST_KD .
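
As a rough illustration of the correlation-based objective described above, the sketch below matches teacher and student predictions with a Pearson correlation computed across classes (inter-class) and across the batch (intra-class). This is a minimal PyTorch sketch, not the official implementation: the function names, the loss weights `beta` and `gamma`, and the temperature `tau` are illustrative assumptions; see the linked repository for the exact formulation and hyperparameters.

```python
import torch
import torch.nn.functional as F

def pearson_correlation(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Row-wise Pearson correlation between two 2-D tensors of identical shape."""
    a = a - a.mean(dim=-1, keepdim=True)
    b = b - b.mean(dim=-1, keepdim=True)
    a = a / (a.norm(dim=-1, keepdim=True) + eps)
    b = b / (b.norm(dim=-1, keepdim=True) + eps)
    return (a * b).sum(dim=-1)

def dist_like_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   beta: float = 1.0,   # weight of the inter-class term (assumed value)
                   gamma: float = 1.0,  # weight of the intra-class term (assumed value)
                   tau: float = 1.0     # softmax temperature (assumed value)
                   ) -> torch.Tensor:
    """Correlation-based relational matching between teacher and student predictions."""
    p_s = F.softmax(student_logits / tau, dim=-1)  # (batch, num_classes)
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    # Inter-class: for each instance, match the relation among class probabilities.
    inter = 1.0 - pearson_correlation(p_s, p_t).mean()
    # Intra-class: for each class, match the relation among instances in the batch.
    intra = 1.0 - pearson_correlation(p_s.t(), p_t.t()).mean()
    return beta * inter + gamma * intra
```

In a training loop, a term like this would typically be added to the usual cross-entropy loss on the ground-truth labels rather than replacing it.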


Results from the Paper


Ranked #2 on Knowledge Distillation on ImageNet (using extra training data)

Task | Dataset | Model | Metric Name | Metric Value | Global Rank
--- | --- | --- | --- | --- | ---
Knowledge Distillation | CIFAR-100 | resnet8x4 (T: resnet32x4, S: resnet8x4) | Top-1 Accuracy (%) | 76.31 | #8
Knowledge Distillation | ImageNet | DIST (T: ResNet-34, S: ResNet-18) | Top-1 accuracy (%) | 72.07 | #23
Knowledge Distillation | ImageNet | DIST (T: ResNet-34, S: ResNet-18) | CRD training setting | | #1
Knowledge Distillation | ImageNet | DIST (T: Swin-L, S: Swin-T) | Top-1 accuracy (%) | 82.3 | #2
Knowledge Distillation | ImageNet | DIST (T: Swin-L, S: Swin-T) | Model size | 29M | #6
Knowledge Distillation | ImageNet | DIST (T: Swin-L, S: Swin-T) | CRD training setting | | #1
