Information Theoretic Representation Distillation

1 Dec 2021  ·  Roy Miles, Adrian Lopez Rodriguez, Krystian Mikolajczyk

Despite the empirical success of knowledge distillation, current state-of-the-art methods are computationally expensive to train, which makes them difficult to adopt in practice. To address this problem, we introduce two distinct complementary losses inspired by a cheap entropy-like estimator. These losses aim to maximise the correlation and mutual information between the student and teacher representations. Our method incurs significantly lower training overhead than other approaches and achieves performance competitive with the state of the art on knowledge distillation and cross-model transfer tasks. We further demonstrate the effectiveness of our method on a binary distillation task, where it sets a new state of the art for binary quantisation and approaches the performance of a full-precision model. Code: www.github.com/roymiles/ITRD
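The abstract does not spell out the loss definitions, so the following is only a minimal PyTorch sketch of what a correlation-maximising distillation loss between student and teacher features can look like. The function name `correlation_distillation_loss`, the per-dimension standardisation, and the squared-distance-to-one objective are illustrative assumptions, not the paper's ITRD losses; refer to the code link above for the official implementation.

```python
# Hypothetical sketch of a correlation-style distillation loss between
# student and teacher representations. Not the exact ITRD formulation;
# names and constants are illustrative only.
import torch


def correlation_distillation_loss(z_s: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    """Encourage high per-dimension correlation between student features z_s
    and teacher features z_t, both of shape (batch, dim). Assumes the student
    features have already been projected to the teacher's dimensionality."""
    # Standardise each feature dimension over the batch.
    z_s = (z_s - z_s.mean(dim=0)) / (z_s.std(dim=0) + 1e-6)
    z_t = (z_t - z_t.mean(dim=0)) / (z_t.std(dim=0) + 1e-6)
    n = z_s.size(0)
    # Cross-correlation matrix between student and teacher feature dimensions.
    c = (z_s.T @ z_t) / n
    # Push the diagonal (matched-dimension correlations) towards 1.
    return ((torch.diagonal(c) - 1.0) ** 2).mean()


# Usage example with random features standing in for network activations.
if __name__ == "__main__":
    z_student = torch.randn(64, 128, requires_grad=True)
    z_teacher = torch.randn(64, 128)
    loss = correlation_distillation_loss(z_student, z_teacher)
    loss.backward()
    print(loss.item())
```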

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Classification with Binary Weight Network | CIFAR-10 | ResNet-18 | Top-1 | 94.1 | #4 |
| Knowledge Distillation | CIFAR-100 | resnet8x4 (T: resnet32x4, S: resnet8x4) | Top-1 Accuracy (%) | 76.68 | #7 |
| Knowledge Distillation | CIFAR-100 | resnet110 (T: resnet110, S: resnet20) | Top-1 Accuracy (%) | 71.99 | #21 |
| Knowledge Distillation | CIFAR-100 | vgg8 (T: vgg13, S: vgg8) | Top-1 Accuracy (%) | 74.93 | #14 |
| Knowledge Distillation | ImageNet | ITRD (T: ResNet-34, S: ResNet-18) | Top-1 accuracy % | 71.68 | #30 |
| Knowledge Distillation | ImageNet | ITRD (T: ResNet-34, S: ResNet-18) | model size | 11.69M | #10 |
| Knowledge Distillation | ImageNet | ITRD (T: ResNet-34, S: ResNet-18) | CRD training setting | – | #1 |
| Question Answering | SQuAD1.1 | BERT - 3 Layers | EM | 77.7 | #97 |
| Question Answering | SQuAD1.1 | BERT - 3 Layers | F1 | 85.8 | #89 |
| Question Answering | SQuAD1.1 | BERT - 6 Layers | EM | 81.5 | #50 |
| Question Answering | SQuAD1.1 | BERT - 6 Layers | F1 | 88.5 | #52 |