$V_kD$: Improving Knowledge Distillation using Orthogonal Projections

10 Mar 2024 · Roy Miles, Ismail Elezi, Jiankang Deng

Knowledge distillation is an effective method for training small and efficient deep learning models. However, the efficacy of a single method can degenerate when transferring to other tasks, modalities, or even other architectures. To address this limitation, we propose a novel constrained feature distillation method. This method is derived from a small set of core principles, which results in two emerging components: an orthogonal projection and a task-specific normalisation. Equipped with both of these components, our transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods. To further demonstrate the generality of our method, we apply it to object detection and image generation, obtaining consistent and substantial performance improvements over the state of the art. Code and models are publicly available: https://github.com/roymiles/vkd
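As a rough illustration of the two components named in the abstract, the sketch below projects student features into the teacher's feature space with a linear map constrained to be orthogonal, and standardises the (frozen) teacher features before computing a feature-matching loss. This is a minimal PyTorch sketch, not the authors' reference implementation: the class name `OrthogonalProjectionDistillation`, the use of `torch.nn.utils.parametrizations.orthogonal`, the LayerNorm-style standardisation, the smooth-L1 loss, and the example feature dimensions are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import parametrizations


class OrthogonalProjectionDistillation(nn.Module):
    """Illustrative sketch: orthogonally-constrained projection + normalised feature matching."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Linear projection from the student's feature space to the teacher's.
        proj = nn.Linear(student_dim, teacher_dim, bias=False)
        # Keep the projection weight (semi-)orthogonal via PyTorch's built-in parametrisation.
        self.proj = parametrizations.orthogonal(proj)
        # Standardise teacher features; LayerNorm without affine parameters is one simple choice (assumption).
        self.norm = nn.LayerNorm(teacher_dim, elementwise_affine=False)

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        z_s = self.proj(f_student)            # (batch, teacher_dim)
        z_t = self.norm(f_teacher.detach())   # teacher is frozen; no gradient flows back
        return F.smooth_l1_loss(z_s, z_t)     # loss choice is illustrative


# Example usage with assumed feature sizes (e.g. a DeiT-Ti student and a RegNetY-160 teacher).
loss_fn = OrthogonalProjectionDistillation(student_dim=192, teacher_dim=3024)
f_s, f_t = torch.randn(8, 192), torch.randn(8, 3024)
loss = loss_fn(f_s, f_t)
```

The orthogonality constraint means the projection cannot rescale or collapse directions of the student representation, so the distillation signal acts on the features themselves rather than being absorbed by the projection layer.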


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Knowledge Distillation | ImageNet | VkD (T: RegNetY-160, S: DeiT-S) | Top-1 accuracy (%) | 82.3 | #2 |
| | | | Model size | 22M | #8 |
| | | | CRD training setting | | #1 |
| Knowledge Distillation | ImageNet | VkD (T: RegNetY-160, S: DeiT-Ti) | Top-1 accuracy (%) | 79.2 | #5 |
| | | | Model size | 6M | #11 |
| | | | CRD training setting | | #1 |

Methods


No methods listed for this paper.