2 code implementations • 4 Apr 2024 • Sean Farhat, Deming Chen
We observe that, when distilled on a task from a pre-trained teacher model, a small model can match or surpass the performance it would achieve if it were pre-trained and then finetuned on that task.
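To make "distilled from a teacher" concrete, here is a minimal sketch of one common distillation objective: the temperature-softened KL divergence between teacher and student outputs. This is an assumption for illustration only. The abstract does not say which distillation loss the authors use, and the names `softmax`, `distillation_loss`, and `temperature` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic soft-label distillation; assumed here for illustration,
    not taken from the source.
    """
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    print(distillation_loss([1.0, 2.0, 3.0], teacher))  # identical logits
    print(distillation_loss([3.0, 2.0, 1.0], teacher))  # mismatched logits
```

In a training loop the student would minimize this loss on task data while the teacher stays frozen; the loss is zero when the student's logits reproduce the teacher's and grows as the two distributions diverge.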