1 code implementation • 16 Aug 2023 • Vlad-Constantin Lungu-Stan, Dumitru-Clementin Cercel, Florin Pop
By adding classification heads at each level of the transformer and employing a cascading distillation process, we improve the balanced multi-class accuracy of the base model by 2.1%, while creating a range of models of various sizes but comparable performance.
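
The excerpt does not spell out how the cascade is wired, so the following is only a minimal sketch under one plausible reading: each level's classification head acts as the student of the head one level deeper, and the loss is a temperature-softened KL divergence. The function names (`softmax`, `kl_divergence`, `cascading_distillation_losses`), the temperature value, and the adjacent-level pairing are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cascading_distillation_losses(per_level_logits, temperature=2.0):
    """Return one distillation loss per adjacent pair of levels.

    per_level_logits[i] holds the logits of the classification head
    attached to transformer level i, shallowest first.
    Assumption (not from the source): the head at level i is distilled
    from the head at level i + 1.
    """
    losses = []
    for student, teacher in zip(per_level_logits[:-1], per_level_logits[1:]):
        p_teacher = softmax(teacher, temperature)
        p_student = softmax(student, temperature)
        losses.append(kl_divergence(p_teacher, p_student))
    return losses


# Hypothetical logits for a 3-class problem from heads at three levels.
example_logits = [
    [0.2, 0.1, 0.0],  # shallow head, nearly uniform
    [1.0, 0.2, -0.5],  # middle head
    [3.0, 0.1, -1.0],  # deepest head, most confident
]
example_losses = cascading_distillation_losses(example_logits)
```

Under this reading, truncating the network after any level leaves a smaller model with its own trained head, which is one way a single training run could yield models of several sizes.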