Distilling Knowledge from Pre-trained Language Models via Text Smoothing

8 May 2020Xing WuYibing LiuXiangyang ZhouDianhai Yu

This paper studies compressing pre-trained language models, like BERT (Devlin et al.,2019), via teacher-student knowledge distillation. Previous works usually force the student model to strictly mimic the smoothed labels predicted by the teacher BERT... (read more)

PDF Abstract

Code


No code implementations yet. Submit your code now

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods used in the Paper