BERTweet: A pre-trained language model for English Tweets

EMNLP 2020 · Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen

We present BERTweet, the first public large-scale pre-trained language model for English Tweets. BERTweet has the same architecture as BERT-base (Devlin et al., 2019) and is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms the strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition, and text classification. We release BERTweet under the MIT License to facilitate future research and applications on Tweet data. Our BERTweet is available at https://github.com/VinAIResearch/BERTweet
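
Since the pre-trained model is publicly released, it can be loaded with the Hugging Face transformers library. The following is a minimal sketch, assuming the checkpoint is published on the Hub under the vinai/bertweet-base identifier used in the authors' repository; it extracts contextual features for a single Tweet.

```python
# Minimal sketch of extracting BERTweet features with Hugging Face transformers.
# Assumptions: the checkpoint is available as "vinai/bertweet-base" (the identifier
# used in the authors' repository) and torch/transformers are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModel.from_pretrained("vinai/bertweet-base")

tweet = "Loving the new BERTweet model for Tweet NLP tasks!"

# Tokenize and run a forward pass; the last hidden state gives one
# 768-dimensional vector per (sub)token, as in BERT-base.
inputs = tokenizer(tweet, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

features = outputs.last_hidden_state  # shape: (1, sequence_length, 768)
```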


Results from the Paper


Task                            Dataset    Model     Metric     Value  Global Rank
Part-Of-Speech Tagging          Ritter     BERTweet  Acc        90.1   # 4
Part-Of-Speech Tagging          Tweebank   BERTweet  Acc        95.2   # 2
Sentiment Analysis              TweetEval  BERTweet  Emoji      33.4   # 1
Sentiment Analysis              TweetEval  BERTweet  Emotion    79.3   # 2
Sentiment Analysis              TweetEval  BERTweet  Irony      82.1   # 1
Sentiment Analysis              TweetEval  BERTweet  Offensive  79.5   # 2
Sentiment Analysis              TweetEval  BERTweet  Sentiment  73.4   # 1
Sentiment Analysis              TweetEval  BERTweet  Stance     71.2   # 1
Sentiment Analysis              TweetEval  BERTweet  ALL        67.9   # 1
Named Entity Recognition (NER)  WNUT 2016  BERTweet  F1         52.1   # 7
Named Entity Recognition (NER)  WNUT 2017  BERTweet  F1         56.5   # 7
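
The classification numbers above come from fine-tuning BERTweet with a task-specific head. The sketch below shows one common setup via transformers' sequence-classification wrapper; it is not the authors' exact recipe, and the label set, toy data, and hyperparameters are placeholder assumptions.

```python
# Hedged sketch of fine-tuning BERTweet for Tweet classification. The number of
# labels, learning rate, and the tiny in-memory dataset are illustrative only.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=3  # e.g. negative / neutral / positive
)

texts = ["I love this!", "This is terrible...", "It is okay, I guess."]
labels = torch.tensor([2, 0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few toy epochs on the placeholder batch
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
```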
