SuperGLUE is a benchmark dataset designed to pose a more rigorous test of language understanding than GLUE. SuperGLUE has the same high-level motivation as GLUE: to provide a simple, hard-to-game measure of progress toward general-purpose language understanding technologies for English. SuperGLUE follows the basic design of GLUE: It consists of a public leaderboard built around eight language understanding tasks, drawing on existing data, accompanied by a single-number
performance metric, and an analysis toolkit. However, it improves upon GLUE in several ways:
- More challenging tasks: SuperGLUE retains the two hardest tasks in GLUE. The remaining tasks were identified from those submitted to an open call for task proposals and were selected based on difficulty for current NLP approaches.
- More diverse task formats: The task formats in GLUE are limited to sentence- and sentence-pair classification. The authors expand the set of task formats in SuperGLUE to include
coreference resolution and question answering (QA).
- Comprehensive human baselines: the authors include human performance estimates for all benchmark tasks, which verify that substantial headroom exists between a strong BERT-based baseline and human performance.
- Improved code support: SuperGLUE is distributed with a new, modular toolkit for work on pretraining, multi-task learning, and transfer learning in NLP, built around standard tools including PyTorch (Paszke et al., 2017) and AllenNLP (Gardner et al., 2017).
- Refined usage rules: The conditions for inclusion on the SuperGLUE leaderboard were revamped to ensure fair competition, an informative leaderboard, and full credit
assignment to data and task creators.