Measuring Massive Multitask Language Understanding

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
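
For concreteness, the evaluation the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's exact harness: it assumes each question is a dict with "question", "choices", and an integer "answer" index (as in the public MMLU release), and `answer_letter_logprob` is a hypothetical stand-in for whatever API returns a log-probability of a candidate answer letter given a prompt.

```python
# Minimal sketch of MMLU-style 5-shot, multiple-choice evaluation.
# The prompt layout roughly follows the paper's format; the model
# interface is a hypothetical callable, not a specific library API.
from typing import Callable, Dict, List

CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_example(example: Dict, include_answer: bool = True) -> str:
    """Render one question as 'Question ... A. ... D. ... Answer: X'."""
    lines = [example["question"]]
    for letter, choice in zip(CHOICE_LETTERS, example["choices"]):
        lines.append(f"{letter}. {choice}")
    answer = CHOICE_LETTERS[example["answer"]] if include_answer else ""
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)

def build_prompt(subject: str, dev_examples: List[Dict], test_example: Dict) -> str:
    """5-shot prompt: task description, five solved dev examples, then the test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_example(ex) for ex in dev_examples[:5])
    query = format_example(test_example, include_answer=False)
    return header + shots + "\n\n" + query + " "

def evaluate_task(
    subject: str,
    dev_examples: List[Dict],
    test_examples: List[Dict],
    answer_letter_logprob: Callable[[str, str], float],  # hypothetical model interface
) -> float:
    """Accuracy on one task: predict the answer letter the model scores highest."""
    correct = 0
    for ex in test_examples:
        prompt = build_prompt(subject, dev_examples, ex)
        scores = {letter: answer_letter_logprob(prompt, letter) for letter in CHOICE_LETTERS}
        predicted = max(scores, key=scores.get)
        correct += int(predicted == CHOICE_LETTERS[ex["answer"]])
    return correct / len(test_examples)
```

Running `evaluate_task` once per subject yields the 57 per-task accuracies that the benchmark aggregates.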


Datasets

MMLU


Results

Task: Multi-task Language Understanding
Dataset: MMLU
Metric: Average (%)

Model                         Average (%)   Global Rank
Mixtral-8x7B-Instruct-v0.1    20.0          #106
GPT-3 6.7B (5-shot)           24.9          #104
GPT-3 13B (5-shot)            26.0          #98
GPT-3 175B (5-shot)           43.9          #73
Random chance baseline        25.0          #103
GPT-3 175B (fine-tuned)       53.9          #60
GPT-3 2.7B (5-shot)           25.9          #100
GPT-2-XL 1.5B (fine-tuned)    32.4          #87
GPT-3 6.7B (fine-tuned)       43.2          #75
GPT-3 175B (0-shot)           37.7          #82
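
The single "Average (%)" score compresses the 57 per-task accuracies into one number. The sketch below shows the two natural ways to compute it; which one a given leaderboard reports (mean over tasks vs. mean over all questions) is an assumption here, and the function name is illustrative.

```python
# Sketch of aggregating per-task results into an "Average (%)" score.
# The two averages differ because tasks have different numbers of questions.
from typing import Dict, Tuple

def average_accuracy(per_task: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """per_task maps task name -> (num_correct, num_questions)."""
    macro = sum(c / n for c, n in per_task.values()) / len(per_task)  # unweighted mean over tasks
    micro = sum(c for c, _ in per_task.values()) / sum(n for _, n in per_task.values())  # mean over questions
    return {"macro_avg_pct": 100 * macro, "micro_avg_pct": 100 * micro}

# Example with made-up counts for two of the 57 tasks:
print(average_accuracy({"us_history": (45, 204), "college_physics": (26, 102)}))
```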
