Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP, showing how different pre-training objectives can be cast as one another and how interpolating between objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. We further introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. Extensive ablative experiments comparing multiple pre-training objectives show that our method pushes the Pareto frontier, outperforming T5- and GPT-like models across multiple diverse setups. Scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised fine-tuning NLP tasks. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On zero-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small-to-medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. We release Flax-based T5X checkpoints for UL2 20B and Flan-UL2 20B.
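To make the Mixture-of-Denoisers idea concrete, the sketch below builds (input, target) pairs for three denoising regimes in the spirit of the paper's R (regular span corruption), X (extreme corruption), and S (sequential/prefix-LM) denoisers, with a mode token prepended for mode switching. All configuration values, function names, and the single-span simplification are illustrative assumptions, not the paper's actual implementation:

```python
import random

# Illustrative denoiser configs (loosely inspired by UL2's R/S/X regimes;
# the real objective mixes many span lengths and corruption rates):
#   [R] regular: short spans, low corruption rate (T5-style)
#   [X] extreme: long spans / high corruption rate
#   [S] sequential: prefix language modeling (predict the suffix)
DENOISERS = {
    "[R]": {"mean_span": 3, "rate": 0.15},
    "[X]": {"mean_span": 32, "rate": 0.5},
    "[S]": None,
}

def corrupt(tokens, mode, rng):
    """Build (input, target) for one mode; uses a single corrupted span
    and a single sentinel token as a simplification."""
    if mode == "[S]":  # prefix LM: keep a random prefix, predict the rest
        pivot = rng.randrange(1, len(tokens))
        return [mode] + tokens[:pivot], tokens[pivot:]
    cfg = DENOISERS[mode]
    n_corrupt = max(1, int(len(tokens) * cfg["rate"]))
    span = min(cfg["mean_span"], n_corrupt)
    start = rng.randrange(0, len(tokens) - span + 1)
    inp = [mode] + tokens[:start] + ["<extra_id_0>"] + tokens[start + span:]
    tgt = ["<extra_id_0>"] + tokens[start:start + span]
    return inp, tgt

def mod_example(tokens, rng=None):
    """Sample one denoiser uniformly and return (mode, input, target)."""
    rng = rng or random.Random()
    mode = rng.choice(sorted(DENOISERS))
    inp, tgt = corrupt(tokens, mode, rng)
    return mode, inp, tgt
```

At fine-tuning or inference time, mode switching then amounts to prepending the mode token matching the downstream task's preferred pre-training scheme.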


Results from the Paper


 Ranked #1 on Long-range modeling on SCROLLS (CNLI metric)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning | ARC (Challenge) | UL2 20B (chain-of-thought) | Accuracy | 42.9 | #39 |
| Common Sense Reasoning | ARC (Challenge) | UL2 20B (zero-shot) | Accuracy | 29.8 | #47 |
| Common Sense Reasoning | ARC (Challenge) | UL2 20B (chain-of-thought + self-consistency) | Accuracy | 49.5 | #31 |
| Common Sense Reasoning | ARC (Easy) | UL2 20B (0-shot) | Accuracy | 32.2 | #42 |
| Common Sense Reasoning | ARC (Easy) | UL2 20B (chain-of-thought) | Accuracy | 38.4 | #40 |
| Common Sense Reasoning | ARC (Easy) | UL2 20B (chain-of-thought + self-consistency) | Accuracy | 69.8 | #30 |
| Question Answering | BoolQ | UL2 20B (0-shot) | Accuracy | 63.1 | #45 |
| Question Answering | BoolQ | UL2 20B (fine-tuned) | Accuracy | 90.8 | #6 |
| Common Sense Reasoning | CommonsenseQA | UL2 20B (chain-of-thought) | Accuracy | 51.4 | #33 |
| Common Sense Reasoning | CommonsenseQA | UL2 20B (chain-of-thought + self-consistency) | Accuracy | 55.7 | #31 |
| Common Sense Reasoning | CommonsenseQA | UL2 20B (zero-shot) | Accuracy | 34.2 | #35 |
| Question Answering | COPA | UL2 20B (0-shot) | Accuracy | 85 | #29 |
| Question Answering | COPA | UL2 20B (fine-tuned) | Accuracy | 99 | #4 |
| Arithmetic Reasoning | GSM8K | UL2 20B (0-shot) | Accuracy | 4.1 | #151 |
| Arithmetic Reasoning | GSM8K | UL2 20B (0-shot) | Parameters (Billion) | 20 | #68 |
| Arithmetic Reasoning | GSM8K | UL2 20B (chain-of-thought) | Accuracy | 4.4 | #150 |
| Arithmetic Reasoning | GSM8K | UL2 20B (chain-of-thought) | Parameters (Billion) | 20 | #68 |
| Multi-task Language Understanding | MMLU | UL2 20B (5-shot) | Average (%) | 39.2 | #78 |
| Multi-task Language Understanding | MMLU | FLAN-UL2 20B (chain-of-thought) | Average (%) | 52.2 | #64 |
| Multi-task Language Understanding | MMLU | FLAN-UL2 20B (5-shot) | Average (%) | 55.7 | #57 |
| Natural Language Inference | RTE | UL2 20B (fine-tuned) | Accuracy | 92.1% | #10 |
| Natural Language Inference | RTE | UL2 20B (0-shot) | Accuracy | 60.7% | #71 |
| Long-range modeling | SCROLLS | UL2 | GovRep | 53.6 / 26.1 / 28.8 | #8 |
| Long-range modeling | SCROLLS | UL2 | SumScr | 32.9 / 7.8 / 19.4 | #8 |
| Long-range modeling | SCROLLS | UL2 | QMSum | 31.1 / 8.5 / 20.4 | #8 |
| Long-range modeling | SCROLLS | UL2 | Qspr | 37.6 | #7 |
| Long-range modeling | SCROLLS | UL2 | Nrtv | 24.2 | #5 |
| Long-range modeling | SCROLLS | UL2 | QALT EM-T/H | 45.8 / 40.7 | #2 |
| Long-range modeling | SCROLLS | UL2 | Avg. | 37.87 | #7 |
| Long-range modeling | SCROLLS | UL2 20B | CNLI | 88.7 | #1 |
| Coreference Resolution | Winograd Schema Challenge | UL2 20B (fine-tuned) | Accuracy | 98.1 | #3 |
| Coreference Resolution | Winograd Schema Challenge | UL2 20B (0-shot) | Accuracy | 79.9 | #23 |
| Word Sense Disambiguation | Words in Context | UL2 20B (0-shot) | Accuracy | 49.8 | #34 |
| Word Sense Disambiguation | Words in Context | UL2 20B (fine-tuned) | Accuracy | 77.3 | #6 |

Methods