The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

23 May 2023 · Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, Minjoon Seo

Language models (LMs) with fewer than 100B parameters are known to perform poorly on chain-of-thought (CoT) reasoning, in contrast to large LMs, when solving unseen tasks. In this work, we aim to equip smaller LMs with step-by-step reasoning capability through instruction tuning with CoT rationales. To achieve this goal, we first introduce a new instruction-tuning dataset called the CoT Collection, which augments the existing Flan Collection (which includes only 9 CoT tasks) with an additional 1.84 million rationales across 1,060 tasks. We show that CoT fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables smaller LMs to acquire better CoT capabilities on unseen tasks. On the BIG-Bench-Hard (BBH) benchmark, we report an average improvement of +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B) in terms of zero-shot task accuracy. Furthermore, we show that instruction tuning with the CoT Collection gives LMs stronger few-shot learning capabilities on 4 domain-specific tasks, resulting in improvements of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT (which utilizes demonstrations up to the maximum input length) by a +13.98% margin. Our code, the CoT Collection data, and model checkpoints are publicly available.
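To illustrate the training setup described above, here is a minimal sketch of how a single CoT instruction-tuning example might be formatted: the model receives a task instruction as input and is trained to emit a step-by-step rationale followed by the final answer. The function name, field names, and prompt template below are illustrative assumptions, not taken from the paper or the released dataset.

```python
# Hypothetical sketch of a CoT fine-tuning example (format is assumed,
# not the paper's exact schema): the target concatenates a rationale
# with the answer, so the LM learns to reason before answering.

def format_cot_example(instruction: str, rationale: str, answer: str) -> dict:
    """Build one (input, target) pair for CoT instruction tuning."""
    return {
        "input": instruction,
        # Rationale first, answer last: the decoder generates the
        # reasoning chain before committing to a final prediction.
        "target": f"{rationale} So the answer is {answer}.",
    }

example = format_cot_example(
    instruction=(
        "Premise: All birds can fly. Penguins are birds. "
        "Question: According to the premise, can penguins fly?"
    ),
    rationale=(
        "The premise states that all birds can fly, and penguins are birds, "
        "so under the premise penguins can fly."
    ),
    answer="yes",
)
print(example["target"])
```

During fine-tuning, pairs like this would simply replace the plain (instruction, answer) pairs of standard instruction tuning; at inference time the answer can be parsed from the tail of the generated text.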


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Natural Language Inference | ANLI test | T0-3B (CoT fine-tuned) | A1 | 41.7 | # 11 |
| Natural Language Inference | ANLI test | T0-3B (CoT fine-tuned) | A2 | 37.2 | # 17 |
| Natural Language Inference | ANLI test | T0-3B (CoT fine-tuned) | A3 | 41.9 | # 17 |
| | Big-bench Hard | CoT-T5 11B | Accuracy | 48 | # 1 |
| | BIG-bench (Hyperbaton) | CoT-T5 11B | Accuracy | 65.2 | # 1 |
| | BIG-bench (Navigate) | CoT-T5 11B | Accuracy | 60 | # 1 |
| | BIG-bench (Ruin Names) | CoT-T5 11B | Accuracy | 42.8 | # 1 |
| | BIG-bench (SNARKS) | CoT-T5 11B | Accuracy | 67.7 | # 1 |
| Few-Shot Learning | CaseHOLD | CoT-T5-11B (1024 Shot) | Accuracy | 68.3 | # 1 |
| Question Answering | COPA | T0-3B (CoT fine-tuned) | Accuracy | 90.9 | # 16 |
| Sentence Completion | HellaSwag | T0-3B (CoT fine-tuned) | Accuracy | 41.1 | # 71 |
| Few-Shot Learning | MedNLI | CoT-T5-11B (1024 Shot) | Accuracy | 78.02 | # 1 |
| Question Answering | PubMedQA | CoT-T5-11B (1024 Shot) | Accuracy | 73.42 | # 16 |
| Few-Shot Learning | PubMedQA | CoT-T5-11B (1024 Shot) | Accuracy | 73.42 | # 1 |
| Natural Language Inference | RTE | T0-3B (CoT fine-tuned) | Accuracy | 80.8% | # 34 |
| Question Answering | StoryCloze | T0-3B (CoT fine-tuned) | Accuracy | 94.5 | # 4 |
| Coreference Resolution | Winograd Schema Challenge | T0-3B (CoT fine-tuned) | Accuracy | 66 | # 41 |
| Common Sense Reasoning | WinoGrande | T0-3B (CoT fine-tuned) | Accuracy | 57.5 | # 53 |
| Word Sense Disambiguation | Words in Context | T0-3B (CoT fine-tuned) | Accuracy | 56.7 | # 21 |
