The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

23 May 2023 · Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, Minjoon Seo

Language models (LMs) with fewer than 100B parameters are known to perform poorly on chain-of-thought (CoT) reasoning, in contrast to large LMs, when solving unseen tasks. In this work, we aim to equip smaller LMs with step-by-step reasoning capability through instruction tuning with CoT rationales. To achieve this goal, we first introduce a new instruction-tuning dataset called the CoT Collection, which augments the existing Flan Collection (which includes only 9 CoT tasks) with an additional 1.84 million rationales across 1,060 tasks. We show that CoT fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables smaller LMs to acquire better CoT capabilities on unseen tasks. On the BIG-Bench-Hard (BBH) benchmark, we report an average improvement of +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B) in terms of zero-shot task accuracy. Furthermore, we show that instruction tuning with the CoT Collection gives LMs stronger few-shot learning capabilities on 4 domain-specific tasks, resulting in improvements of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT (which utilizes demonstrations up to the maximum input length) by a +13.98% margin. Our code, the CoT Collection data, and model checkpoints are publicly available.
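To illustrate the training setup described above, here is a minimal sketch of how a single CoT instruction-tuning example might be formatted: the model receives a task instruction as input and is trained to emit a step-by-step rationale followed by the final answer. The function name, field names, and prompt template below are illustrative assumptions, not taken from the paper or the released dataset.

```python
# Hypothetical sketch of a CoT fine-tuning example (format is assumed,
# not the paper's exact schema): the target concatenates a rationale
# with the answer, so the LM learns to reason before answering.

def format_cot_example(instruction: str, rationale: str, answer: str) -> dict:
    """Build one (input, target) pair for CoT instruction tuning."""
    return {
        "input": instruction,
        # Rationale first, answer last: the decoder generates the
        # reasoning chain before committing to a final prediction.
        "target": f"{rationale} So the answer is {answer}.",
    }

example = format_cot_example(
    instruction=(
        "Premise: All birds can fly. Penguins are birds. "
        "Question: According to the premise, can penguins fly?"
    ),
    rationale=(
        "The premise states that all birds can fly, and penguins are birds, "
        "so under the premise penguins can fly."
    ),
    answer="yes",
)
print(example["target"])
```

During fine-tuning, pairs like this would simply replace the plain (instruction, answer) pairs of standard instruction tuning; at inference time the answer can be parsed from the tail of the generated text.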


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Natural Language Inference | ANLI test | T0-3B (CoT fine-tuned) | A1 | 41.7 | # 11 |
| Natural Language Inference | ANLI test | T0-3B (CoT fine-tuned) | A2 | 37.2 | # 17 |
| Natural Language Inference | ANLI test | T0-3B (CoT fine-tuned) | A3 | 41.9 | # 17 |
| | Big-bench Hard | CoT-T5 11B | Accuracy | 48 | # 1 |
| | BIG-bench (Hyperbaton) | CoT-T5 11B | Accuracy | 65.2 | # 1 |
| | BIG-bench (Navigate) | CoT-T5 11B | Accuracy | 60 | # 1 |
| | BIG-bench (Ruin Names) | CoT-T5 11B | Accuracy | 42.8 | # 1 |
| | BIG-bench (SNARKS) | CoT-T5 11B | Accuracy | 67.7 | # 1 |
| Few-Shot Learning | CaseHOLD | CoT-T5-11B (1024 Shot) | Accuracy | 68.3 | # 1 |
| Question Answering | COPA | T0-3B (CoT fine-tuned) | Accuracy | 90.9 | # 16 |
| Sentence Completion | HellaSwag | T0-3B (CoT fine-tuned) | Accuracy | 41.1 | # 71 |
| Few-Shot Learning | MedNLI | CoT-T5-11B (1024 Shot) | Accuracy | 78.02 | # 1 |
| Question Answering | PubMedQA | CoT-T5-11B (1024 Shot) | Accuracy | 73.42 | # 16 |
| Few-Shot Learning | PubMedQA | CoT-T5-11B (1024 Shot) | Accuracy | 73.42 | # 1 |
| Natural Language Inference | RTE | T0-3B (CoT fine-tuned) | Accuracy | 80.8% | # 34 |
| Question Answering | StoryCloze | T0-3B (CoT fine-tuned) | Accuracy | 94.5 | # 4 |
| Coreference Resolution | Winograd Schema Challenge | T0-3B (CoT fine-tuned) | Accuracy | 66 | # 41 |
| Common Sense Reasoning | WinoGrande | T0-3B (CoT fine-tuned) | Accuracy | 57.5 | # 53 |
| Word Sense Disambiguation | Words in Context | T0-3B (CoT fine-tuned) | Accuracy | 56.7 | # 21 |
