Orca 1 learns from rich signals, such as explanation traces, allowing it to outperform conventional instruction-tuned models on benchmarks like BigBench Hard and AGIEval. In Orca 2, we continue exploring how improved training signals can enhance smaller LMs' reasoning abilities. Research on training small LMs has often relied on imitation learning to replicate the output of more capable models. We contend that excessive emphasis on imitation may restrict the potential of smaller models. We seek to teach small LMs to employ different solution strategies for different tasks, potentially different from the one used by the larger model. For example, while larger models might provide a direct answer to a complex task, smaller models may not have the same capacity. In Orca 2, we teach the model various reasoning techniques (step-by-step, recall then generate, recall-reason-generate, direct answer, etc.). More crucially, we aim to help the model learn to determine the most effective solution strategy for each task. We evaluate Orca 2 using a comprehensive set of 15 diverse benchmarks (corresponding to approximately 100 tasks and over 36,000 unique prompts). Orca 2 significantly surpasses models of similar size and attains performance levels similar to or better than those of models 5-10x larger, as assessed on complex tasks that test advanced reasoning abilities in zero-shot settings. We make Orca 2 weights publicly available at aka.ms/orca-lm to support research on the development, evaluation, and alignment of smaller LMs.
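
As a rough illustration of this strategy-conditioned setup, the sketch below pairs per-strategy system instructions with a call to a larger teacher model. It is a minimal sketch, assuming a generic `teacher` callable; the instruction wordings and helper names are illustrative assumptions, not the paper's actual prompts or pipeline.

```python
# Hypothetical sketch of strategy-conditioned data generation.
# The instruction wordings and the `teacher` callable are illustrative
# assumptions; they are not taken from the Orca 2 paper.
from typing import Callable, Dict

STRATEGY_INSTRUCTIONS: Dict[str, str] = {
    "step_by_step": "Work through the problem step by step, then state the answer.",
    "recall_then_generate": "First recall the relevant facts, then write the answer.",
    "recall_reason_generate": "Recall relevant facts, reason over them, then answer.",
    "direct_answer": "Give the final answer directly, without explanation.",
}

def build_training_example(
    prompt: str,
    strategy: str,
    teacher: Callable[[str, str], str],  # teacher(system_instruction, user_prompt) -> text
) -> Dict[str, str]:
    """Elicit a teacher response under one chosen reasoning strategy."""
    system = STRATEGY_INSTRUCTIONS[strategy]
    response = teacher(system, prompt)
    # Training the student on (prompt, response) alone -- with the
    # strategy instruction withheld -- would push it to infer which
    # strategy fits each task rather than imitate a single behavior.
    return {"prompt": prompt, "response": response}
```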

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | AGIEval | Orca 2-13B | Accuracy | 49.93 | #1 |
| Question Answering | AGIEval | Orca 2-7B | Accuracy | 45.1 | #2 |
| Multi-task Language Understanding | BBH-nlp | Orca 2-13B | Average (%) | 50.18 | #8 |
| Multi-task Language Understanding | BBH-nlp | Orca 2-7B | Average (%) | 45.93 | #9 |
| Crass AI | BIG-bench | Orca 2-13B | Accuracy | 86.86 | #1 |
| Crass AI | BIG-bench | Orca 2-7B | Accuracy | 84.31 | #2 |
| Question Answering | DROP Test | Orca 2-13B | F1 | 57.97 | #13 |
| Question Answering | DROP Test | Orca 2-7B | F1 | 60.26 | #12 |
| Arithmetic Reasoning | GSM8K | Orca 2-7B | Accuracy | 47.23 | #125 |
| Arithmetic Reasoning | GSM8K | Orca 2-7B | Parameters (Billion) | 7 | #10 |
| Arithmetic Reasoning | GSM8K | Orca 2-13B | Accuracy | 59.14 | #107 |
| Arithmetic Reasoning | GSM8K | Orca 2-13B | Parameters (Billion) | 13 | #53 |
| Reading Comprehension | RACE | Orca 2-13B | Accuracy | 82.87 | #8 |
| Reading Comprehension | RACE | Orca 2-7B | Accuracy | 80.79 | #9 |
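
For reference, here is a minimal loading sketch using Hugging Face `transformers`, assuming the released weights are mirrored on the Hub as `microsoft/Orca-2-13b` and accept a ChatML-style prompt format; verify both against aka.ms/orca-lm and the model card before relying on them.

```python
# Minimal inference sketch; the model ID and prompt format are assumptions
# to be checked against aka.ms/orca-lm and the Hugging Face model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Orca-2-13b"  # assumed Hub mirror of the released weights
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

system = "You are a helpful assistant. Think step by step when needed."
user = "A train travels 60 miles in 1.5 hours. What is its average speed?"
prompt = (
    f"<|im_start|>system\n{system}<|im_end|>\n"
    f"<|im_start|>user\n{user}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```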
