Orca 1 learns from rich signals, such as explanation traces, allowing it to outperform conventional instruction-tuned models on benchmarks like BigBench Hard and AGIEval. In Orca 2, we continue exploring how improved training signals can enhance smaller LMs' reasoning abilities. Research on training small LMs has often relied on imitation learning to replicate the output of more capable models. We contend that excessive emphasis on imitation may restrict the potential of smaller models. We seek to teach small LMs to employ different solution strategies for different tasks, potentially different from the one used by the larger model. For example, while larger models might provide a direct answer to a complex task, smaller models may not have the same capacity. In Orca 2, we teach the model various reasoning techniques (step-by-step, recall then generate, recall-reason-generate, direct answer, etc.). More crucially, we aim to help the model learn to determine the most effective solution strategy for each task. We evaluate Orca 2 using a comprehensive set of 15 diverse benchmarks (corresponding to approximately 100 tasks and over 36,000 unique prompts). Orca 2 significantly surpasses models of similar size and attains performance levels similar to or better than those of models 5-10x larger, as assessed on complex tasks that test advanced reasoning abilities in zero-shot settings. We make Orca 2 weights publicly available at aka.ms/orca-lm to support research on the development, evaluation, and alignment of smaller LMs.
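
As a rough illustration of this strategy-conditioned setup, the sketch below pairs per-strategy system instructions with a call to a larger teacher model. It is a minimal sketch, assuming a generic `teacher` callable; the instruction wordings and helper names are illustrative assumptions, not the paper's actual prompts or pipeline.

```python
# Hypothetical sketch of strategy-conditioned data generation.
# The instruction wordings and the `teacher` callable are illustrative
# assumptions; they are not taken from the Orca 2 paper.
from typing import Callable, Dict

STRATEGY_INSTRUCTIONS: Dict[str, str] = {
    "step_by_step": "Work through the problem step by step, then state the answer.",
    "recall_then_generate": "First recall the relevant facts, then write the answer.",
    "recall_reason_generate": "Recall relevant facts, reason over them, then answer.",
    "direct_answer": "Give the final answer directly, without explanation.",
}

def build_training_example(
    prompt: str,
    strategy: str,
    teacher: Callable[[str, str], str],  # teacher(system_instruction, user_prompt) -> text
) -> Dict[str, str]:
    """Elicit a teacher response under one chosen reasoning strategy."""
    system = STRATEGY_INSTRUCTIONS[strategy]
    response = teacher(system, prompt)
    # Training the student on (prompt, response) alone -- with the
    # strategy instruction withheld -- would push it to infer which
    # strategy fits each task rather than imitate a single behavior.
    return {"prompt": prompt, "response": response}
```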

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | AGIEval | Orca 2-13B | Accuracy | 49.93 | #1 |
| Question Answering | AGIEval | Orca 2-7B | Accuracy | 45.1 | #2 |
| Multi-task Language Understanding | BBH-nlp | Orca 2-13B | Average (%) | 50.18 | #8 |
| Multi-task Language Understanding | BBH-nlp | Orca 2-7B | Average (%) | 45.93 | #9 |
| Crass AI | BIG-bench | Orca 2-13B | Accuracy | 86.86 | #1 |
| Crass AI | BIG-bench | Orca 2-7B | Accuracy | 84.31 | #2 |
| Question Answering | DROP Test | Orca 2-13B | F1 | 57.97 | #13 |
| Question Answering | DROP Test | Orca 2-7B | F1 | 60.26 | #12 |
| Arithmetic Reasoning | GSM8K | Orca 2-7B | Accuracy | 47.23 | #125 |
| Arithmetic Reasoning | GSM8K | Orca 2-7B | Parameters (Billion) | 7 | #10 |
| Arithmetic Reasoning | GSM8K | Orca 2-13B | Accuracy | 59.14 | #107 |
| Arithmetic Reasoning | GSM8K | Orca 2-13B | Parameters (Billion) | 13 | #53 |
| Reading Comprehension | RACE | Orca 2-13B | Accuracy | 82.87 | #8 |
| Reading Comprehension | RACE | Orca 2-7B | Accuracy | 80.79 | #9 |
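
For reference, here is a minimal loading sketch using Hugging Face `transformers`, assuming the released weights are mirrored on the Hub as `microsoft/Orca-2-13b` and accept a ChatML-style prompt format; verify both against aka.ms/orca-lm and the model card before relying on them.

```python
# Minimal inference sketch; the model ID and prompt format are assumptions
# to be checked against aka.ms/orca-lm and the Hugging Face model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Orca-2-13b"  # assumed Hub mirror of the released weights
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

system = "You are a helpful assistant. Think step by step when needed."
user = "A train travels 60 miles in 1.5 hours. What is its average speed?"
prompt = (
    f"<|im_start|>system\n{system}<|im_end|>\n"
    f"<|im_start|>user\n{user}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```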
