Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning

4 Mar 2024 · Yiming Huang, Xiao Liu, Yeyun Gong, Zhibin Gou, Yelong Shen, Nan Duan, Weizhu Chen

Large language models (LLMs) have shown great potential in complex reasoning tasks, yet their performance is often hampered by the scarcity of high-quality, reasoning-focused training datasets. To address this challenge, we propose Key-Point-Driven Data Synthesis (KPDDS), a novel data synthesis framework that synthesizes question-answer pairs by leveraging key points and exemplar practices from authentic data sources. KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability. As a result, we present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs. By augmenting KPMath with additional reasoning-intensive corpora, we create the comprehensive KPMath-Plus dataset. DeepSeekMath fine-tuned on KPMath-Plus achieves zero-shot PASS@1 accuracies of 83.9% on GSM8K and 48.8% on MATH, and also reaches promising performance on other math reasoning datasets, outperforming competitors in the 7B to 70B range.
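For readers unfamiliar with the metric, the following is a minimal sketch of how a zero-shot PASS@1 accuracy can be computed: each problem gets exactly one model answer, and the score is the fraction of problems whose answer matches the reference. The function names `normalize` and `pass_at_1` and the string normalization rules are illustrative assumptions, not the paper's evaluation code; the abstract does not describe the authors' answer-matching procedure.

```python
def normalize(answer: str) -> str:
    """Hypothetical normalization: trim whitespace, drop thousands separators
    and a trailing period. Real evaluation harnesses are usually stricter."""
    return answer.strip().replace(",", "").rstrip(".")


def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems whose single predicted answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    # Toy data, not results from the paper.
    preds = ["18", "3", "1,000"]
    refs = ["18", "4", "1000"]
    print(f"PASS@1 = {pass_at_1(preds, refs):.3f}")
```

Under this definition, a reported 48.8% on MATH means that one answer per problem was generated and 48.8% of those answers were judged correct.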


Results from the Paper


Task                       Dataset  Model                        Metric Name            Metric Value  Global Rank
Math Word Problem Solving  MATH     Llemma-34B-KPMath-Plus       Accuracy               48.6          # 31
                                                                 Parameters (Billions)  34            # 26
Math Word Problem Solving  MATH     Llama2-13B-KPMath-Plus       Accuracy               41            # 52
                                                                 Parameters (Billions)  13            # 38
Math Word Problem Solving  MATH     DeepSeekMath-7B-KPMath-Plus  Accuracy               48.8          # 29
                                                                 Parameters (Billions)  7             # 58
Math Word Problem Solving  MATH     Mistral-7B-KPMath-Plus       Accuracy               46.8          # 36
                                                                 Parameters (Billions)  7             # 58
