Can large language models reason about medical questions?

Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama-2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple prompting scenarios: Chain-of-Thought (CoT, think step-by-step), few-shot prompting, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Lastly, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions, but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama-2 70B also passed the MedQA-USMLE with 62.5% accuracy.

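The prompting setup described in the abstract (few-shot Chain-of-Thought combined with ensembling over sampled reasoning paths) can be sketched roughly as below. This is a minimal illustration, not the authors' exact implementation: the `generate` helper, the demonstration text, and the answer-extraction heuristic are assumptions standing in for whatever LLM client and prompts were actually used.

```python
import re
from collections import Counter

# Hypothetical text-generation helper (e.g. a thin wrapper around an OpenAI or
# Llama-2 client); it takes a prompt and a sampling temperature and returns one
# sampled completion as a string.
def generate(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError("plug in your LLM client here")

# A single worked example used as a few-shot demonstration (content illustrative only).
FEW_SHOT_DEMO = (
    "Question: Which vitamin deficiency causes scurvy?\n"
    "(A) Vitamin A (B) Vitamin B12 (C) Vitamin C (D) Vitamin D\n"
    "Answer: Let's think step by step. Scurvy results from impaired collagen "
    "synthesis, which depends on ascorbic acid. Therefore, the answer is (C).\n\n"
)

def answer_with_cot_ensemble(question: str, options: str, n_samples: int = 5) -> str:
    """Few-shot chain-of-thought prompting with majority-vote ensembling:
    sample several reasoning paths and keep the most frequent answer letter."""
    prompt = (
        FEW_SHOT_DEMO
        + f"Question: {question}\n{options}\n"
        + "Answer: Let's think step by step."
    )
    votes = []
    for _ in range(n_samples):
        completion = generate(prompt, temperature=0.8)
        # Extract option letters written as "(A)", "(B)", ...; keep the last one mentioned.
        letters = re.findall(r"\(([A-D])\)", completion)
        if letters:
            votes.append(letters[-1])
    # The vote histogram can also serve as a rough predictive distribution over options.
    return Counter(votes).most_common(1)[0][0] if votes else "A"
```

The relative vote frequencies give an (approximate) predictive distribution over the answer options, which is the kind of output whose calibration the abstract refers to.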

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Multiple Choice Question Answering (MCQA) | MedMCQA | Codex 5-shot CoT | Dev Set Acc (%) | 59.7 | #2 |
| Multiple Choice Question Answering (MCQA) | MedMCQA | Codex 5-shot CoT | Test Set Acc (%) | 62.7 | #5 |
| Question Answering | MedQA | Codex 5-shot CoT | Accuracy (%) | 60.2 | #11 |
| Question Answering | PubMedQA | Codex 5-shot CoT | Accuracy (%) | 78.2 | #5 |
