Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code

22 Dec 2023  ·  Shahin Honarvar, Mark van der Wilk, Alastair Donaldson

We present a method for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language $\textit{question templates}$, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated $\textit{test oracle}$ that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a $\textit{neighbourhood}$ of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM's code generation abilities to be identified, including $\textit{anomalies}$ where the LLM correctly solves $\textit{almost all}$ questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting $\textit{robustness}$ issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
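To make the template/oracle/neighbourhood pipeline concrete, here is a minimal sketch of how such a benchmark entry could be structured. Everything in it (the `QuestionTemplate` class, the `kth_smallest` problem, and the randomised oracle) is hypothetical and purely illustrative; it is not code from the Turbulence benchmark itself.

```python
# Illustrative sketch only: QuestionTemplate, the example problem and the
# oracle below are hypothetical, not taken from the Turbulence benchmark.
import itertools
import random
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class QuestionTemplate:
    """A parameterised programming problem plus a test oracle."""
    prompt_template: str                      # natural-language prompt with {placeholders}
    parameter_space: dict                     # parameter name -> candidate values
    oracle: Callable[[Callable, dict], bool]  # judges one candidate solution

    def neighbourhood(self) -> Iterable[dict]:
        """All parameter instantiations of this template (its 'neighbourhood')."""
        names = list(self.parameter_space)
        for values in itertools.product(*(self.parameter_space[n] for n in names)):
            yield dict(zip(names, values))

    def render(self, params: dict) -> str:
        """The concrete question to send to the LLM."""
        return self.prompt_template.format(**params)

# One hypothetical entry: "return the k-th smallest element of a list".
template = QuestionTemplate(
    prompt_template=(
        "Write a Python function kth_smallest(xs) that returns the "
        "{k}-th smallest element of the non-empty list xs."
    ),
    parameter_space={"k": [1, 2, 3, 5, 10]},
    # Differential oracle: compare the candidate against a sorting reference
    # on randomly generated inputs.
    oracle=lambda solution, params: all(
        solution(xs) == sorted(xs)[params["k"] - 1]
        for xs in (random.sample(range(100), 20) for _ in range(50))
    ),
)
```

Querying an LLM with `template.render(params)` for every `params` in `template.neighbourhood()` and scoring each answer with the oracle yields a per-neighbourhood correctness profile; a single failing instantiation amid otherwise correct neighbours is the kind of anomaly the paper reports.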

Datasets

Introduced in the Paper: Turbulence

Results from the Paper

Task              Dataset      Model                          Metric Name   Metric Value   Global Rank
Code Generation   Turbulence   GPT-4                          CorrSc        0.848          #1
Code Generation   Turbulence   GPT-3.5-Turbo                  CorrSc        0.617          #2
Code Generation   Turbulence   CodeLlama:13B-4bit-quantised   CorrSc        0.327          #3
Code Generation   Turbulence   CodeLlama:7B-4bit-quantised    CorrSc        0.289          #4
Code Generation   Turbulence   Command                        CorrSc        0.063          #5
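This page does not define the CorrSc metric. Assuming it denotes the mean per-neighbourhood pass rate under the test oracles (an assumption, not confirmed by this page), the aggregation could look like the following sketch:

```python
# Hypothetical aggregation, assuming CorrSc is the mean per-neighbourhood
# oracle pass rate; this definition is an assumption, not taken from the paper.
def corr_sc(results: dict[str, list[bool]]) -> float:
    """results maps each question template to oracle verdicts across its neighbourhood."""
    rates = [sum(verdicts) / len(verdicts) for verdicts in results.values()]
    return sum(rates) / len(rates)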
