When LLM-based Code Generation Meets the Software Development Process

23 Mar 2024 · Feng Lin, Dong Jae Kim, Tse-Hsun Chen

Software process models play a pivotal role in fostering collaboration and communication within software teams, enabling them to tackle intricate development tasks effectively. This paper introduces LCG, a code generation framework inspired by established software engineering practices. LCG leverages multiple Large Language Model (LLM) agents to emulate various software process models, namely LCGWaterfall, LCGTDD, and LCGScrum. Each model assigns LLM agents specific roles such as requirement engineer, architect, developer, tester, and scrum master, mirroring typical development activities and communication patterns. Through collaborative efforts utilizing chain-of-thought and prompt composition techniques, the agents continuously refine themselves to enhance code quality. Utilizing GPT3.5 as the underlying LLM and baseline (GPT), we evaluate LCG across four code generation benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Results indicate LCGScrum outperforms other models, achieving Pass@1 scores of 75.2, 65.5, 82.5, and 56.7 in HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively - an average 15% improvement over GPT. Analysis reveals distinct impacts of development activities on generated code, with design and code reviews contributing to enhanced exception handling, while design, testing, and code reviews mitigate code smells. Furthermore, temperature values exhibit negligible influence on Pass@1 across all models. However, variations in Pass@1 are notable for different GPT3.5 model versions, ranging from 5 to over 60 in HumanEval, highlighting the stability of LCG across model versions. This stability underscores the importance of adopting software process models to bolster the quality and consistency of LLM-generated code.
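The scores above are reported as Pass@1. The abstract does not spell out how the metric is computed; as a minimal sketch, assuming the standard unbiased pass@k estimator commonly used with HumanEval, a benchmark-level score could be computed as follows. The function names `pass_at_k` and `benchmark_pass_at_k` and the `(n, c)` input layout are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass every test
    k: sample budget being evaluated (k = 1 for Pass@1)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples fail, every draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over all problems, as a percentage.

    results: one (n, c) pair per benchmark problem (hypothetical layout).
    """
    if not results:
        raise ValueError("results must not be empty")
    total = sum(pass_at_k(n, c, k) for n, c in results)
    return 100.0 * total / len(results)
```

With a single sample per problem (n = 1, k = 1), the estimator reduces to the fraction of problems whose generated solution passes all tests, which is how a figure such as 82.5 on MBPP would be read under this assumption.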

Code

No code implementations yet.

Tasks

Code Generation

Language Modelling

Large Language Model

Datasets

HumanEval, MBPP, MBPP-ET, HumanEval-ET

Results from the Paper

Ranked #6 on Code Generation on MBPP

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Code Generation	HumanEval	LLMCodeGen Scrum (GPT-3.5 + zero-shot)	Pass@1	78.5	# 11
Code Generation	MBPP	GPT-3.5 Turbo + LLMCodeGen Scrum	Accuracy	82.5	# 6

Methods

Adam • Attention Dropout • BPE • Cosine Annealing • Dense Connections • Discriminative Fine-Tuning • Dropout • GELU • GPT • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Weight Decay
