CodeInstruct: Empowering Language Models to Edit Code

Code editing encompasses a variety of pragmatic tasks that developers deal with daily. Despite its relevance and practical usefulness, automatic code editing remains an underexplored area in the evolution of deep learning models, partly due to data scarcity. In this work, we explore the use of large language models (LLMs) to edit code based on user instructions, covering a broad range of implicit tasks such as comment insertion, code optimization, and code refactoring. To facilitate this, we introduce CodeInstruct, the first dataset designed to adapt LLMs for general-purpose code editing, containing high-diversity code-editing tasks. It consists of over 114,000 instruction-input-output triplets and covers multiple distinct code-editing scenarios. The dataset is systematically expanded through an iterative process that begins with code-editing data sourced from GitHub commits as seed tasks. Seed and generated tasks are then used to prompt ChatGPT for additional task data. Our experiments demonstrate that open-source LLMs fine-tuned on CodeInstruct can edit code correctly based on users' instructions most of the time, exhibiting unprecedented code-editing performance on par with ChatGPT. These results suggest that proficient instruction finetuning can lead to significant improvements in code-editing abilities.
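The instruction-input-output triplet format described above can be illustrated with a minimal sketch. The field names and the example task here are hypothetical assumptions for illustration, not actual entries from the released dataset.

```python
# A hypothetical CodeInstruct-style triplet: field names and content
# are illustrative assumptions, not actual dataset records.
example_triplet = {
    "instruction": "Add a docstring to the function.",
    "input": "def add(a, b):\n    return a + b",
    "output": (
        'def add(a, b):\n'
        '    """Return the sum of a and b."""\n'
        '    return a + b'
    ),
}

def is_valid_triplet(record):
    """Check that a record has the three expected non-empty string fields."""
    required = ("instruction", "input", "output")
    return all(isinstance(record.get(k), str) and record[k] for k in required)

print(is_valid_triplet(example_triplet))  # True
```

A validator like this is a common first step before fine-tuning, since malformed or empty records in instruction data can silently degrade model quality.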

Datasets


Introduced in the Paper:

CodeInstruct

