CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing

A growing number of mature natural language processing applications make people's lives more convenient. Such applications are built from source code, the language of software engineering. However, applications that understand source code in order to ease the software engineering process remain under-researched. At the same time, the transformer model, especially in combination with transfer learning, has proven to be a powerful technique for natural language processing tasks. These breakthroughs point to a promising direction for processing source code and tackling software engineering tasks. This paper describes CodeTrans, an encoder-decoder transformer model for the software engineering domain, and explores the effectiveness of encoder-decoder transformer models on six software engineering tasks comprising thirteen sub-tasks. We also investigate the effect of different training strategies: single-task learning, transfer learning, multi-task learning, and multi-task learning with fine-tuning. CodeTrans outperforms the state-of-the-art models on all the tasks. To expedite future work in the software engineering domain, we have published our pre-trained CodeTrans models at https://github.com/agemagician/CodeTrans

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Program Synthesis | AlgoLisp | CodeTrans-MT-TF-Small | Accuracy | 90.31 | #1 |
| Code Documentation Generation | CodeSearchNet - Go | CodeTrans-TF-Large | Smoothed BLEU-4 | 19.54 | #7 |
| Code Documentation Generation | CodeSearchNet - Java | CodeTrans-MT-Large | Smoothed BLEU-4 | 21.87 | #1 |
| Code Documentation Generation | CodeSearchNet - JavaScript | CodeTrans-TF-Large | Smoothed BLEU-4 | 18.98 | #2 |
| Code Documentation Generation | CodeSearchNet - Php | CodeTrans-MT-Base | Smoothed BLEU-4 | 26.23 | #1 |
| Code Documentation Generation | CodeSearchNet - Python | CodeTrans-MT-Base | Smoothed BLEU-4 | 20.39 | #1 |
| Code Documentation Generation | CodeSearchNet - Ruby | CodeTrans-MT-Base | Smoothed BLEU-4 | 15.26 | #1 |
| Git Commit Message Generation | CommitGen | CodeTrans-TF-Large | BLEU-4 | 44.41 | #1 |
| API Sequence Recommendation | DeepAPI | CodeTrans-MT-TF-Large | BLEU-4 | 73.39 | #1 |
| Code Comment Generation | DeepCom | CodeTrans-TF-Large | Smoothed BLEU-4 | 39.50 | #1 |
| Source Code Summarization | Summarizing Source Code using a Neural Attention Model - C# | CodeTrans-MT-Large | Smoothed BLEU-4 | 23.57 | #1 |
| Source Code Summarization | Summarizing Source Code using a Neural Attention Model - Python | CodeTrans-MT-Base | Smoothed BLEU-4 | 13.37 | #1 |
| Source Code Summarization | Summarizing Source Code using a Neural Attention Model - SQL | CodeTrans-MT-TF-Large | Smoothed BLEU-4 | 19.98 | #1 |
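
Most of the results above are reported as Smoothed BLEU-4. As a rough illustration of what such a score measures, the sketch below computes a sentence-level BLEU-4 with add-one smoothing on the 2-gram to 4-gram precisions. The smoothing variant, the whitespace tokenization, and the function names `ngram_counts` and `smoothed_bleu4` are assumptions made for this example; they are not taken from the paper, and the benchmark evaluation scripts may differ in these details.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu4(reference, hypothesis, max_order=4):
    """Sentence-level BLEU-4 on a 0-100 scale.

    Assumption: add-one smoothing is applied to the precisions of
    orders 2..4, and tokens are obtained by whitespace splitting.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_order + 1):
        ref_counts = ngram_counts(ref, n)
        hyp_counts = ngram_counts(hyp, n)
        # Clipped matches: a hypothesis n-gram is credited at most as
        # many times as it occurs in the reference.
        matches = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if n > 1:
            # Add-one smoothing keeps higher-order precisions non-zero.
            matches += 1
            total += 1
        if matches == 0:
            # No unigram overlap at all: the score is zero.
            return 0.0
        log_precision_sum += math.log(matches / total)

    # Brevity penalty discourages hypotheses shorter than the reference.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100.0 * brevity_penalty * math.exp(log_precision_sum / max_order)


if __name__ == "__main__":
    reference = "returns the sum of two numbers"
    hypothesis = "returns the sum of numbers"
    print(round(smoothed_bleu4(reference, hypothesis), 2))
```

Without smoothing, a short generated summary that shares no 4-gram with its reference would score zero regardless of how many shorter n-grams match, which is why smoothed variants are commonly preferred for sentence-level evaluation of documentation and comment generation.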

Methods