The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
* Six million methods overall
* Two million of which have associated documentation (docstrings, JavaDoc, and more)
* Metadata that indicates the original location (repository or line number, for example) where the data was found
259 PAPERS • 12 BENCHMARKS
CoDesc is a large parallel dataset of 4.2 million Java source code samples paired with their natural language descriptions, drawn from code search and code summarization studies.
4 PAPERS • 2 BENCHMARKS
A dataset of code snippets paired with short descriptions, built from data gathered from Stack Overflow, a popular programming help website. Because access is open and unrestricted, the content is inherently noisy (ungrammatical, non-parsable, or lacking content).
3 PAPERS • NO BENCHMARKS YET