The Stack

Introduced by Kocetkov et al. in "The Stack: 3 TB of permissively licensed source code"

The Stack contains over 3 TB of permissively licensed source code files, crawled from GitHub and covering 30 programming languages. The dataset was created as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs).

Source: https://huggingface.co/datasets/bigcode/the-stack
