C4 (Colossal Clean Crawled Corpus)

Introduced by Raffel et al. in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

C4 is a colossal, cleaned version of the Common Crawl web crawl corpus (https://commoncrawl.org). It was used to train the T5 text-to-text Transformer models.

The dataset can be downloaded in preprocessed form from AllenNLP.

License

  • Unknown