4 dataset results for Masked Language Modeling AND Texts

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

808 PAPERS • 3 BENCHMARKS

C4 (Colossal Clean Crawled Corpus)

C4 is a colossal, cleaned version of Common Crawl's web crawl corpus (https://commoncrawl.org). It was used to train the T5 text-to-text Transformer models.

633 PAPERS • 1 BENCHMARK

CORD-19

CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.

157 PAPERS • 2 BENCHMARKS

DiFair

DiFair addresses an oversight in how gender neutrality is assessed in pretrained language models: the effect of bias mitigation on useful gender knowledge. Its metric quantifies a model's biased tendencies and also measures how well useful gender knowledge is preserved.

1 PAPER • NO BENCHMARKS YET
