20 Dec 2022 • Tim Jansen, Yangling Tong, Victoria Zevallos, Pedro Ortiz Suarez
As demand for large corpora increases with the size of current state-of-the-art language models, using web data as the main part of the pre-training corpus for these models has become a ubiquitous practice.