Search Results for author: Emily Silcock

Found 2 papers, 1 paper with code

American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

no code implementations • NeurIPS 2023 • Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring

The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge.

Language Modelling, Large Language Model, +3


Noise-Robust De-Duplication at Scale

1 code implementation • 9 Oct 2022 • Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell

Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora.

