no code implementations • 16 Oct 2023 • Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell
Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets.
no code implementations • NeurIPS 2023 • Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring
The resulting American Stories dataset provides high-quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge.
1 code implementation • 5 Apr 2023 • Jacob Carlson, Tom Bryan, Melissa Dell
Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history.
6 code implementations • 29 Mar 2021 • Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li
Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks.