Search Results for author: Tom Bryan

Found 3 papers, 1 paper with code

EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge

no code implementations • 16 Oct 2023 • Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell

Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets.

Image Retrieval, Language Modelling, +3

American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

no code implementations • NeurIPS 2023 • Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring

The resulting American Stories dataset provides high-quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge.

Language Modelling, Large Language Model, +3

Efficient OCR for Building a Diverse Digital History

1 code implementation • 5 Apr 2023 • Jacob Carlson, Tom Bryan, Melissa Dell

Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history.

Image Retrieval, Language Modelling, +3
