no code implementations • 16 Oct 2023 • Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell
Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets.
1 code implementation • 2 Sep 2023 • Abhishek Arora, Melissa Dell
By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.
no code implementations • NeurIPS 2023 • Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring
The resulting American Stories dataset provides high-quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge.
1 code implementation • 24 May 2023 • Xinmei Yang, Abhishek Arora, Shao-Yu Jheng, Melissa Dell
Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists denoting which string substitutions are more likely; using these lists improves the accuracy of string matching.
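To illustrate the idea of unequal substitution costs, here is a minimal sketch of a weighted edit distance in Python. It is a generic textbook technique, not the method of the paper listed above; the function name `weighted_edit_distance`, its parameters, and the small `confusions` table are all hypothetical and chosen only for illustration.

```python
def weighted_edit_distance(a, b, sub_cost=None, default_sub=1.0, indel=1.0):
    """Levenshtein distance where some substitutions are cheaper than others.

    sub_cost maps (char, char) pairs to a cost; pairs are treated as
    symmetric. Any pair not listed costs default_sub. (Hypothetical API.)
    """
    sub_cost = sub_cost or {}
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cost = 0.0
            else:
                cost = sub_cost.get((ca, cb), sub_cost.get((cb, ca), default_sub))
            curr.append(min(
                prev[j] + indel,      # delete ca
                curr[j - 1] + indel,  # insert cb
                prev[j - 1] + cost,   # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


# Illustrative, made-up confusion list: letter O and digit 0 are easy to mix up.
confusions = {("O", "0"): 0.1}

plain = weighted_edit_distance("B00K", "BOOK")
weighted = weighted_edit_distance("B00K", "BOOK", sub_cost=confusions)
```

With the confusion table, the two likely substitutions cost far less than two arbitrary ones, so `weighted` is smaller than `plain` and the pair ranks as a closer match.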
no code implementations • 7 Apr 2023 • Abhishek Arora, Xinmei Yang, Shao-Yu Jheng, Melissa Dell
CLIPPINGS outperforms widely used string matching methods by a wide margin and also outperforms unimodal methods.