Search Results for author: Craig W. Schmidt

Found 3 papers, 1 paper with code

Greed is All You Need: An Evaluation of Tokenizer Inference Methods

1 code implementation · 2 Mar 2024 · Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter

While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed.
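One family of inference methods the paper evaluates is greedy segmentation. As a minimal sketch, the function below tokenizes a word by repeatedly taking the longest vocabulary item that is a prefix of the remaining text (longest-prefix matching). The vocabulary here is a hypothetical toy set; real BPE or WordPiece vocabularies are learned from a corpus, and this is an illustration of one greedy strategy, not the paper's full evaluation.

```python
def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Segment `word` greedily: at each position, take the longest
    vocabulary item that is a prefix of the remaining text."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible prefix first, shrinking until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary item matches: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"un", "believ", "belie", "able", "er"}  # hypothetical toy vocabulary
print(greedy_tokenize("unbelievable", vocab))    # ['un', 'believ', 'able']
```

Note that greedy longest-prefix matching can segment a word differently from the merge order used when the vocabulary was built, which is exactly the train/inference mismatch the paper examines.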

Tokenization Is More Than Compression

no code implementations · 28 Feb 2024 · Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging raw text and language models.

Data Compression

Improving a tf-idf weighted document vector embedding

no code implementations · 26 Feb 2019 · Craig W. Schmidt

We examine a number of methods to compute a dense vector embedding for a document in a corpus, given a set of word vectors such as those from word2vec or GloVe.
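For context, the standard construction the paper starts from is a document vector formed as the tf-idf weighted average of pre-trained word vectors. The sketch below illustrates that baseline with a toy corpus and hypothetical 3-dimensional embeddings standing in for word2vec or GloVe; it is not the paper's improved method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]
word_vecs = {  # hypothetical pre-trained word vectors (stand-ins for word2vec/GloVe)
    "cat": np.array([0.2, 0.1, 0.7]),
    "dog": np.array([0.3, 0.2, 0.5]),
    "mat": np.array([0.1, 0.8, 0.1]),
}

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)       # documents x vocabulary matrix
vocab = vectorizer.get_feature_names_out()

def doc_embedding(doc_idx: int) -> np.ndarray:
    """Dense document vector: tf-idf weighted average of word vectors."""
    row = tfidf[doc_idx].toarray().ravel()
    total, weight = np.zeros(3), 0.0
    for term_idx, w in enumerate(row):
        term = vocab[term_idx]
        if w > 0 and term in word_vecs:
            total += w * word_vecs[term]
            weight += w
    # Normalize by the total weight of terms that had embeddings.
    return total / weight if weight > 0 else total

print(doc_embedding(0))
```

Terms without a pre-trained vector are simply skipped here; handling such out-of-vocabulary words is one of the choices the paper's methods vary.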
