Search Results for author: Yuval Pinter

Found 35 papers, 20 papers with code

CIAug: Equipping Interpolative Augmentation with Curriculum Learning

1 code implementation NAACL 2022 Ramit Sawhney, Ritesh Soun, Shrey Pandit, Megh Thakkar, Sarvagya Malaviya, Yuval Pinter

CIAug achieves state-of-the-art results over existing interpolative augmentation methods on 10 benchmark datasets across 4 languages in text classification and named-entity recognition tasks.

Data Augmentation · named-entity-recognition · +5
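CIAug builds on interpolative (mixup-style) augmentation. Its contribution is the curriculum, ordering which samples get interpolated from easy to hard, which is not shown here; this is only a minimal, hedged sketch of the underlying interpolation step, with invented toy embeddings and coefficient:

```python
# Mixup-style interpolation between two embedding vectors.
# The vectors and the coefficient `lam` are illustrative, not from the paper.

def mixup(x_i, x_j, lam):
    """Return the elementwise interpolation lam * x_i + (1 - lam) * x_j."""
    return [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]

emb_a = [1.0, 0.0, 2.0]  # toy embedding of one training example
emb_b = [0.0, 4.0, 2.0]  # toy embedding of another
print(mixup(emb_a, emb_b, lam=0.5))  # [0.5, 2.0, 2.0]
```

Labels are interpolated with the same coefficient in standard mixup; CIAug schedules the pairs, not the arithmetic.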

Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

1 code implementation 20 Apr 2024 Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalizations compared to morphological tokenization for semantic compositionality of word meanings.

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

no code implementations 30 Mar 2024 Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter

We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords.

Machine Translation · Translation
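The trimming step described above can be sketched as recursively replacing rare subwords with the pair they were merged from. The vocabulary counts, merge record, and threshold below are invented for illustration; this is not the paper's implementation:

```python
# Threshold vocabulary trimming: any subword seen fewer than `threshold`
# times is replaced by the two component subwords it was merged from.
# Counts, merges, and the threshold are toy values.

def trim(tokens, counts, parents, threshold):
    """Recursively split tokens rarer than `threshold` into their components."""
    out = []
    for tok in tokens:
        if counts.get(tok, 0) < threshold and tok in parents:
            out.extend(trim(list(parents[tok]), counts, parents, threshold))
        else:
            out.append(tok)  # frequent enough, or an atomic symbol
    return out

counts = {"low": 50, "est": 40, "lowest": 2}
parents = {"lowest": ("low", "est")}  # the BPE merge that formed "lowest"
print(trim(["lowest"], counts, parents, threshold=10))  # ['low', 'est']
```

The components are guaranteed to exist because BPE builds every subword out of previously created ones, so trimming can never produce an out-of-vocabulary symbol.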

Greed is All You Need: An Evaluation of Tokenizer Inference Methods

1 code implementation 2 Mar 2024 Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter

While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed.
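One of the inference methods compared in this line of work is greedy longest-prefix matching over a fixed vocabulary. A hedged sketch with an invented toy vocabulary (real tokenizers also handle word-boundary markers and byte fallback, omitted here):

```python
# Greedy longest-prefix-match inference: at each position, emit the longest
# vocabulary entry that prefixes the remaining text. The vocabulary below
# is a toy example; unknown characters map to "[UNK]".

def greedy_tokenize(word, vocab):
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:  # no vocabulary entry matched even a single character
            tokens.append("[UNK]")
            i += 1
    return tokens

vocab = {"un", "relate", "related", "d", "u", "n"}
print(greedy_tokenize("unrelated", vocab))  # ['un', 'related']
```

Note that this can disagree with the tokenizer's own construction procedure, e.g. replaying BPE merges in training order, which is one mismatch the paper evaluates.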

Tokenization Is More Than Compression

no code implementations 28 Feb 2024 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging raw text and language models.

Data Compression

Domain-Specific Code Language Models: Unraveling the Potential for HPC Codes and Tasks

2 code implementations 20 Dec 2023 Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

Specifically, we start off with HPC as a domain and build an HPC-specific LM, named MonoCoder, that is orders of magnitude smaller than existing LMs but delivers similar, if not better, performance on non-HPC and HPC tasks.

Code Generation

Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies

no code implementations 19 Dec 2023 Anaelia Ovalle, Ninareh Mehrabi, Palash Goyal, Jwala Dhamala, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Yuval Pinter, Rahul Gupta

Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgender.

Analyzing Cognitive Plausibility of Subword Tokenization

no code implementations 20 Oct 2023 Lisa Beinborn, Yuval Pinter

Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce.

Emptying the Ocean with a Spoon: Should We Edit Models?

no code implementations 18 Oct 2023 Yuval Pinter, Michael Elhadad

We call into question the recently popularized method of direct model editing as a means of correcting factual errors in LLM generations.

Model Editing · Retrieval

Scope is all you need: Transforming LLMs for HPC Code

2 code implementations 18 Aug 2023 Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks.

Code Completion

Advising OpenMP Parallelization via a Graph-Based Approach with Transformers

2 code implementations 16 May 2023 Tal Kadosh, Nadav Schneider, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren

Specifically, we propose a novel approach, called OMPify, to detect and predict the OpenMP pragmas and shared-memory attributes in parallel code, given its serial version.

Data Augmentation

Incorporating Context into Subword Vocabularies

1 code implementation 13 Oct 2022 Shaked Yehezkel, Yuval Pinter

Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context.

NER

Lost in Space Marking

no code implementations 2 Aug 2022 Cassandra L. Jacobs, Yuval Pinter

We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one.
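The two marking conventions can be made concrete with a toy pre-tokenized input. The prefix marker mirrors SentencePiece's ▁ convention; the suffix variant is the analogous word-final scheme (which boundary a real tokenizer marks interacts with subword splitting in ways this sketch does not show):

```python
# Word-initial vs word-final boundary marking, illustrated on whole words.
# The input list and the marker character are illustrative.

def mark_initial(words, marker="▁"):
    """Attach the boundary marker to the start of each word (word-initial)."""
    return [marker + w for w in words]

def mark_final(words, marker="▁"):
    """Attach the boundary marker to the end of each word (word-final)."""
    return [w + marker for w in words]

words = ["lost", "in", "space"]
print(mark_initial(words))  # ['▁lost', '▁in', '▁space']
print(mark_final(words))    # ['lost▁', 'in▁', 'space▁']
```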

UniMorph 4.0: Universal Morphology

no code implementations LREC 2022 Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova

The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.

Morphological Inflection

Learning to Parallelize in a Shared-Memory Environment with Transformers

2 code implementations 27 Apr 2022 Re'em Harel, Yuval Pinter, Gal Oren

As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes to software applications.

Management

Integrating Approaches to Word Representation

no code implementations 10 Sep 2021 Yuval Pinter

The problem of representing the atomic elements of language in modern neural learning systems is one of the central challenges of the field of natural language processing.

Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information

no code implementations 1 Aug 2021 Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein

Commonly-used transformer language models depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch.

Restoring Hebrew Diacritics Without a Dictionary

1 code implementation Findings (NAACL) 2022 Elazar Gershuni, Yuval Pinter

We demonstrate that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text.

Will it Unblend?

1 code implementation SCiL 2021 Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein

Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data.

Learning to Faithfully Rationalize by Construction

2 code implementations ACL 2020 Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, Byron C. Wallace

In NLP this often entails extracting snippets of an input text "responsible for" corresponding model output; when such a snippet comprises tokens that indeed informed the model's prediction, it is a faithful explanation.

Feature Importance · text-classification · +1

NYTWIT: A Dataset of Novel Words in the New York Times

1 code implementation COLING 2020 Yuval Pinter, Cassandra L. Jacobs, Max Bittker

We present baseline results for both uncontextual and contextual prediction of novelty class, showing that there is room for improvement even for state-of-the-art NLP systems.

Attending Form and Context to Generate Specialized Out-of-Vocabulary Words Representations

no code implementations 14 Dec 2019 Nicolas Garneau, Jean-Samuel Leboeuf, Yuval Pinter, Luc Lamontagne

We propose a new contextual-compositional neural network layer that handles out-of-vocabulary (OOV) words in natural language processing (NLP) tagging tasks.

Sentence

Attention is not not Explanation

2 code implementations IJCNLP 2019 Sarah Wiegreffe, Yuval Pinter

We show that even when reliable adversarial distributions can be found, they don't perform well on the simple diagnostic, indicating that prior work does not disprove the usefulness of attention mechanisms for explainability.

Decision Making · Experimental Design

Character Eyes: Seeing Language through Character-Level Taggers

1 code implementation WS 2019 Yuval Pinter, Marc Marone, Jacob Eisenstein

Character-level models have been used extensively in recent years in NLP tasks as both supplements and replacements for closed-vocabulary token-level word representations.

POS

Predicting Semantic Relations using Global Graph Properties

1 code implementation EMNLP 2018 Yuval Pinter, Jacob Eisenstein

Semantic graphs, such as WordNet, are resources which curate natural language on two distinguishable layers.

Link Prediction

Sí o no, què penses? Catalonian Independence and Linguistic Identity on Social Media

no code implementations NAACL 2018 Ian Stewart, Yuval Pinter, Jacob Eisenstein

We also find that Catalan is used more often in referendum-related discourse than in other contexts, contrary to prior findings on language variation.

Sí o no, què penses? Catalonian Independence and Linguistic Identity on Social Media

1 code implementation 13 Apr 2018 Ian Stewart, Yuval Pinter, Jacob Eisenstein

We also find that Catalan is used more often in referendum-related discourse than in other contexts, contrary to prior findings on language variation.

Mimicking Word Embeddings using Subword RNNs

2 code implementations EMNLP 2017 Yuval Pinter, Robert Guthrie, Jacob Eisenstein

In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings.

Word Embeddings
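MIMICK itself trains a character BiLSTM to map spellings onto pretrained embedding targets. As a deliberately simplified stand-in, averaging character vectors instead of running an RNN, and with an invented character table, the spelling-to-embedding idea looks like:

```python
# Compose an embedding for an out-of-vocabulary word from its spelling.
# MIMICK learns this function with a character BiLSTM; here we average
# per-character vectors purely for illustration. The table is a toy.

char_vecs = {"c": [1.0, 0.0], "a": [0.0, 1.0], "t": [1.0, 1.0]}

def spell_to_vec(word, table, dim=2):
    """Average the character vectors of `word` (zeros for unknown chars)."""
    acc = [0.0] * dim
    for ch in word:
        for k, v in enumerate(table.get(ch, [0.0] * dim)):
            acc[k] += v
    return [v / len(word) for v in acc]

print(spell_to_vec("cat", char_vecs))  # both components ≈ 0.667
```

The key property, shared by the real model, is that any spelling yields a vector, so OOV words are never mapped to a single catch-all UNK embedding.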

The Yahoo Query Treebank, V. 1.0

no code implementations 10 May 2016 Yuval Pinter, Roi Reichart, Idan Szpektor

A description and annotation guidelines for the Yahoo Webscope release of Query Treebank, Version 1.0, May 2016.
