no code implementations • LREC 2022 • Isin Demirsahin, Cibu Johny, Alexander Gutkin, Brian Roark
This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script, a process known as romanization.
no code implementations • LREC 2022 • Alexander Gutkin, Cibu Johny, Raiomond Doctor, Lawrence Wolf-Sonkin, Brian Roark
The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems.
no code implementations • SLPAT (ACL) 2022 • Brian Roark, Alexander Gutkin
We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library.
1 code implementation • 19 May 2023 • Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel A. Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L. Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David I. Adelani, Vera Axelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson, Dmitry Panteleev, Partha Talukdar
We evaluate commonly used models on the benchmark.
no code implementations • 6 Mar 2023 • Elizabeth Nielsen, Christo Kirov, Brian Roark
Using a set of probe words unique to either British or American English, we first establish that training corpora exhibit substantial (though not total) consistency.
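As a rough illustration of the probe-word idea in the entry above (not the paper's actual word lists or corpora), the sketch below counts a few British/American spelling variants to gauge which dialect a corpus sample leans toward; the probe pairs and scoring are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical probe pairs: (British spelling, American spelling).
# The paper uses its own curated probe-word sets; these are just examples.
PROBES = [("colour", "color"), ("analyse", "analyze"), ("centre", "center")]

def dialect_counts(text: str) -> Counter:
    """Count British vs. American probe-word hits in a corpus sample."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    hits = Counter()
    for brit, amer in PROBES:
        hits["British"] += tokens[brit]
        hits["American"] += tokens[amer]
    return hits

if __name__ == "__main__":
    sample = "The colour of the center line is hard to analyse."
    print(dialect_counts(sample))  # Counter({'British': 2, 'American': 1})
```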
1 code implementation • 26 Jan 2023 • Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat
This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
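For a flavor of the kind of character-level rewriting such FST components encode, here is a minimal pure-Python sketch of visual-variant normalization; it is not the library's API, and which variant counts as canonical depends on the language (a Persian/Urdu-style convention is assumed here purely for illustration).

```python
# Minimal sketch of visual-variant normalization for Perso-Arabic text.
# The code-point choices assume a Persian/Urdu-style convention
# (e.g., ARABIC LETTER KAF -> KEHEH); the library described above expresses
# such rules as finite-state transducers, not as a Python dict.
CANONICAL = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}

def normalize(text: str) -> str:
    """Rewrite visually confusable code points to their canonical forms."""
    return "".join(CANONICAL.get(ch, ch) for ch in text)

if __name__ == "__main__":
    word = "\u0643\u062A\u0627\u0628"  # the word written with Arabic KAF
    print(normalize(word))             # same word with KEHEH substituted
```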
1 code implementation • 21 Oct 2022 • Raiomond Doctor, Alexander Gutkin, Cibu Johny, Brian Roark, Richard Sproat
Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages, representing standard letters, various diacritics, and punctuation for the original Arabic as well as for numerous other regional orthographic traditions.
no code implementations • Findings (EMNLP) 2021 • Kyle Gorman, Christo Kirov, Brian Roark, Richard Sproat
Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages.
2 code implementations • NAACL 2021 • Tiago Pimentel, Brian Roark, Søren Wichmann, Ryan Cotterell, Damián Blasi
It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words.
no code implementations • EACL 2021 • Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin, Brian Roark
This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts.
1 code implementation • EACL 2021 • Tiago Pimentel, Ryan Cotterell, Brian Roark
Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower).
1 code implementation • LREC 2020 • Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, Keith Hall
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages.
1 code implementation • TACL 2020 • Tiago Pimentel, Brian Roark, Ryan Cotterell
We present methods for calculating a measure of phonotactic complexity---bits per phoneme---that permits a straightforward cross-linguistic comparison.
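To make the measure in the entry above concrete, here is a toy sketch (not the paper's models, which include n-gram and neural phone-level LMs over a large language sample) that estimates bits per phoneme as the average negative log2 probability under an add-one-smoothed phone bigram model; the toy phoneme strings are invented.

```python
import math
from collections import Counter

# Toy "lexicon" of space-delimited phoneme strings; purely illustrative.
WORDS = ["p a t", "t a p", "p a p a", "t a t a"]

def bits_per_phoneme(words):
    """Average -log2 p(phoneme | previous phoneme) under an add-one bigram model."""
    seqs = [["<s>"] + w.split() + ["</s>"] for w in words]
    bigrams = Counter((s[i], s[i + 1]) for s in seqs for i in range(len(s) - 1))
    contexts = Counter(ph for s in seqs for ph in s[:-1])
    vocab = {ph for s in seqs for ph in s}
    total_bits, n_phonemes = 0.0, 0
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
            total_bits += -math.log2(p)
            n_phonemes += 1
    return total_bits / n_phonemes

if __name__ == "__main__":
    print(f"{bits_per_phoneme(WORDS):.2f} bits per phoneme")
```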
no code implementations • 20 Apr 2020 • Arindrima Datta, Bhuvana Ramabhadran, Jesse Emond, Anjuli Kannan, Brian Roark
Multilingual Automatic Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model.
no code implementations • WS 2019 • Ananda Theertha Suresh, Brian Roark, Michael Riley, Vlad Schogol
Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space.
no code implementations • WS 2019 • Lawrence Wolf-Sonkin, Vlad Schogol, Brian Roark, Michael Riley
The use of the Latin script for text entry of South Asian languages is common, even though there is no standard orthography for these languages in the script.
no code implementations • WS 2019 • Tiago Pimentel, Brian Roark, Ryan Cotterell
In this work, we propose the use of phone-level language models to estimate phonotactic complexity---measured in bits per phoneme---which makes cross-linguistic comparison straightforward.
1 code implementation • ACL 2019 • Tiago Pimentel, Arya D. McCarthy, Damián E. Blasi, Brian Roark, Ryan Cotterell
A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade?
no code implementations • ACL 2019 • Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, Jason Eisner
Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.
no code implementations • CL 2019 • Hao Zhang, Richard Sproat, Axel H. Ng, Felix Stahlberg, Xiaochang Peng, Kyle Gorman, Brian Roark
One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS).
no code implementations • CL (ACL) 2021 • Ananda Theertha Suresh, Brian Roark, Michael Riley, Vlad Schogol
Weighted finite automata (WFA) are often used to represent probabilistic models, such as $n$-gram language models, since they are efficient for recognition tasks in time and space.
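As a conceptual sketch of how an n-gram model maps onto a weighted automaton (states as histories, arc weights as negative log probabilities), the snippet below scores a phrase under a hard-coded bigram WFA in the tropical semiring; the probabilities are invented and backoff (failure) arcs are omitted, unlike the OpenFst-style representations the paper works with.

```python
import math

# A tiny bigram model as a weighted finite automaton: states are one-word
# histories, and each entry stores p(next word | history). The score of a
# sentence is the sum of -log p along its accepting path. Probabilities are
# invented for illustration, and real n-gram WFAs also need backoff arcs.
ARCS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.6, "dog": 0.4},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def score(words):
    """Total path weight (-log prob) of a sentence, or infinity if no path exists."""
    state, cost = "<s>", 0.0
    for w in words + ["</s>"]:
        prob = ARCS.get(state, {}).get(w)
        if prob is None:
            return math.inf  # no arc: the string is outside the model's support
        cost += -math.log(prob)
        state = w
    return cost

if __name__ == "__main__":
    print(f"-log p('the cat') = {score(['the', 'cat']):.3f}")
```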
no code implementations • NAACL 2018 • Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, Brian Roark
For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles?