no code implementations • LREC 2022 • Isin Demirsahin, Cibu Johny, Alexander Gutkin, Brian Roark
This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script, a process known as romanization.
no code implementations • LREC 2022 • Alexander Gutkin, Cibu Johny, Raiomond Doctor, Lawrence Wolf-Sonkin, Brian Roark
The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems.
no code implementations • SLPAT (ACL) 2022 • Brian Roark, Alexander Gutkin
We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library.
1 code implementation • 19 May 2023 • Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel A. Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L. Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David I. Adelani, Vera Axelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson, Dmitry Panteleev, Partha Talukdar
We evaluate commonly used models on the benchmark.
no code implementations • 6 Mar 2023 • Elizabeth Nielsen, Christo Kirov, Brian Roark
Using a set of probe words unique to either British or American English, we first establish that training corpora exhibit substantial (though not total) consistency.
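As a rough illustration of the probe-word idea in the entry above (not the paper's actual word lists or corpora), the sketch below counts a few British/American spelling variants to gauge which dialect a corpus sample leans toward; the probe pairs and scoring are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical probe pairs: (British spelling, American spelling).
# The paper uses its own curated probe-word sets; these are just examples.
PROBES = [("colour", "color"), ("analyse", "analyze"), ("centre", "center")]

def dialect_counts(text: str) -> Counter:
    """Count British vs. American probe-word hits in a corpus sample."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    hits = Counter()
    for brit, amer in PROBES:
        hits["British"] += tokens[brit]
        hits["American"] += tokens[amer]
    return hits

if __name__ == "__main__":
    sample = "The colour of the center line is hard to analyse."
    print(dialect_counts(sample))  # Counter({'British': 2, 'American': 1})
```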
1 code implementation • 26 Jan 2023 • Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat
This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
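For a flavor of the kind of character-level rewriting such FST components encode, here is a minimal pure-Python sketch of visual-variant normalization; it is not the library's API, and which variant counts as canonical depends on the language (a Persian/Urdu-style convention is assumed here purely for illustration).

```python
# Minimal sketch of visual-variant normalization for Perso-Arabic text.
# The code-point choices assume a Persian/Urdu-style convention
# (e.g., ARABIC LETTER KAF -> KEHEH); the library described above expresses
# such rules as finite-state transducers, not as a Python dict.
CANONICAL = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}

def normalize(text: str) -> str:
    """Rewrite visually confusable code points to their canonical forms."""
    return "".join(CANONICAL.get(ch, ch) for ch in text)

if __name__ == "__main__":
    word = "\u0643\u062A\u0627\u0628"  # the word written with Arabic KAF
    print(normalize(word))             # same word with KEHEH substituted
```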
1 code implementation • 21 Oct 2022 • Raiomond Doctor, Alexander Gutkin, Cibu Johny, Brian Roark, Richard Sproat
Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages, representing standard letters, various diacritics, and punctuation for the original Arabic as well as for numerous other regional orthographic traditions.
no code implementations • Findings (EMNLP) 2021 • Kyle Gorman, Christo Kirov, Brian Roark, Richard Sproat
Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages.
2 code implementations • NAACL 2021 • Tiago Pimentel, Brian Roark, Søren Wichmann, Ryan Cotterell, Damián Blasi
It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words.
no code implementations • EACL 2021 • Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin, Brian Roark
This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts.
1 code implementation • EACL 2021 • Tiago Pimentel, Ryan Cotterell, Brian Roark
Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower).
1 code implementation • LREC 2020 • Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, Keith Hall
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages.
1 code implementation • TACL 2020 • Tiago Pimentel, Brian Roark, Ryan Cotterell
We present methods for calculating a measure of phonotactic complexity---bits per phoneme---that permits a straightforward cross-linguistic comparison.
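To make the measure in the entry above concrete, here is a toy sketch (not the paper's models, which include n-gram and neural phone-level LMs over a large language sample) that estimates bits per phoneme as the average negative log2 probability under an add-one-smoothed phone bigram model; the toy phoneme strings are invented.

```python
import math
from collections import Counter

# Toy "lexicon" of space-delimited phoneme strings; purely illustrative.
WORDS = ["p a t", "t a p", "p a p a", "t a t a"]

def bits_per_phoneme(words):
    """Average -log2 p(phoneme | previous phoneme) under an add-one bigram model."""
    seqs = [["<s>"] + w.split() + ["</s>"] for w in words]
    bigrams = Counter((s[i], s[i + 1]) for s in seqs for i in range(len(s) - 1))
    contexts = Counter(ph for s in seqs for ph in s[:-1])
    vocab = {ph for s in seqs for ph in s}
    total_bits, n_phonemes = 0.0, 0
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
            total_bits += -math.log2(p)
            n_phonemes += 1
    return total_bits / n_phonemes

if __name__ == "__main__":
    print(f"{bits_per_phoneme(WORDS):.2f} bits per phoneme")
```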
no code implementations • 20 Apr 2020 • Arindrima Datta, Bhuvana Ramabhadran, Jesse Emond, Anjuli Kannan, Brian Roark
Multilingual Automatic Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model.
no code implementations • WS 2019 • Ananda Theertha Suresh, Brian Roark, Michael Riley, Vlad Schogol
Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space.
no code implementations • WS 2019 • Lawrence Wolf-Sonkin, Vlad Schogol, Brian Roark, Michael Riley
The use of the Latin script for text entry of South Asian languages is common, even though there is no standard orthography for these languages in the script.
no code implementations • WS 2019 • Tiago Pimentel, Brian Roark, Ryan Cotterell
In this work, we propose the use of phone-level language models to estimate phonotactic complexity---measured in bits per phoneme---which makes cross-linguistic comparison straightforward.
1 code implementation • ACL 2019 • Tiago Pimentel, Arya D. McCarthy, Damián E. Blasi, Brian Roark, Ryan Cotterell
A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade?
no code implementations • ACL 2019 • Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, Jason Eisner
Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.
no code implementations • CL 2019 • Hao Zhang, Richard Sproat, Axel H. Ng, Felix Stahlberg, Xiaochang Peng, Kyle Gorman, Brian Roark
One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS).
no code implementations • CL (ACL) 2021 • Ananda Theertha Suresh, Brian Roark, Michael Riley, Vlad Schogol
Weighted finite automata (WFA) are often used to represent probabilistic models, such as $n$-gram language models, since they are efficient for recognition tasks in time and space.
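As a conceptual sketch of how an n-gram model maps onto a weighted automaton (states as histories, arc weights as negative log probabilities), the snippet below scores a phrase under a hard-coded bigram WFA in the tropical semiring; the probabilities are invented and backoff (failure) arcs are omitted, unlike the OpenFst-style representations the paper works with.

```python
import math

# A tiny bigram model as a weighted finite automaton: states are one-word
# histories, and each entry stores p(next word | history). The score of a
# sentence is the sum of -log p along its accepting path. Probabilities are
# invented for illustration, and real n-gram WFAs also need backoff arcs.
ARCS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.6, "dog": 0.4},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def score(words):
    """Total path weight (-log prob) of a sentence, or infinity if no path exists."""
    state, cost = "<s>", 0.0
    for w in words + ["</s>"]:
        prob = ARCS.get(state, {}).get(w)
        if prob is None:
            return math.inf  # no arc: the string is outside the model's support
        cost += -math.log(prob)
        state = w
    return cost

if __name__ == "__main__":
    print(f"-log p('the cat') = {score(['the', 'cat']):.3f}")
```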
no code implementations • NAACL 2018 • Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, Brian Roark
For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles?