no code implementations • LREC 2022 • Alexander Gutkin, Cibu Johny, Raiomond Doctor, Lawrence Wolf-Sonkin, Brian Roark
The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems.
no code implementations • LREC 2022 • Isin Demirsahin, Cibu Johny, Alexander Gutkin, Brian Roark
This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script, a process known as romanization.
1 code implementation • 26 Jan 2023 • Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat
This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
1 code implementation • 21 Oct 2022 • Raiomond Doctor, Alexander Gutkin, Cibu Johny, Brian Roark, Richard Sproat
Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions.
no code implementations • EACL 2021 • Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin, Brian Roark
This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts.
1 code implementation • 14 Oct 2020 • Alena Butryna, Shan-Hui Cathy Chu, Isin Demirsahin, Alexander Gutkin, Linne Ha, Fei He, Martin Jansche, Cibu Johny, Anna Katanova, Oddur Kjartansson, Chenfang Li, Tatiana Merkulova, Yin May Oo, Knot Pipatsrisawat, Clara Rivera, Supheakmungkol Sarin, Pasindu De Silva, Keshan Sodimana, Richard Sproat, Theeraphol Wattanavekin, Jaka Aris Eko Wibawa
This paper presents an overview of a program designed to address the growing need for developing freely available speech resources for under-represented languages.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
1 code implementation • LREC 2020 • Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, Keith Hall
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages.
no code implementations • LREC 2020 • Fei He, Shan-Hui Cathy Chu, Oddur Kjartansson, Clara Rivera, Anna Katanova, Alex Gutkin, er, Isin Demirsahin, Cibu Johny, Martin Jansche, Supheakmungkol Sarin, Knot Pipatsrisawat
We present free high quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty two official languages of India spoken by 374 million native speakers.