Transliteration
46 papers with code • 0 benchmarks • 5 datasets
Transliteration is a mechanism for converting a word from a source (foreign) language into the script of a target language, and often adopts approaches from machine translation. In machine translation, the objective is to preserve the semantic meaning of the utterance as much as possible while following the syntactic structure of the target language. In transliteration, the objective is to preserve the original pronunciation of the source word as much as possible while following the phonological structures of the target language.
For example, the city name “Manchester” is known to speakers of many languages through transliterations that approximate its English pronunciation in their own scripts. Such words are often named entities that are important in cross-lingual information retrieval, information extraction, and machine translation, and they frequently present out-of-vocabulary challenges to spoken language technologies such as automatic speech recognition, spoken keyword search, and text-to-speech.
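The simplest form of the idea above can be sketched as a rule-based transliterator that maps source graphemes to target graphemes, preferring the longest match at each position. The English-to-Cyrillic rule table below is a small hand-written illustration, not a complete or authoritative ruleset; real systems learn such mappings (e.g., with neural sequence-to-sequence models, as in the papers listed below).

```python
# Toy rule-based transliterator: longest-match grapheme mapping.
# The rule table is a hypothetical, hand-written English -> Cyrillic
# fragment for illustration only.

RULES = {
    "ch": "ч", "sh": "ш", "m": "м", "a": "а", "n": "н",
    "e": "е", "s": "с", "t": "т", "r": "р",
}

def transliterate(word: str, rules: dict = RULES) -> str:
    word = word.lower()
    out, i = [], 0
    max_len = max(map(len, rules))
    while i < len(word):
        # Try the longest grapheme first so "ch" wins over "c" + "h".
        for n in range(max_len, 0, -1):
            chunk = word[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            out.append(word[i])  # pass unmapped characters through
            i += 1
    return "".join(out)

print(transliterate("Manchester"))  # манчестер
```

Longest-match resolution matters because many scripts use digraphs: without it, “ch” would be transliterated character by character instead of as the single sound it represents.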
Benchmarks
These leaderboards are used to track progress in Transliteration
Most implemented papers
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages.
Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
A Large-scale Evaluation of Neural Machine Transliteration for Indic Languages
We take up the task of large-scale evaluation of neural machine transliteration between English and Indic languages, with a focus on multilingual transliteration to utilize orthographic similarity between Indian languages.
On Biasing Transformer Attention Towards Monotonicity
Many sequence-to-sequence tasks in natural language processing are roughly monotonic in the alignment between source and target sequence, and previous work has facilitated or enforced learning of monotonic attention behavior via specialized attention functions or pretraining.
Neural String Edit Distance
We propose the neural string edit distance model for string-pair matching and string transduction based on learnable string edit distance.
Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study
RelateLM uses transliteration to convert the unseen script of limited low web-resource language (LRL) text into the script of a Related Prominent Language (RPL) (Hindi in our case).
Specializing Multilingual Language Models: An Empirical Study
Pretrained multilingual language models have become a common tool in transferring NLP capabilities to low-resource languages, often with adaptations.
Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts
The experimental results showed that ULMFiT is the best model for this task.
Cross-Lingual Text Classification of Transliterated Hindi and Malayalam
Transliteration is very common on social media, but transliterated text is not adequately handled by modern neural models for various NLP tasks.
Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages
We hypothesize and validate that multilingual fine-tuning of pre-trained language models can yield better performance on downstream NLP applications, compared to models fine-tuned on individual languages.