Transliteration

46 papers with code • 0 benchmarks • 5 datasets

Transliteration is the task of converting a word from a source (foreign) language into the script of a target language, and it often adopts approaches from machine translation. In machine translation, the objective is to preserve the semantic meaning of the utterance as much as possible while following the syntactic structure of the target language. In transliteration, the objective is to preserve the original pronunciation of the source word as much as possible while following the phonological structure of the target language.
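
As a minimal illustration, a rule-based transliterator maps source graphemes to target graphemes that approximate the source pronunciation. The sketch below uses a tiny hand-written Latin-to-Cyrillic table; it is purely illustrative and not taken from any of the papers listed on this page.

```python
# Toy grapheme-level transliterator: longest-match lookup in a small,
# hand-written Latin-to-Cyrillic table. Real systems learn these
# correspondences (and their context dependence) from data.
RULES = {
    "ch": "ч", "sh": "ш",
    "a": "а", "e": "е", "m": "м", "n": "н",
    "r": "р", "s": "с", "t": "т",
}

def transliterate(word: str) -> str:
    out, i = [], 0
    while i < len(word):
        for width in (2, 1):  # prefer the longer source grapheme
            chunk = word[i:i + width].lower()
            if chunk in RULES:
                out.append(RULES[chunk])
                i += width
                break
        else:
            out.append(word[i])  # pass unknown characters through unchanged
            i += 1
    return "".join(out)

print(transliterate("Manchester"))  # -> "манчестер"
```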

For example, the city name “Manchester” has become well known to speakers of languages other than English, each of which renders it in its own script. Such words are often named entities that are important in cross-lingual information retrieval, information extraction, and machine translation, and they frequently present out-of-vocabulary challenges to spoken language technologies such as automatic speech recognition, spoken keyword search, and text-to-speech.

Source: Phonology-Augmented Statistical Framework for Machine Transliteration using Limited Linguistic Resources

Most implemented papers

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

google-research-datasets/dakshina LREC 2020

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages.
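
A minimal sketch of loading one of the dataset's romanization lexicons, assuming a tab-separated layout of native-script word, Latin-script romanization, and attestation count (the file name below is hypothetical; check the dataset's documentation for the exact layout):

```python
import csv
from collections import defaultdict

def load_lexicon(path):
    """Map each native-script word to its attested romanizations with counts.

    Assumes tab-separated lines of the form:
        <native-script word>\t<Latin romanization>\t<count>
    (verify against the dataset documentation before relying on this).
    """
    lexicon = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for native, roman, count in csv.reader(f, delimiter="\t"):
            lexicon[native][roman] = int(count)
    return lexicon

# Hypothetical file name; the released archive organizes files per language.
lex = load_lexicon("hi.translit.sampled.train.tsv")
```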

Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus

KurdishBLARK/InterdialectCorpus 4 Oct 2020

We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.

A Large-scale Evaluation of Neural Machine Transliteration for Indic Languages

anoopkunchukuttan/indic_transiteration_analysis EACL 2021

We take up the task of large-scale evaluation of neural machine transliteration between English and Indic languages, with a focus on multilingual transliteration to utilize orthographic similarity between Indian languages.
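
Such evaluations typically report exact-match (top-1) accuracy over a held-out lexicon, often allowing several acceptable references per source word. A minimal sketch of that metric, with hypothetical input structures:

```python
def top1_accuracy(predictions, references):
    """Fraction of source words whose top prediction matches any reference.

    predictions: dict mapping source word -> best predicted transliteration
    references:  dict mapping source word -> set of acceptable transliterations
    (both structures are hypothetical; adapt to the dataset's actual format).
    """
    correct = sum(
        1 for src, pred in predictions.items() if pred in references.get(src, set())
    )
    return correct / len(predictions) if predictions else 0.0

refs = {"manchester": {"मैनचेस्टर", "मानचेस्टर"}}
preds = {"manchester": "मैनचेस्टर"}
print(top1_accuracy(preds, refs))  # 1.0
```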

On Biasing Transformer Attention Towards Monotonicity

ZurichNLP/monotonicity_loss NAACL 2021

Many sequence-to-sequence tasks in natural language processing are roughly monotonic in the alignment between source and target sequence, and previous work has facilitated or enforced learning of monotonic attention behavior via specialized attention functions or pretraining.
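
One simple way to encourage such behavior is a regularizer that penalizes the attention's expected source position for moving backwards between consecutive target steps. The sketch below is illustrative of this idea only and is not the exact loss proposed in the paper:

```python
import torch

def monotonicity_penalty(attn: torch.Tensor) -> torch.Tensor:
    """Penalize attention whose expected source position moves backwards.

    attn: (batch, target_len, source_len) attention weights, rows sum to 1.
    Illustrative regularizer only, not the paper's exact objective.
    """
    src_positions = torch.arange(attn.size(-1), dtype=attn.dtype, device=attn.device)
    expected = (attn * src_positions).sum(dim=-1)          # (batch, target_len)
    backwards = (expected[:, :-1] - expected[:, 1:]).clamp(min=0.0)
    return backwards.mean()

attn = torch.softmax(torch.randn(2, 5, 7), dim=-1)
print(monotonicity_penalty(attn))  # added to the main loss with a small weight
```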

Neural String Edit Distance

jlibovicky/neural-string-edit-distance spnlp (ACL) 2022

We propose the neural string edit distance model for string-pair matching and string transduction based on learnable string edit distance.
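
For reference, the model generalizes the classic dynamic-programming recursion for edit distance, in which fixed unit costs and the hard minimum are replaced by learnable scores and a differentiable aggregation. The plain (non-learned) recursion looks like:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program.

    In the neural variant, the fixed unit costs below become learnable,
    character-pair-dependent scores and the hard min becomes a
    differentiable aggregation.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion of ca
                curr[j - 1] + 1,           # insertion of cb
                prev[j - 1] + (ca != cb),  # substitution (free if match)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```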

Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study

yashkhem1/RelateLM ACL 2021

RelateLM uses transliteration to convert the unseen script of limited LRL text into the script of a Related Prominent Language (RPL) (Hindi in our case).
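
Because the major Indic scripts occupy parallel, ISCII-derived Unicode blocks, such script conversion can be approximated by shifting code points between blocks. The sketch below illustrates that general trick only; RelateLM's actual conversion procedure may differ, and a few characters have no one-to-one counterpart:

```python
# Rough script conversion between Indic Unicode blocks, which share a
# parallel ISCII-derived layout: shift each code point by the difference
# between the two blocks' starting offsets. Characters outside the source
# block (or without a counterpart) are passed through unchanged.
BLOCK_START = {"devanagari": 0x0900, "bengali": 0x0980, "gujarati": 0x0A80}

def convert_script(text: str, src: str, tgt: str) -> str:
    lo = BLOCK_START[src]
    shift = BLOCK_START[tgt] - lo
    return "".join(
        chr(ord(ch) + shift) if lo <= ord(ch) < lo + 0x80 else ch
        for ch in text
    )

# Bengali "বাংলা" mapped onto the Devanagari block.
print(convert_script("বাংলা", "bengali", "devanagari"))
```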

Specializing Multilingual Language Models: An Empirical Study

ethch18/specializing-multilingual EMNLP (MRL) 2021

Pretrained multilingual language models have become a common tool in transferring NLP capabilities to low-resource languages, often with adaptations.

Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts

chaarangan/odl-tamil-sn 24 Aug 2021

The experimental results showed that ULMFiT is the best model for this task.

Cross-Lingual Text Classification of Transliterated Hindi and Malayalam

jitinkrishnan/transliteration-hindi-malayalam 31 Aug 2021

Transliteration is very common on social media, but transliterated text is not adequately handled by modern neural models for various NLP tasks.

Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages

ibm/indo-aryan-language-family-model EMNLP 2021

We hypothesize and validate that multilingual fine-tuning of pre-trained language models can yield better performance on downstream NLP applications, compared to models fine-tuned on individual languages.