Transliteration
45 papers with code • 0 benchmarks • 5 datasets
Transliteration is the process of converting a word from a source (foreign) language into a target language, and it often adopts approaches from machine translation. In machine translation, the objective is to preserve the semantic meaning of the utterance as much as possible while following the syntactic structure of the target language. In transliteration, the objective is to preserve the original pronunciation of the source word as much as possible while following the phonological structure of the target language.
For example, the city name “Manchester” has become well known to speakers of languages other than English, each of which renders it in its own script. Transliterated words are often named entities that are important in cross-lingual information retrieval, information extraction, and machine translation, and they often pose out-of-vocabulary challenges for spoken language technologies such as automatic speech recognition, spoken keyword search, and text-to-speech.
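To make the contrast with translation concrete, the sketch below transliterates “Manchester” into Devanagari with a greedy longest-match pass over a grapheme table. The mapping is a hand-picked illustration, not a real English-to-Devanagari scheme; production systems typically learn such mappings from parallel name pairs.

```python
# A minimal sketch of rule-based transliteration: map English grapheme
# clusters to Devanagari approximations of their pronunciation. The
# mapping table below is a simplified, hand-picked illustration, not a
# complete English-to-Devanagari scheme.
GRAPHEME_MAP = {
    "man": "मैन", "ches": "चेस्", "ter": "टर",
}

def transliterate(word: str) -> str:
    """Greedy longest-match transliteration over a grapheme table."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):  # try longest match first
            chunk = word[i:i + length]
            if chunk in GRAPHEME_MAP:
                out.append(GRAPHEME_MAP[chunk])
                i += length
                break
        else:
            i += 1  # skip characters with no mapping
    return "".join(out)

print(transliterate("Manchester"))  # मैनचेस्टर
```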
Benchmarks
These leaderboards are used to track progress in Transliteration.
Latest papers
Does Transliteration Help Multilingual Language Modeling?
We empirically measure the effect of transliteration on multilingual language models (MLLMs).
IIITT@Dravidian-CodeMix-FIRE2021: Transliterate or translate? Sentiment analysis of code-mixed text in Dravidian languages
This paper makes a modest contribution to this line of research in the form of sentiment analysis of code-mixed social media comments in the popular Dravidian languages Kannada, Tamil, and Malayalam.
Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages
We hypothesize and validate that multilingual fine-tuning of pre-trained language models can yield better performance on downstream NLP applications, compared to models fine-tuned on individual languages.
Cross-Lingual Text Classification of Transliterated Hindi and Malayalam
Transliteration is very common on social media, but transliterated text is not adequately handled by modern neural models for various NLP tasks.
Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts
The experimental results showed that ULMFiT is the best model for this task.
Specializing Multilingual Language Models: An Empirical Study
Pretrained multilingual language models have become a common tool in transferring NLP capabilities to low-resource languages, often with adaptations.
Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study
RelateLM uses transliteration to convert the unseen script of limited LRL text into the script of a Related Prominent Language (RPL), Hindi in our case.
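As a rough illustration of script conversion between related Indic languages, the sketch below exploits the parallel layout of Unicode’s Indic blocks (Devanagari starts at U+0900, Bengali at U+0980), so many corresponding letters differ by a fixed codepoint offset. This is not RelateLM’s actual pipeline, only a minimal approximation; a handful of codepoints do not map one-to-one across the two blocks.

```python
# Sketch of offset-based script conversion between parallel Unicode
# blocks: Devanagari starts at U+0900 and Bengali at U+0980, so many
# corresponding letters differ by a fixed offset. Approximate only.
BENGALI_START, BENGALI_END = 0x0980, 0x09FF
OFFSET = 0x0980 - 0x0900  # Bengali block start minus Devanagari block start

def bengali_to_devanagari(text: str) -> str:
    return "".join(
        chr(ord(ch) - OFFSET) if BENGALI_START <= ord(ch) <= BENGALI_END else ch
        for ch in text
    )

print(bengali_to_devanagari("বাংলা"))  # -> बांला
```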
Sub-Character Tokenization for Chinese Pretrained Language Models
Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, making them robust to homophone typos.
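A minimal sketch of the homophone-collapsing idea: map each character to a romanized pronunciation first, so that words that sound alike produce identical token sequences. The tiny pinyin table is hypothetical and hand-picked; a real tokenizer would draw on a full pronunciation lexicon.

```python
# Sketch of how a pronunciation-based tokenizer can collapse homophones:
# characters are first mapped to their romanized pronunciation, so words
# that sound alike tokenize identically. The pinyin table here is a
# hand-picked illustration, not a full lexicon.
PINYIN = {"他": "ta1", "她": "ta1", "它": "ta1", "们": "men5"}

def to_pronunciation(text: str) -> list[str]:
    return [PINYIN.get(ch, ch) for ch in text]

# All three homophones map to the same pronunciation sequence, so a
# downstream tokenizer sees identical input and tolerates homophone typos.
print(to_pronunciation("他们"))  # ['ta1', 'men5']
print(to_pronunciation("她们"))  # ['ta1', 'men5']
```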
Neural String Edit Distance
We propose a neural string edit distance model for string-pair matching and string transduction, based on a learnable string edit distance.
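For reference, the classical (non-learnable) string edit distance that such a model generalizes is the standard dynamic program below; roughly, the learnable variant replaces the unit operation costs with trained scores.

```python
# Classical string edit distance (Levenshtein) via dynamic programming.
# A learnable variant would replace the unit costs of insertion,
# deletion, and substitution with trained, character-pair-specific scores.
def edit_distance(a: str, b: str) -> int:
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dp[i][j] = min(dp[i - 1][j] + 1,  # deletion
                           dp[i][j - 1] + 1,  # insertion
                           sub)               # substitution or match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```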
On Biasing Transformer Attention Towards Monotonicity
Many sequence-to-sequence tasks in natural language processing are roughly monotonic in the alignment between source and target sequence, and previous work has facilitated or enforced learning of monotonic attention behavior via specialized attention functions or pretraining.
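One simple way to realize such a bias, sketched below, is to subtract a penalty proportional to each position pair’s distance from the diagonal before the softmax. The penalty weight and normalization here are illustrative choices, not the specific scheme evaluated in the paper.

```python
import numpy as np

# Sketch of biasing attention toward monotonic alignments: penalize
# score(i, j) by its (normalized) distance from the diagonal before the
# row-wise softmax. Weight and normalization are illustrative choices.
def monotonic_biased_attention(scores: np.ndarray, weight: float = 1.0) -> np.ndarray:
    tgt_len, src_len = scores.shape
    i = np.arange(tgt_len)[:, None] / max(tgt_len - 1, 1)  # target positions in [0, 1]
    j = np.arange(src_len)[None, :] / max(src_len - 1, 1)  # source positions in [0, 1]
    biased = scores - weight * np.abs(i - j)  # penalty grows off-diagonal
    e = np.exp(biased - biased.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # row-wise softmax

attn = monotonic_biased_attention(np.random.randn(4, 6), weight=5.0)
print(attn.round(2))  # attention mass concentrates near the diagonal
```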