Cross-Lingual Bitext Mining

5 papers with code • 4 benchmarks • 1 dataset

Cross-lingual bitext mining is the task of mining sentence pairs that are translations of each other from large text corpora.
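As a rough illustration of the task, the sketch below pairs sentences whose embeddings are mutual nearest neighbours under cosine similarity. It assumes the sentences have already been encoded by some multilingual sentence encoder. The function names, the toy vectors, and the 0.8 threshold are all hypothetical, and this is not the method of any particular paper listed here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual nearest neighbours
    whose similarity reaches the (arbitrary) threshold."""
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    pairs = []
    for i, row in enumerate(sims):
        # Best target candidate for source sentence i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source candidate for that target sentence.
        back = max(range(len(sims)), key=lambda k: sims[k][j])
        if back == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs

# Toy stand-ins for sentence embeddings in two languages (assumed data).
src = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.1], [-1.0, 0.2, 0.0]]
mined = mine_pairs(src, tgt)
```

In this toy data the first two source vectors each find a confident match, while the third has no mutual nearest neighbour and is left unpaired. Real systems search over millions of sentences and typically use approximate nearest-neighbour indexes rather than a full similarity matrix.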

Most implemented papers

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

facebookresearch/LASER TACL 2019

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts.

Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

facebookresearch/LASER ACL 2019

Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora.

Improving Neural Machine Translation Models with Monolingual Data

josephch405/curriculum-nmt ACL 2016

Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training.

Parallel Sentence Mining by Constrained Decoding

marian-nmt/marian-dev ACL 2020

We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation.

Majority Voting with Bidirectional Pre-translation For Bitext Retrieval

AlexJonesNLP/alt-bitexts RANLP (BUCC) 2021

Obtaining high-quality parallel corpora is of paramount importance for training NMT systems.