Cross-Lingual Document Classification
12 papers with code • 10 benchmarks • 2 datasets
Cross-lingual document classification is the task of using data and models from a language with ample resources (e.g., English) to solve classification tasks in another, typically low-resource, language.
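A toy sketch of the zero-shot setting this implies: if documents from both languages are represented in a shared embedding space, a classifier fit only on labelled source-language documents can be applied directly to the target language. The 2-D "embeddings", cluster centres, and nearest-centroid classifier below are illustrative inventions, not any particular paper's method.

```python
import random

random.seed(0)

def sample(center, n):
    # Draw n toy document embeddings scattered around a class centre.
    return [[c + random.gauss(0, 0.3) for c in center] for _ in range(n)]

# Labelled source-language data: class 0 near (-1, -1), class 1 near (+1, +1).
src = [(x, 0) for x in sample([-1, -1], 50)] + [(x, 1) for x in sample([1, 1], 50)]

# Fit a nearest-centroid classifier on the source language only.
centroids = []
for label in (0, 1):
    pts = [x for x, y in src if y == label]
    centroids.append([sum(col) / len(pts) for col in zip(*pts)])

def predict(x):
    # Assign a document to the class with the closest centroid.
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return dists.index(min(dists))

# Target-language documents fall in the same shared-space clusters,
# so the source-trained classifier transfers without target labels.
tgt = [(x, 0) for x in sample([-1, -1], 20)] + [(x, 1) for x in sample([1, 1], 20)]
accuracy = sum(predict(x) == y for x, y in tgt) / len(tgt)
print(accuracy)
```

The papers below differ mainly in how they construct that shared space: bilingual embeddings, joint sentence encoders, or adversarial feature alignment.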
Latest papers
Multilingual and cross-lingual document classification: A meta-learning approach
The great majority of languages in the world are considered under-resourced for the successful application of deep learning methods.
Robust Cross-lingual Embeddings from Parallel Sentences
Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation.
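The mapping-based approach mentioned here can be sketched with the classic orthogonal Procrustes solution: given embeddings of dictionary word pairs, the best orthogonal map from source to target space has a closed form via SVD. The toy data below is synthetic (the "true" rotation is only used to fabricate translation pairs); this is a minimal sketch of the family of methods, not any specific paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 5
# Ground-truth rotation, used only to build a toy bilingual dictionary.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = rng.normal(size=(100, d))  # source embeddings of dictionary words
Y = X @ Q.T                    # their target-language counterparts

# Orthogonal Procrustes: minimise ||X W - Y||_F over orthogonal W.
# Closed form: W = U V^T where X^T Y = U S V^T.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# The learned map should project source embeddings onto their translations.
err = float(np.abs(X @ W - Y).max())
print(err)
```

Constraining W to be orthogonal is what makes the mapping robust in practice: it preserves distances and angles within the source space.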
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging.
Bridging the domain gap in cross-lingual document classification
We consider the setting of semi-supervised cross-lingual understanding, where labeled data is available in a source language (English), but only unlabeled data is available in the target language.
MultiFiT: Efficient Multi-lingual Language Model Fine-tuning
Pretrained language models are particularly promising for low-resource languages, as they require only unlabelled data.
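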
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts.
A Corpus for Multilingual Document Classification in Eight Languages
In addition, we have observed that the class prior distributions differ significantly between the languages.
Learning Crosslingual Word Embeddings without Bilingual Corpora
Crosslingual word embeddings represent lexical items from different languages in the same vector space, enabling transfer of NLP tools.
Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification
To tackle the sentiment classification problem in low-resource languages without adequate annotated data, we propose an Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data on a resource-rich source language to low-resource languages where only unlabeled data exists.
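The "deep averaging" part of ADAN can be sketched very compactly: a document is represented as the average of its word embeddings, then passed through a small feed-forward layer; the adversarial language discriminator (omitted here) then pushes these features to be language-invariant. The vocabulary, layer sizes, and weights below are illustrative placeholders, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

vocab = {"good": 0, "bad": 1, "film": 2}
emb = rng.normal(size=(len(vocab), 4))  # toy word-embedding table
W1 = rng.normal(size=(4, 3))            # one feed-forward layer
b1 = np.zeros(3)

def dan_features(tokens):
    # Average the embeddings of the document's words ("deep averaging"),
    # then apply a ReLU layer to obtain the shared feature vector.
    avg = emb[[vocab[t] for t in tokens]].mean(axis=0)
    return np.maximum(avg @ W1 + b1, 0.0)

feats = dan_features(["good", "film"])
print(feats.shape)
```

In the full model, these features feed both a sentiment classifier (trained on source-language labels) and a language discriminator whose gradient is reversed, so the extractor learns features the discriminator cannot tell apart across languages.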
BilBOWA: Fast Bilingual Distributed Representations without Word Alignments
We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally-efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data.
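The cross-lingual term of this objective can be sketched as follows: for a pair of parallel sentences, pull the mean ("bag-of-words") embedding of one sentence toward the mean embedding of its translation, with no word-level alignment needed. The tiny vocabularies, learning rate, and training loop below are illustrative assumptions, and the monolingual skip-gram terms of the full model are omitted.

```python
import numpy as np

rng = np.random.default_rng(3)

emb_en = rng.normal(scale=0.1, size=(3, 4))  # tiny English embedding table
emb_fr = rng.normal(scale=0.1, size=(3, 4))  # tiny French embedding table

def bilbowa_step(sent_en, sent_fr, lr=0.1):
    # One gradient step on ||mean(en words) - mean(fr words)||^2.
    mu_en = emb_en[sent_en].mean(axis=0)
    mu_fr = emb_fr[sent_fr].mean(axis=0)
    diff = mu_en - mu_fr
    # Each word in the sentence receives 1/len of the mean's gradient.
    emb_en[sent_en] -= lr * 2 * diff / len(sent_en)
    emb_fr[sent_fr] += lr * 2 * diff / len(sent_fr)
    return float(diff @ diff)

# Repeatedly aligning one toy parallel sentence pair drives the loss down.
losses = [bilbowa_step([0, 1], [0, 2]) for _ in range(200)]
print(losses[0], losses[-1])
```

Because the loss touches only sentence means, it scales to large corpora and sidesteps word alignment entirely, which is the key design choice in the paper's title.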