Language Identification

123 papers with code • 6 benchmarks • 19 datasets

Language identification is the task of determining which language a given text or speech recording is in.
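As a rough illustration of the task on written text, the sketch below scores an input against per-language character trigram profiles and picks the best match. It is a toy baseline under stated assumptions, not the method of any model or paper listed on this page: the function names (`char_ngrams`, `train`, `identify`), the `SAMPLES` training sentences, and the add-one smoothing are all hypothetical choices made for the example.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Build one n-gram count profile per language from {lang: [texts]}."""
    return {
        lang: Counter(g for text in texts for g in char_ngrams(text, n))
        for lang, texts in samples.items()
    }

def identify(text, profiles, n=3):
    """Return the language whose profile gives the highest log-likelihood."""
    grams = char_ngrams(text, n)
    # Shared vocabulary size, used for add-one (Laplace) smoothing.
    vocab_size = len(set().union(*profiles.values()))

    def score(profile):
        total = sum(profile.values())
        return sum(
            math.log((profile[g] + 1) / (total + vocab_size)) for g in grams
        )

    return max(profiles, key=lambda lang: score(profiles[lang]))

# Tiny, made-up training set (an assumption for illustration only).
SAMPLES = {
    "en": ["the quick brown fox jumps over the lazy dog",
           "this is a simple sentence in english"],
    "de": ["der schnelle braune fuchs springt über den faulen hund",
           "dies ist ein einfacher satz auf deutsch"],
    "es": ["el rápido zorro marrón salta sobre el perro perezoso",
           "esta es una frase sencilla en español"],
}

profiles = train(SAMPLES)
print(identify("the dog is over there", profiles))
```

With so little training data the sketch only separates clearly different inputs; systems on the leaderboards below are trained on far larger corpora and, for the speech benchmarks, operate on audio rather than characters.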

Benchmarks

These leaderboards are used to track progress in Language Identification.

Dataset	Best Model
VoxLingua107	XLS-R
OpenSubtitles	Apple bi-LSTM
Universal Dependencies	Apple bi-LSTM
Nordic Language Identification	FastText
GlotLID-C	GlotLID
VoxForge	ConformerG-P

Libraries

Use these libraries to find Language Identification models and implementations.

pytorch/fairseq • 2 papers • 29,251 stars

facebookresearch/fairseq • 2 papers • 29,237 stars

Most implemented papers

The WiLI benchmark dataset for written language identification

birolkuyumcu/language_identification • 23 Jan 2018

This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification.

SpeechBrain: A General-Purpose Speech Toolkit

speechbrain/speechbrain • 8 Jun 2021

SpeechBrain is an open-source and all-in-one speech toolkit.

Scaling Speech Technology to 1,000+ Languages

facebookresearch/fairseq • arXiv 2023

Expanding the language coverage of speech technology has the potential to improve access to information for many more people.

GlotLID: Language Identification for Low-Resource Languages

cisnlp/glotlid • 24 Oct 2023

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages.

Universal Dependency Parsing for Hindi-English Code-switching

irshadbhat/nsdp-cs • NAACL 2018

We present a treebank of Hindi-English code-switching tweets under Universal Dependencies scheme and propose a neural stacking model for parsing that efficiently leverages part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks.

Predicting the Type and Target of Offensive Posts in Social Media

idontflow/olid • NAACL 2019

In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media.

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

VadymV/OffensEval • SEMEVAL 2019

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval).

Word-level Embeddings for Cross-Task Transfer Learning in Speech Processing

bepierre/SpeechVGG • 22 Oct 2019

Recent breakthroughs in deep learning often rely on representation learning and knowledge transfer.

Common Voice: A Massively-Multilingual Speech Corpus

facebookresearch/covost • LREC 2020

To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages.

VoxLingua107: a Dataset for Spoken Language Recognition

alumae/torch-xvectors-wav • 25 Nov 2020

Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech.
