Language Identification
123 papers with code • 6 benchmarks • 19 datasets
Language identification is the task of determining the language of a text.
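As a rough illustration of the task defined above, here is a minimal sketch of a character-trigram classifier with add-one smoothing. It is a toy baseline, not the method of any paper listed on this page. The three training sentences and the names `SAMPLES`, `trigrams`, `train`, and `identify` are made up for this example; practical systems train on far larger corpora and cover many more languages.

```python
import math
from collections import Counter

# Tiny made-up training samples (assumption: real systems use large corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the language of the text",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist die sprache des textes",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y este es el idioma del texto",
}


def trigrams(text):
    """Return the overlapping character trigrams of a lowercased, padded text."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Build one trigram count profile per language."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def identify(text, profiles):
    """Return the language whose profile gives the text the highest log-likelihood."""
    # Shared vocabulary size, used for add-one (Laplace) smoothing.
    vocab = len(set().union(*profiles.values()))
    best_lang, best_score = None, -math.inf
    for lang, counts in profiles.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[gram] + 1) / (total + vocab))
            for gram in trigrams(text)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


profiles = train(SAMPLES)
print(identify("the dog is lazy", profiles))
```

With so little training data the sketch is only reliable on inputs that share trigrams with the samples; short, noisy, or code-switched inputs are exactly the cases several of the papers below treat as difficult.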
Latest papers
What is Learnt by the LEArnable Front-end (LEAF)? Adapting Per-Channel Energy Normalisation (PCEN) to Noisy Conditions
There is increasing interest in the use of the LEArnable Front-end (LEAF) in a variety of speech processing systems.
Geographically-Informed Language Identification
The result is a highly accurate model that covers 916 languages at a sample size of 50 characters, with performance improved by incorporating geographic information into the model.
Language and Speech Technology for Central Kurdish Varieties
Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum and known for its diversity in language varieties.
KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection
SemEval-2024 Task 8 is focused on multigenerator, multidomain, and multilingual black-box machine-generated text detection.
Code-Switched Language Identification is Harder Than You Think
Code switching (CS) is a very common phenomenon in written and spoken communication but one that is handled poorly by many natural language processing applications.
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation.
Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math
Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus.
OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification
Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech.
GlotLID: Language Identification for Low-Resource Languages
Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages.
Native Language Identification with Big Bird Embeddings
Native Language Identification (NLI) intends to classify an author's native language based on their writing in another language.