Lemmatization
61 papers with code • 0 benchmarks • 3 datasets
Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. For languages with rich morphology in particular, normalizing words to their base forms is important for applications such as search engines and linguistic studies. The main difficulties in lemmatization are handling previously unseen words at inference time and disambiguating surface forms that, depending on context, can be inflected variants of several different base forms.
Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
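The two difficulties named in the description (ambiguous surface forms and unseen words) can be illustrated with a minimal sketch. This is a toy lookup lemmatizer, not the sequence-to-sequence model from the cited paper: the lexicon, the suffix rules, and the `lemmatize` function are all hypothetical and chosen only for illustration. A part-of-speech tag stands in for context when disambiguating, and crude suffix stripping stands in for generalization to unseen words.

```python
from typing import Optional

# Toy lexicon (hypothetical data): (surface form, POS tag) -> lemma.
# "saw" is ambiguous: its lemma depends on the part of speech.
LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("mice", "NOUN"): "mouse",
    ("better", "ADJ"): "good",
}

# Fallback suffix rules (hypothetical): (suffix, replacement), tried in order.
SUFFIX_RULES = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def lemmatize(word: str, pos: Optional[str] = None) -> str:
    """Return a lemma for `word`, using `pos` to disambiguate when given."""
    key = word.lower()
    # 1. Known form: the POS tag selects among competing lemmas.
    if pos is not None and (key, pos) in LEXICON:
        return LEXICON[(key, pos)]
    # 2. Unseen form: apply the first suffix rule that matches,
    #    keeping at least three characters of the stem.
    for suffix, replacement in SUFFIX_RULES:
        if key.endswith(suffix) and len(key) > len(suffix) + 2:
            return key[: len(key) - len(suffix)] + replacement
    # 3. No rule applies: return the lowercased form unchanged.
    return key


print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
print(lemmatize("studies"))      # study
```

The suffix fallback is deliberately naive and will produce wrong lemmas for many unseen words, which is the gap that learned lemmatizers aim to close.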
Libraries
Use these libraries to find Lemmatization models and implementations
Latest papers with no code
Comparison of Current Approaches to Lemmatization: A Case Study in Estonian
This study evaluates three lemmatization approaches for Estonian: generative character-level models, pattern-based word-level classification models, and rule-based morphological analysis.
TartuNLP @ SIGTYP 2024 Shared Task: Adapting XLM-RoBERTa for Ancient and Historical Languages
We present our submission to the unconstrained subtask of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages, covering morphological annotation, POS-tagging, lemmatization, and character- and word-level gap-filling.
Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek
The texts have been enriched with seven annotation layers: (i) tokenization layer; (ii) sentence segmentation layer; (iii) lemmatization layer; (iv) morphological layer; (v) dependency layer; (vi) dependency function layer; (vii) Canonical Text Services (CTS) citation layer.
Cross-lingual Named Entity Corpus for Slavic Languages
The corpus consists of 5 017 documents on seven topics.
ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus
We present ZAEBUC-Spoken, a multilingual multidialectal Arabic-English speech corpus.
The effect of stemming and lemmatization on Portuguese fake news text classification
With the popularization of the internet, smartphones, and social media, information spreads quickly and easily, which increases the volume of information traffic worldwide; however, the dissemination of fake news is harming society.
Vacaspati: A Diverse Corpus of Bangla Literature
We also demonstrate the efficacy of Vacaspati as a corpus by showing that similar models built from other corpora are not as effective.
Advancing Full-Text Search Lemmatization Techniques with Paradigm Retrieval from OpenCorpora
In this paper, we unveil a groundbreaking method to amplify full-text search lemmatization, utilizing the OpenCorpora dataset and a bespoke paradigm retrieval algorithm.
LatinCy: Synthetic Trained Pipelines for Latin NLP
This paper introduces LatinCy, a set of trained general purpose Latin-language "core" pipelines for use with the spaCy natural language processing framework.
Exploring the Use of Foundation Models for Named Entity Recognition and Lemmatization Tasks in Slavic Languages
This paper describes Adam Mickiewicz University's (AMU) solution for the 4th Shared Task on SlavNER.