Lemmatization

61 papers with code • 0 benchmarks • 3 datasets

Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. For languages with rich morphology in particular, normalizing words to their base forms is important for applications such as search engines and linguistic studies. The main difficulties in lemmatization are handling previously unseen words at inference time and disambiguating surface forms that can be inflected variants of several different base forms depending on the context.

Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
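
The two difficulties described above, context-dependent ambiguity and unseen words, can be illustrated with a minimal Python sketch. This is not the method of the cited paper. It assumes a toy lexicon keyed by surface form and part-of-speech tag, plus a few crude English suffix rules as a fallback. The names `LEXICON`, `SUFFIX_RULES`, and `lemmatize` are hypothetical and exist only for this illustration.

```python
from typing import Optional

# Hypothetical toy lexicon: (surface form, POS tag) -> lemma.
# The POS tag stands in for "context" when a form is ambiguous.
LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("mice", "NOUN"): "mouse",
}

# Hypothetical fallback rules for unseen words: (suffix, replacement),
# tried in order. Real systems learn such transformations from data.
SUFFIX_RULES = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def lemmatize(form: str, pos: Optional[str] = None) -> str:
    """Return a lemma for `form`, using `pos` to disambiguate when possible."""
    word = form.lower()
    # 1. Known (form, POS) pair: the tag resolves the ambiguity.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]
    # 2. Unseen word: apply the first suffix rule that leaves a stem
    #    of at least three characters.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: len(word) - len(suffix)] + replacement
    # 3. Nothing applies: return the lowercased form unchanged.
    return word


print(lemmatize("saw", "VERB"))      # ambiguous form, resolved as a verb
print(lemmatize("saw", "NOUN"))      # same form, resolved as a noun
print(lemmatize("studies", "NOUN"))  # unseen word, handled by a suffix rule
```

The fallback rules are deliberately naive (for example, they would turn "running" into "runn"), which is the kind of gap that learned lemmatizers aim to close.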

Libraries

Use these libraries to find Lemmatization models and implementations

huspacy/huspacy • 3 papers • 148

Latest papers with no code

Comparison of Current Approaches to Lemmatization: A Case Study in Estonian

no code yet • 23 Apr 2024

This study evaluates three lemmatization approaches for Estonian: generative character-level models, pattern-based word-level classification models, and rule-based morphological analysis.

TartuNLP @ SIGTYP 2024 Shared Task: Adapting XLM-RoBERTa for Ancient and Historical Languages

no code yet • 19 Apr 2024

We present our submission to the unconstrained subtask of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages, covering morphological annotation, POS tagging, lemmatization, and character- and word-level gap-filling.

Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek

no code yet • 31 Mar 2024

The texts have been enriched with seven annotation layers: (i) tokenization layer; (ii) sentence segmentation layer; (iii) lemmatization layer; (iv) morphological layer; (v) dependency layer; (vi) dependency function layer; (vii) Canonical Text Services (CTS) citation layer.

Cross-lingual Named Entity Corpus for Slavic Languages

no code yet • 30 Mar 2024

The corpus consists of 5 017 documents on seven topics.

ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus

no code yet • 27 Mar 2024

We present ZAEBUC-Spoken, a multilingual multidialectal Arabic-English speech corpus.

The effect of stemming and lemmatization on Portuguese fake news text classification

no code yet • 17 Oct 2023

With the popularization of the internet, smartphones, and social media, information spreads quickly and easily, which increases the volume of information in circulation worldwide; however, the dissemination of fake news is harming society.

Vacaspati: A Diverse Corpus of Bangla Literature

no code yet • 11 Jul 2023

We also demonstrate the efficacy of Vacaspati as a corpus by showing that similar models built from other corpora are not as effective.

Advancing Full-Text Search Lemmatization Techniques with Paradigm Retrieval from OpenCorpora

no code yet • 18 May 2023

In this paper, we unveil a groundbreaking method to amplify full-text search lemmatization, utilizing the OpenCorpora dataset and a bespoke paradigm retrieval algorithm.

LatinCy: Synthetic Trained Pipelines for Latin NLP

no code yet • 7 May 2023

This paper introduces LatinCy, a set of trained general purpose Latin-language "core" pipelines for use with the spaCy natural language processing framework.

Exploring the Use of Foundation Models for Named Entity Recognition and Lemmatization Tasks in Slavic Languages

no code yet • 11 Apr 2023

This paper describes Adam Mickiewicz University's (AMU) solution for the 4th Shared Task on SlavNER.
