Lemmatization

61 papers with code • 0 benchmarks • 3 datasets

Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. For languages with rich morphology in particular, normalizing words to their base forms is important for applications such as search engines and linguistic studies. The main difficulties in lemmatization are encountering previously unseen words at inference time and disambiguating surface forms that can be inflected variants of several different base forms depending on the context.

Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
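
The two difficulties named above (ambiguous surface forms and unseen words) can be illustrated with a minimal sketch. It is not the method of the cited paper, which uses a sequence-to-sequence model. The lexicon, the suffix rules, and the `lemmatize` function are hypothetical toy constructs, and the part-of-speech tag stands in for the context a real system would use.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: (surface form, POS tag) -> lemma.
# The same surface form maps to different lemmas depending on the tag.
LEXICON: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("mice", "NOUN"): "mouse",
}

# Hypothetical fallback rules for words missing from the lexicon.
# Each rule is (suffix to strip, replacement); rules are tried in order.
SUFFIX_RULES: Dict[str, List[Tuple[str, str]]] = {
    "NOUN": [("ies", "y"), ("s", "")],
    "VERB": [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")],
}


def lemmatize(form: str, pos: str) -> str:
    """Return a lemma for `form`, using `pos` as a stand-in for context."""
    word = form.lower()
    # 1. Known word: the POS tag disambiguates between candidate lemmas.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]
    # 2. Unseen word: apply the first matching suffix rule, if any.
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    # 3. No rule applies: return the lowercased form unchanged.
    return word


print(lemmatize("saw", "VERB"))      # see
print(lemmatize("saw", "NOUN"))      # saw
print(lemmatize("studies", "NOUN"))  # study (not in the lexicon, rule-based)
```

Hand-written suffix rules like these are brittle for irregular or unseen patterns, which is one reason the task is often approached with learned models instead.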

Latest papers with no code

Comparison of Current Approaches to Lemmatization: A Case Study in Estonian

no code yet • 23 Apr 2024

This study evaluates three lemmatization approaches for Estonian: generative character-level models, pattern-based word-level classification models, and rule-based morphological analysis.

TartuNLP @ SIGTYP 2024 Shared Task: Adapting XLM-RoBERTa for Ancient and Historical Languages

no code yet • 19 Apr 2024

We present our submission to the unconstrained subtask of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages for morphological annotation, POS-tagging, lemmatization, and character- and word-level gap-filling.

Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek

no code yet • 31 Mar 2024

The texts have been enriched with seven annotation layers: (i) tokenization layer; (ii) sentence segmentation layer; (iii) lemmatization layer; (iv) morphological layer; (v) dependency layer; (vi) dependency function layer; (vii) Canonical Text Services (CTS) citation layer.

Cross-lingual Named Entity Corpus for Slavic Languages

no code yet • 30 Mar 2024

The corpus consists of 5 017 documents on seven topics.

ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus

no code yet • 27 Mar 2024

We present ZAEBUC-Spoken, a multilingual multidialectal Arabic-English speech corpus.

The effect of stemming and lemmatization on Portuguese fake news text classification

no code yet • 17 Oct 2023

With the popularization of the internet, smartphones and social media, information is being spread quickly and easily, which implies greater traffic of information in the world, but the dissemination of fake news is a problem that is harming society.

Vacaspati: A Diverse Corpus of Bangla Literature

no code yet • 11 Jul 2023

We also demonstrate the efficacy of Vacaspati as a corpus by showing that similar models built from other corpora are not as effective.

Advancing Full-Text Search Lemmatization Techniques with Paradigm Retrieval from OpenCorpora

no code yet • 18 May 2023

In this paper, we unveil a groundbreaking method to amplify full-text search lemmatization, utilizing the OpenCorpora dataset and a bespoke paradigm retrieval algorithm.

LatinCy: Synthetic Trained Pipelines for Latin NLP

no code yet • 7 May 2023

This paper introduces LatinCy, a set of trained general purpose Latin-language "core" pipelines for use with the spaCy natural language processing framework.

Exploring the Use of Foundation Models for Named Entity Recognition and Lemmatization Tasks in Slavic Languages

no code yet • 11 Apr 2023

This paper describes Adam Mickiewicz University's (AMU) solution for the 4th Shared Task on SlavNER.