Lemmatization
61 papers with code • 0 benchmarks • 3 datasets
Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. Especially for languages with rich morphology, it is important to be able to normalize words into their base forms to better support, for example, search engines and linguistic studies. The main difficulties in lemmatization arise from previously unseen words encountered at inference time, as well as from disambiguating ambiguous surface forms, which can be inflected variants of several different base forms depending on the context.
Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
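The ambiguity mentioned above can be illustrated with a minimal sketch: the same surface form maps to different lemmas depending on its part of speech. The toy lexicon and POS tags below are hypothetical, for demonstration only, and do not reflect any particular lemmatizer's API.

```python
# Minimal sketch: a POS-conditioned lookup lemmatizer.
# The lexicon is a made-up toy example, not a real resource.
LEXICON = {
    ("saw", "VERB"): "see",    # "I saw it" -> past tense of "see"
    ("saw", "NOUN"): "saw",    # "a rusty saw" -> the tool
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
}

def lemmatize(form, pos):
    """Return the lemma for a (surface form, POS tag) pair;
    unseen words fall back to the lowercased form itself."""
    return LEXICON.get((form.lower(), pos), form.lower())

print(lemmatize("saw", "VERB"))   # -> see
print(lemmatize("saw", "NOUN"))   # -> saw
print(lemmatize("walks", "VERB")) # unseen word: falls back to "walks"
```

Real systems replace the toy lookup with learned models or large lexicons, but the core problem stays the same: the mapping from form to lemma is only well defined given context.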
Benchmarks
These leaderboards are used to track progress in Lemmatization.
Libraries
Use these libraries to find Lemmatization models and implementations.
Latest papers
Evaluating Shortest Edit Script Methods for Contextual Lemmatization
We experiment with seven languages of different morphological complexity, namely, English, Spanish, Basque, Russian, Czech, Turkish and Polish, using multilingual and language-specific pre-trained masked language encoder-only models as a backbone to build our lemmatizers.
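The shortest-edit-script framing that this line of work evaluates treats lemmatization as predicting a small transformation from form to lemma rather than generating the lemma character by character. A hedged sketch of the idea, using Python's standard-library `difflib` (not the paper's actual implementation):

```python
import difflib

def edit_script(form, lemma):
    """Compute a compact edit script that rewrites `form` into `lemma`.
    Each operation is (tag, start, end, replacement_text)."""
    ops = []
    sm = difflib.SequenceMatcher(a=form, b=lemma)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, i2, lemma[j1:j2]))
    return ops

def apply_script(form, ops):
    """Apply an edit script to a surface form, right to left so that
    earlier offsets stay valid."""
    out = form
    for _tag, i1, i2, repl in reversed(ops):
        out = out[:i1] + repl + out[i2:]
    return out

# Scripts are often shared across words ("running"->"run" and
# "walking"->"walk" both delete a suffix), which is what makes
# classifying over scripts attractive.
script = edit_script("running", "run")
print(apply_script("running", script))  # -> run
```

In a contextual lemmatizer, a classifier over a token's encoder representation predicts which script to apply, so the label set is the inventory of scripts seen in training rather than an open vocabulary of lemmas.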
BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer
Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning.
Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy.
Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation
In this work, we use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.
Lexicon and Rule-based Word Lemmatization Approach for the Somali Language
Lemmatization is a Natural Language Processing (NLP) technique used to normalize text by changing morphological derivations of words to their root forms.
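A lexicon-and-rule approach of the kind this paper describes can be sketched in a few lines: look the word up in a lexicon of irregular forms first, then fall back to ordered suffix-stripping rules. The example below is a hypothetical English illustration (not Somali, and not the paper's rule set).

```python
# Hedged sketch of a lexicon-plus-rules lemmatizer.
# Both the lexicon and the suffix rules are made up for demonstration.
LEXICON = {"children": "child", "went": "go"}  # irregular forms

SUFFIX_RULES = [      # (suffix to strip, replacement), tried in order
    ("ies", "y"),     # "studies" -> "study"
    ("sses", "ss"),   # "classes" -> "class"
    ("ing", ""),      # "walking" -> "walk"
    ("ed", ""),       # "jumped" -> "jump"
    ("s", ""),        # "cats"   -> "cat"
]

def lemmatize(word):
    word = word.lower()
    if word in LEXICON:                    # 1. exact lexicon lookup
        return LEXICON[word]
    for suffix, repl in SUFFIX_RULES:      # 2. first matching suffix rule
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + repl
    return word                            # 3. fallback: unchanged

print(lemmatize("children"))  # -> child
print(lemmatize("studies"))   # -> study
```

Rule order matters: more specific suffixes must be tried before shorter ones, and the length guard keeps very short words from being stripped to nothing.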
Hybrid lemmatization in HuSpaCy
Lemmatization is still not a trivial task for morphologically rich languages.
Exploring Large Language Models for Classical Philology
While prior work on Classical languages unanimously uses BERT, in this work we create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages: we explore (i) encoder-only and encoder-decoder architectures using RoBERTa and T5 as strong model types, and create for each of them (ii) a monolingual Ancient Greek and a multilingual instance that includes Latin and English.
BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer
After training, we also separate the language model, which we call the reader, from the retriever components, and show that this can be fine-tuned on a range of downstream tasks.
Transformers on Multilingual Clause-Level Morphology
While transformer architectures with data augmentation achieved the most promising results for inflection and reinflection tasks, prefix-tuning on mGPT received the highest results for the analysis task.
Knowledge Authoring with Factual English
Unfortunately, at present, extraction of logical facts from unrestricted natural language is still too inaccurate to be used for reasoning, while restricting the grammar of the language (so-called controlled natural language, or CNL) is hard for the users to learn and use.