Lemmatization

61 papers with code • 0 benchmarks • 3 datasets

Lemmatization is a process of determining a base or dictionary form (lemma) for a given surface form. Especially for languages with rich morphology it is important to be able to normalize words into their base forms to better support for example search engines and linguistic studies. Main difficulties in Lemmatization arise from encountering previously unseen words during inference time as well as disambiguating ambiguous surface forms which can be inflected variants of several different base forms depending on the context.

Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks

Libraries

Use these libraries to find Lemmatization models and implementations
3 papers
148

Evaluating Shortest Edit Script Methods for Contextual Lemmatization

hitz-zentroa/ses-lemma 25 Mar 2024

We experiment with seven languages of different morphological complexity, namely, English, Spanish, Basque, Russian, Czech, Turkish and Polish, using multilingual and language-specific pre-trained masked language encoder-only models as a backbone to build our lemmatizers.

0
25 Mar 2024

BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer

eblict-gigatech/BanLemma 6 Nov 2023

Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning.

0
06 Nov 2023

Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

huspacy/huspacy 24 Aug 2023

This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy.

148
24 Aug 2023

Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation

kevinkrahn/ancient-greek-datasets 24 Aug 2023

In this work, we use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.

6
24 Aug 2023

Lexicon and Rule-based Word Lemmatization Approach for the Somali Language

shafieabdi/somalilemmatizer 3 Aug 2023

Lemmatization is a Natural Language Processing (NLP) technique used to normalize text by changing morphological derivations of words to their root forms.

0
03 Aug 2023

Hybrid lemmatization in HuSpaCy

huspacy/huspacy 13 Jun 2023

Lemmatization is still not a trivial task for morphologically rich languages.

148
13 Jun 2023

Exploring Large Language Models for Classical Philology

heidelberg-nlp/ancient-language-models 23 May 2023

While prior work on Classical languages unanimously uses BERT, in this work we create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages: we explore (i) encoder-only and encoder-decoder architectures using RoBERTa and T5 as strong model types, and create for each of them (ii) a monolingual Ancient Greek and a multilingual instance that includes Latin and English.

18
23 May 2023

BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer

ltgoslo/brent 19 Apr 2023

After training, we also separate the language model, which we call the reader, from the retriever components, and show that this can be fine-tuned on a range of downstream tasks.

3
19 Apr 2023

Transformers on Multilingual Clause-Level Morphology

emrecanacikgoz/mrl2022 3 Nov 2022

While transformer architectures with data augmentation achieved the most promising results for inflection and reinflection tasks, prefix-tuning on mGPT received the highest results for the analysis task.

4
03 Nov 2022

Knowledge Authoring with Factual English

yuhengwang1/kalm-fl 5 Aug 2022

Unfortunately, at present, extraction of logical facts from unrestricted natural language is still too inaccurate to be used for reasoning, while restricting the grammar of the language (so-called controlled natural language, or CNL) is hard for the users to learn and use.

0
05 Aug 2022