Lemmatization
61 papers with code • 0 benchmarks • 3 datasets
Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. Especially for languages with rich morphology, it is important to be able to normalize words into their base forms to better support, for example, search engines and linguistic studies. The main difficulties in lemmatization arise from previously unseen words encountered at inference time, as well as from disambiguating ambiguous surface forms, which can be inflected variants of several different base forms depending on the context.
Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
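The ambiguity mentioned above can be illustrated with a minimal sketch: the same surface form maps to different lemmas depending on its part of speech. The toy lexicon and POS tags below are hypothetical, for demonstration only, and do not reflect any particular lemmatizer's API.

```python
# Minimal sketch: a POS-conditioned lookup lemmatizer.
# The lexicon is a made-up toy example, not a real resource.
LEXICON = {
    ("saw", "VERB"): "see",    # "I saw it" -> past tense of "see"
    ("saw", "NOUN"): "saw",    # "a rusty saw" -> the tool
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
}

def lemmatize(form, pos):
    """Return the lemma for a (surface form, POS tag) pair;
    unseen words fall back to the lowercased form itself."""
    return LEXICON.get((form.lower(), pos), form.lower())

print(lemmatize("saw", "VERB"))   # -> see
print(lemmatize("saw", "NOUN"))   # -> saw
print(lemmatize("walks", "VERB")) # unseen word: falls back to "walks"
```

Real systems replace the toy lookup with learned models or large lexicons, but the core problem stays the same: the mapping from form to lemma is only well defined given context.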
Benchmarks
These leaderboards are used to track progress in Lemmatization.
Libraries
Use these libraries to find Lemmatization models and implementations.
Latest papers
Evaluating Shortest Edit Script Methods for Contextual Lemmatization
We experiment with seven languages of different morphological complexity, namely, English, Spanish, Basque, Russian, Czech, Turkish and Polish, using multilingual and language-specific pre-trained masked language encoder-only models as a backbone to build our lemmatizers.
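The shortest-edit-script framing that this line of work evaluates treats lemmatization as predicting a small transformation from form to lemma rather than generating the lemma character by character. A hedged sketch of the idea, using Python's standard-library `difflib` (not the paper's actual implementation):

```python
import difflib

def edit_script(form, lemma):
    """Compute a compact edit script that rewrites `form` into `lemma`.
    Each operation is (tag, start, end, replacement_text)."""
    ops = []
    sm = difflib.SequenceMatcher(a=form, b=lemma)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, i2, lemma[j1:j2]))
    return ops

def apply_script(form, ops):
    """Apply an edit script to a surface form, right to left so that
    earlier offsets stay valid."""
    out = form
    for _tag, i1, i2, repl in reversed(ops):
        out = out[:i1] + repl + out[i2:]
    return out

# Scripts are often shared across words ("running"->"run" and
# "walking"->"walk" both delete a suffix), which is what makes
# classifying over scripts attractive.
script = edit_script("running", "run")
print(apply_script("running", script))  # -> run
```

In a contextual lemmatizer, a classifier over a token's encoder representation predicts which script to apply, so the label set is the inventory of scripts seen in training rather than an open vocabulary of lemmas.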
BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer
Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning.
Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy.
Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation
In this work, we use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.
Lexicon and Rule-based Word Lemmatization Approach for the Somali Language
Lemmatization is a Natural Language Processing (NLP) technique used to normalize text by changing morphological derivations of words to their root forms.
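A lexicon-and-rule approach of the kind this paper describes can be sketched in a few lines: look the word up in a lexicon of irregular forms first, then fall back to ordered suffix-stripping rules. The example below is a hypothetical English illustration (not Somali, and not the paper's rule set).

```python
# Hedged sketch of a lexicon-plus-rules lemmatizer.
# Both the lexicon and the suffix rules are made up for demonstration.
LEXICON = {"children": "child", "went": "go"}  # irregular forms

SUFFIX_RULES = [      # (suffix to strip, replacement), tried in order
    ("ies", "y"),     # "studies" -> "study"
    ("sses", "ss"),   # "classes" -> "class"
    ("ing", ""),      # "walking" -> "walk"
    ("ed", ""),       # "jumped" -> "jump"
    ("s", ""),        # "cats"   -> "cat"
]

def lemmatize(word):
    word = word.lower()
    if word in LEXICON:                    # 1. exact lexicon lookup
        return LEXICON[word]
    for suffix, repl in SUFFIX_RULES:      # 2. first matching suffix rule
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + repl
    return word                            # 3. fallback: unchanged

print(lemmatize("children"))  # -> child
print(lemmatize("studies"))   # -> study
```

Rule order matters: more specific suffixes must be tried before shorter ones, and the length guard keeps very short words from being stripped to nothing.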
Hybrid lemmatization in HuSpaCy
Lemmatization is still not a trivial task for morphologically rich languages.
Exploring Large Language Models for Classical Philology
While prior work on Classical languages unanimously uses BERT, in this work we create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages: we explore (i) encoder-only and encoder-decoder architectures using RoBERTa and T5 as strong model types, and create for each of them (ii) a monolingual Ancient Greek and a multilingual instance that includes Latin and English.
BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer
After training, we also separate the language model, which we call the reader, from the retriever components, and show that this can be fine-tuned on a range of downstream tasks.
Transformers on Multilingual Clause-Level Morphology
While transformer architectures with data augmentation achieved the most promising results for inflection and reinflection tasks, prefix-tuning on mGPT received the highest results for the analysis task.
Knowledge Authoring with Factual English
Unfortunately, at present, extraction of logical facts from unrestricted natural language is still too inaccurate to be used for reasoning, while restricting the grammar of the language (so-called controlled natural language, or CNL) is hard for the users to learn and use.