Lemmatization

61 papers with code • 0 benchmarks • 3 datasets

Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. Especially for languages with rich morphology, it is important to be able to normalize words into their base forms to better support, for example, search engines and linguistic studies. The main difficulties in lemmatization arise from encountering previously unseen words at inference time and from disambiguating surface forms that can be inflected variants of several different base forms depending on the context.

Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
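
To make the two difficulties named in the description concrete, here is a minimal Python sketch. It is not the method of any paper listed on this page: the lexicon, the suffix rules, and the function name `lemmatize` are all hypothetical and chosen only for illustration. Ambiguous surface forms are resolved with a part-of-speech tag standing in for context, and unseen words fall back to crude suffix stripping.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: (lowercased surface form, POS tag) -> lemma.
# The POS tag stands in for the sentence context a real system would use.
LEXICON: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("mice", "NOUN"): "mouse",
}

# Illustrative fallback rules for unseen words: (suffix, replacement).
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def lemmatize(form: str, pos: str) -> str:
    """Return a lemma for `form`, using `pos` to disambiguate."""
    word = form.lower()
    # 1. Known (form, POS) pairs: the tag selects among candidate lemmas.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]
    # 2. Unseen words: apply the first matching suffix rule, keeping
    #    at least three characters of the stem.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    # 3. Nothing applies: return the lowercased form unchanged.
    return word


print(lemmatize("saw", "VERB"))      # see
print(lemmatize("saw", "NOUN"))      # saw
print(lemmatize("studies", "NOUN"))  # study
print(lemmatize("running", "VERB"))  # runn
```

The last call shows why hand-written rules generalize poorly to unseen words: the suffix rule yields "runn" rather than "run". Learned models such as those listed below aim to handle such cases from data.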

Most implemented papers

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

stanfordnlp/stanza ACL 2020

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages.

LemmaTag: Jointly Tagging and Lemmatizing for Morphologically-Rich Languages with BRNNs

hyperparticle/LemmaTag 10 Aug 2018

We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings.

Improving Lemmatization of Non-Standard Languages with Joint Learning

emanjavacas/pie NAACL 2019

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword.

Top2Vec: Distributed Representations of Topics

ddangelov/Top2Vec 19 Aug 2020

Distributed representations of documents and words have gained popularity due to their ability to capture semantics of words and documents.

Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

huspacy/huspacy 24 Aug 2023

This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy.

Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation

TickleForce/ancient-greek-datasets 24 Aug 2023

In this work, we use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.

Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

creat89/SummTriver 14 Sep 2012

This paper describes a new method for normalization of words to further reduce the space of representation.

Development of a Hindi Lemmatizer

sainimohit23/hindi-stemmer 24 May 2013

We live in a translingual society; in order to communicate with people from different parts of the world, we need expertise in their respective languages.

Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning

jedgusse/collaborative-authorship 4 Mar 2016

In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization.