Lemmatization

61 papers with code • 0 benchmarks • 3 datasets

Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. Normalizing words into their base forms is especially important for languages with rich morphology, where it supports applications such as search engines and linguistic studies. The main difficulties in lemmatization arise from encountering previously unseen words at inference time and from disambiguating surface forms that, depending on the context, can be inflected variants of several different base forms.

Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
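The two difficulties named in the description, unseen words and ambiguous surface forms, can be illustrated with a minimal sketch. This is not the method of the cited paper, which uses a sequence-to-sequence model. The lexicon entries, the suffix rules, and the `lemmatize` function below are hypothetical and chosen only for illustration: a part-of-speech tag stands in for context, and crude suffix stripping stands in for generalization to unseen words.

```python
from typing import Optional

# Toy lexicon (hypothetical): (surface form, POS tag) -> lemma.
# The same surface form maps to different lemmas depending on the tag.
LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("better", "ADJ"): "good",
}

# Fallback rules (hypothetical) for forms missing from the lexicon:
# (suffix to strip, replacement), tried in order.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def lemmatize(form: str, pos: Optional[str] = None) -> str:
    """Return a lemma for `form`, using `pos` to disambiguate when given."""
    word = form.lower()
    # 1. Context-aware lookup: the POS tag selects among candidate lemmas.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]
    # 2. Unseen word: fall back to suffix stripping.
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least three characters to limit over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    # 3. No rule applies: return the lowercased form unchanged.
    return word
```

With these toy resources, `lemmatize("saw", "VERB")` returns `"see"` while `lemmatize("saw", "NOUN")` returns `"saw"`, and the out-of-lexicon form `"studies"` falls through to the suffix rules and becomes `"study"`. Real systems replace both the hand-written lexicon and the rules with models trained on annotated treebanks.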

Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish

computationalstylistics/pl_lemmatization_in_attribution 5 Jun 2022

In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization.

0

HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

huspacy/huspacy 6 Jan 2022

Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications.

148

ELIT: Emory Language and Information Toolkit

emorynlp/elit 8 Sep 2021

We introduce ELIT, the Emory Language and Information Toolkit, which is a comprehensive NLP framework providing transformer-based end-to-end models for core tasks with a special focus on memory efficiency while maintaining state-of-the-art accuracy and speed.

36

Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

mikahama/murre JEP/TALN/RECITAL 2021

Texts written in Old Literary Finnish represent the first literary works ever written in Finnish, starting from the 16th century.

21
07 Jul 2021

Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered

mikahama/uralicNLP NoDaLiDa 2021

We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages.

71
26 May 2021

Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources

501Good/lexicon-enhanced-lemmatization EACL 2021

We also compare with other methods of integrating external data into lemmatization and show that our enhanced system performs considerably better than a simple lexicon extension method based on the Stanza system, and it achieves complementary improvements w.r.t.

0
28 Jan 2021

DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks

arifusta/DBTagger 11 Jan 2021

In the pipeline, one of the most critical and challenging problems is keyword mapping: constructing a mapping between tokens in the query and relational database elements (tables, attributes, values, etc.).

4

Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing

nlp-uoregon/trankit EACL 2021

Finally, we create a demo video for Trankit at: https://youtu.be/q0KGP3zGjGc.

711
09 Jan 2021

The Role of Interpretable Patterns in Deep Learning for Morphology

juditacs/deep-morphology 8 Dec 2020

By training the models on the same source but different target, we can compare what subwords are important for different tasks and how they relate to each other.

1

TopicModel4J: A Java Package for Topic Models

soberqian/TopicModel4J 28 Oct 2020

Topic models provide a flexible and principled framework for exploring hidden structure in high-dimensional co-occurrence data and are commonly used in natural language processing (NLP) of text.

27