Lemmatization

61 papers with code • 0 benchmarks • 3 datasets

Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. For languages with rich morphology in particular, it is important to be able to normalize words into their base forms, for example to better support search engines and linguistic studies. The main difficulties in lemmatization arise from encountering previously unseen words at inference time and from disambiguating surface forms that can be inflected variants of several different base forms depending on the context.

Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
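
As a rough illustration of the two difficulties named in the description above (ambiguous surface forms and previously unseen words), here is a minimal Python sketch. It is not the method of any paper listed on this page: the `LEXICON` entries, the `SUFFIX_RULES` list, and the `lemmatize` function are all hypothetical, and the part-of-speech tag is assumed to be supplied by some upstream tagger.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon mapping (surface form, POS tag) -> lemma.
# The POS tag stands in for "context": it disambiguates forms such as
# "saw", which can be an inflected form of "see" or the noun "saw".
LEXICON: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("mice", "NOUN"): "mouse",
}

# Hypothetical suffix-stripping rules, tried in order, used only as a
# fallback for words missing from the lexicon (the "unseen word" case).
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def lemmatize(token: str, pos: Optional[str] = None) -> str:
    """Return a lemma for `token`, using `pos` to disambiguate if given."""
    form = token.lower()

    # 1. Context-aware lookup: only possible when a POS tag is available.
    if pos is not None and (form, pos) in LEXICON:
        return LEXICON[(form, pos)]

    # 2. Fallback for unseen words: apply the first matching suffix rule,
    #    but only if at least three characters of the stem would remain.
    for suffix, replacement in SUFFIX_RULES:
        if form.endswith(suffix) and len(form) > len(suffix) + 2:
            return form[: len(form) - len(suffix)] + replacement

    # 3. No rule applies: return the lowercased form unchanged.
    return form


if __name__ == "__main__":
    print(lemmatize("saw", "VERB"))
    print(lemmatize("saw", "NOUN"))
    print(lemmatize("studies"))
```

The same surface form `saw` yields different lemmas depending on the tag, while `studies` is handled by the fallback rules because it is absent from the toy lexicon. Hand-written rules like these break quickly on irregular or morphologically rich forms, which is one reason learned sequence-to-sequence lemmatizers are studied.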

Benchmarks

These leaderboards are used to track progress in Lemmatization

No evaluation results yet.

Libraries

Use these libraries to find Lemmatization models and implementations

huspacy/huspacy

3 papers

148

Datasets

Latest papers

Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish

computationalstylistics/pl_lemmatization_in_attribution • 5 Jun 2022

In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization.


HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

huspacy/huspacy • 6 Jan 2022

Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications.

148


ELIT: Emory Language and Information Toolkit

emorynlp/elit • 8 Sep 2021

We introduce ELIT, the Emory Language and Information Toolkit, which is a comprehensive NLP framework providing transformer-based end-to-end models for core tasks with a special focus on memory efficiency while maintaining state-of-the-art accuracy and speed.


Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

mikahama/murre • JEP/TALN/RECITAL 2021

Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century.

07 Jul 2021


Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered

mikahama/uralicNLP • NoDaLiDa 2021

We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages.

26 May 2021


Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources

501Good/lexicon-enhanced-lemmatization • EACL 2021

We also compare with other methods of integrating external data into lemmatization and show that our enhanced system performs considerably better than a simple lexicon extension method based on the Stanza system, and it achieves complementary improvements w.r.t.

28 Jan 2021


DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks

arifusta/DBTagger • 11 Jan 2021

In the pipeline, one of the most critical and challenging problems is keyword mapping; constructing a mapping between tokens in the query and relational database elements (tables, attributes, values, etc.).


Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing

nlp-uoregon/trankit • EACL 2021

Finally, we create a demo video for Trankit at: https://youtu.be/q0KGP3zGjGc.

711

09 Jan 2021


The Role of Interpretable Patterns in Deep Learning for Morphology

juditacs/deep-morphology • 8 Dec 2020

By training the models on the same source but different target, we can compare what subwords are important for different tasks and how they relate to each other.


TopicModel4J: A Java Package for Topic Models

soberqian/TopicModel4J • 28 Oct 2020

Topic models provide a flexible and principled framework for exploring hidden structure in high-dimensional co-occurrence data and are commonly used in natural language processing (NLP) of text.

