Lemmatization
61 papers with code • 0 benchmarks • 3 datasets
Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. For languages with rich morphology in particular, normalizing words to their base forms is important for applications such as search engines and linguistic studies. The main difficulties in lemmatization are handling previously unseen words at inference time and disambiguating surface forms that, depending on context, can be inflected variants of several different base forms.
Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
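The two difficulties named above (ambiguous surface forms and unseen words) can be illustrated with a minimal sketch. This is not the method of the cited paper, which uses a sequence-to-sequence model; it is a toy lookup lemmatizer under stated assumptions. The lexicon entries, the suffix rules, and the `lemmatize` function are all hypothetical and chosen only for illustration, with a part-of-speech tag standing in for context.

```python
from typing import Optional

# Toy lexicon (hypothetical entries): (surface form, POS tag) -> lemma.
# The POS tag stands in for sentence context when a form is ambiguous.
LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("mice", "NOUN"): "mouse",
}

# Fallback suffix rules (hypothetical) for words missing from the lexicon.
# Each rule is (suffix, replacement); rules are tried in order.
SUFFIX_RULES = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def lemmatize(word: str, pos: Optional[str] = None) -> str:
    """Return a lemma for `word`, using `pos` to resolve ambiguous forms."""
    form = word.lower()

    # 1. Known form: the (form, POS) pair selects among candidate lemmas.
    if (form, pos) in LEXICON:
        return LEXICON[(form, pos)]

    # 2. Unseen form: strip the first matching suffix, but only if at
    #    least three characters of the original form remain before it.
    for suffix, replacement in SUFFIX_RULES:
        if form.endswith(suffix) and len(form) > len(suffix) + 2:
            return form[: -len(suffix)] + replacement

    # 3. No rule applies: return the lowercased form unchanged.
    return form


if __name__ == "__main__":
    print(lemmatize("saw", "VERB"))      # ambiguous form, verb reading
    print(lemmatize("saw", "NOUN"))      # ambiguous form, noun reading
    print(lemmatize("studies", "NOUN"))  # unseen form, suffix fallback
```

Hand-written rules like these break down quickly for morphologically rich languages, which is one reason the papers listed on this page turn to learned models.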
Latest papers
Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish
In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization.
HuSpaCy: an industrial-strength Hungarian natural language processing toolkit
Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications.
ELIT: Emory Language and Information Toolkit
We introduce ELIT, the Emory Language and Information Toolkit, which is a comprehensive NLP framework providing transformer-based end-to-end models for core tasks with a special focus on memory efficiency while maintaining state-of-the-art accuracy and speed.
Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography
Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century.
Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages.
Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources
We also compare with other methods of integrating external data into lemmatization and show that our enhanced system performs considerably better than a simple lexicon extension method based on the Stanza system, and it achieves complementary improvements w.r.t.
DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks
In the pipeline, one of the most critical and challenging problems is keyword mapping: constructing a mapping between tokens in the query and relational database elements (tables, attributes, values, etc.).
Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing
Finally, we create a demo video for Trankit at: https://youtu.be/q0KGP3zGjGc.
The Role of Interpretable Patterns in Deep Learning for Morphology
By training the models on the same source but different target, we can compare what subwords are important for different tasks and how they relate to each other.
TopicModel4J: A Java Package for Topic Models
Topic models provide a flexible and principled framework for exploring hidden structure in high-dimensional co-occurrence data and are commonly used in natural language processing (NLP) of text.