no code implementations • NLP4DH (ICON) 2021 • Niko Partanen, Jack Rueter, Khalid Alnajjar, Mika Hämäläinen
The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813–1852).
no code implementations • ComputEL (ACL) 2022 • Khalid Alnajjar, Mika Hämäläinen, Niko Tapio Partanen, Jack Rueter
Many endangered Uralic languages have multilingual machine readable dictionaries saved in an XML format.
no code implementations • WS (NoDaLiDa) 2019 • Jeff Ens, Mika Hämäläinen, Jack Rueter, Philippe Pasquier
Endangered Uralic languages present a high variety of inflectional forms in their morphology.
no code implementations • ACL (LChange) 2021 • Niko Partanen, Khalid Alnajjar, Mika Hämäläinen, Jack Rueter
In this study, we have normalized and lemmatized an Old Literary Finnish corpus using a lemmatization model trained on texts from Agricola.
no code implementations • 24 May 2023 • Khalid Alnajjar, Mika Hämäläinen, Jack Rueter
Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied to endangered language data through the aligned word embeddings.
no code implementations • 28 Dec 2021 • Niko Partanen, Jack Rueter, Mika Hämäläinen, Khalid Alnajjar
The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813–1852).
no code implementations • WNUT (ACL) 2021 • Mika Hämäläinen, Pattama Patpong, Khalid Alnajjar, Niko Partanen, Jack Rueter
We present the first openly available corpus for detecting depression in Thai.
1 code implementation • EMNLP 2021 • Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, Jack Rueter
Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice.
no code implementations • NAACL (NLP4IF) 2021 • Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, Jack Rueter
However, a model fine-tuned on Multilingual BERT reaches the best factual label accuracy of 97.2%.
no code implementations • NAACL (AmericasNLP) 2021 • Jack Rueter, Marília Fernanda Pereira de Freitas, Sidney da Silva Facundes, Mika Hämäläinen, Niko Partanen
The construction of the treebank has also served as an opportunity to develop a finite-state description of the language and facilitate the transfer of open-source infrastructure possibilities to an endangered language of the Amazon.
1 code implementation • NoDaLiDa 2021 • Mika Hämäläinen, Niko Partanen, Jack Rueter, Khalid Alnajjar
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages.
1 code implementation • COLING 2020 • Khalid Alnajjar, Mika Hämäläinen, Jack Rueter, Niko Partanen
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors.
2 code implementations • 11 Nov 2020 • Jack Rueter, Mika Hämäläinen, Niko Partanen
This document describes the shared development of a finite-state description of two closely related but endangered minority languages, Erzya and Moksha.
1 code implementation • 11 Oct 2020 • Khalid Alnajjar, Mika Hämäläinen, Niko Partanen, Jack Rueter
This study uses a character-level neural machine translation approach trained on a long short-term memory-based bi-directional recurrent neural network architecture for diacritization of Medieval Arabic.
1 code implementation • 6 Sep 2020 • Mika Hämäläinen, Niko Partanen, Khalid Alnajjar, Jack Rueter, Thierry Poibeau
The models are tested with over 20 different dialects.
1 code implementation • LREC 2020 • Jack Rueter, Mika Hämäläinen
We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami.
1 code implementation • The sixth biennial conference on electronic lexicography, eLex 2019 • Mika Hämäläinen, Jack Rueter
This makes it possible to integrate the system with the existing open-source Giellatekno infrastructure that provides and utilizes XML formatted dictionaries for use in a variety of NLP tasks.
1 code implementation • WS 2019 • Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, Eetu Mäkelä
This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus.
no code implementations • COLING 2018 • Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, Eetu Mäkelä
This paper presents multiple methods for normalizing the most deviant and infrequent historical spellings in a corpus consisting of personal correspondence from the 15th to the 19th century.