Search Results for author: Taja Kuzman

Found 7 papers, 1 paper with code

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

no code implementations • EAMT 2022 • Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages.

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

1 code implementation • 8 Apr 2024 • Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman, Rik van Noord

The world of language models is going through turbulent times: better and ever larger models are coming out at an unprecedented speed.

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

no code implementations • 19 Mar 2024 • Nikola Ljubešić, Taja Kuzman

This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, thereby covering the whole spectrum of official languages in the South Slavic language space.

Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

no code implementations • 13 Mar 2024 • Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral

Large, curated, web-crawled corpora play a vital role in training language models (LMs).

ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification

no code implementations • 7 Mar 2023 • Taja Kuzman, Igor Mozetič, Nikola Ljubešić

Results show that ChatGPT outperforms the fine-tuned model when applied to the dataset which was not seen before by either of the models.

Language Modelling, Text Classification

The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild

no code implementations • LREC 2022 • Taja Kuzman, Peter Rupnik, Nikola Ljubešić

This paper presents a new training dataset for automatic genre identification GINCO, which is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words.

Neural Machine Translation of Literary Texts from English to Slovene

no code implementations • WS 2019 • Taja Kuzman, Špela Vintar, Mihael Arčan

Machine Translation, Translation
