no code implementations • WMT (EMNLP) 2020 • Miquel Esplà-Gomis, Víctor M. Sánchez-Cartagena, Jaume Zaragoza-Bernabeu, Felipe Sánchez-Martínez
This paper describes the joint submission of Universitat d’Alacant and Prompsit Language Engineering to the WMT 2020 shared task on parallel corpus filtering.
no code implementations • HumEval (ACL) 2022 • Gema Ramírez-Sánchez, Marta Bañón, Jaume Zaragoza-Bernabeu, Sergio Ortiz Rojas
Quality assessment has been an ongoing activity of the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages.
1 code implementation • EAMT 2020 • Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón, Sergio Ortiz Rojas
This paper shows the utility of two open-source tools designed for parallel data cleaning: Bifixer and Bicleaner.
1 code implementation • LREC 2022 • Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Marta Bañón, Sergio Ortiz Rojas
This paper describes the experiments carried out during the development of the latest version of Bicleaner, named Bicleaner AI, a tool that aims at detecting noisy sentences in parallel corpora.
no code implementations • 12 Apr 2024 • Marta Bañón, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Sergio Ortiz-Rojas
Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts.
no code implementations • 20 Mar 2024 • Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer Van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive.
2 code implementations • 24 Nov 2023 • Nikolay Bogoychev, Jelmer Van der Linde, Graeme Nail, Barry Haddow, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Lukas Weymann, Tudor Nicolae Mateiu, Jindřich Helcl, Mikko Aulamo
Developing high-quality machine translation systems is a labour-intensive, challenging and confusing process for newcomers to the field.