The Multilingual Reuters Collection dataset comprises over 11,000 articles from six classes in five languages, i.e., English (E), French (F), German (G), Italian (I), and Spanish (S).
13 PAPERS • 1 BENCHMARK
The DAWT dataset consists of Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of the entity. The data set contains total of 13.6M articles, 5.0B tokens, 13.8M mention entity co-occurrences. DAWT contains 4.8 times more anchor text to entity links than originally present in the Wikipedia markup. Moreover, it spans several languages including English, Spanish, Italian, German, French and Arabic.
3 PAPERS • NO BENCHMARKS YET
The dataset contains the main components of the news articles published online by the newspaper named <a href="https://gazzettadimodena.gelocal.it/modena">Gazzetta di Modena</a>: url of the web page, title, sub-title, text, date of publication, crime category assigned to each news article by the author.
A general purpose text categorization dataset (NatCat) from three online resources: Wikipedia, Reddit, and Stack Exchange. These datasets consist of document-category pairs derived from manual curation that occurs naturally by their communities.
1 PAPER • NO BENCHMARKS YET
About the MNAD Dataset The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.
0 PAPER • NO BENCHMARKS YET
Este conjunto de datos consiste en comentarios de publicaciones del MINSA (Perú) en Facebook sobre la vacuna contra el VPH entre los años 2019 y 2020. Se leyó cuidadosamente cada uno de los comentarios, luego se procedió a clasificarlos de manera manual. Para esta clasificación se interpretó los mensajes de las personas, por lo que se analizó los hilos (comentarios y respuestas) por separado y se procedió a etiquetarlos por temas "Topic" . Un profesional de salud realizó una segunda clasificación y las discrepancias se resolvieron con un tercer profesional. Luego, se seleccionaron subcategorías que hacían referencia directa a las vacunas contra el VPH. La clasificación se realizó utilizando las siguientes categorías "topic_c" :