MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags). It spans 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, covering more than a decade.
4 PAPERS • 3 BENCHMARKS
The Digitally Generated Numerals (DIGITal) dataset consists of 100,000 image pairs representing digits from 0 to 9. These image pairs include both low- and high-quality versions, at a resolution of 128x128 pixels.
1 PAPER • NO BENCHMARKS YET
The UTRSet-Real dataset is a comprehensive, manually annotated dataset curated for printed Urdu OCR research. It contains over 11,000 printed text line images, each of which has been manually annotated. The dataset is notably diverse, with variations in fonts, text sizes, colours, orientations, lighting conditions, noise, styles, and backgrounds. This diversity mirrors real-world scenarios, making the dataset well suited for training and evaluating models for real-world Urdu text recognition.
1 PAPER • 1 BENCHMARK
The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.
The UrduDoc Dataset is a benchmark dataset for Urdu text line detection in scanned documents. It was created as a byproduct of the UTRSet-Real dataset generation process. It comprises 478 diverse images collected from sources such as books, documents, manuscripts, and newspapers, split into 358 pages for training and 120 pages for validation, and covers a wide range of styles, scales, and lighting conditions. The dataset is intended for evaluating printed Urdu text detection models, and benchmark results for state-of-the-art models are provided. Among them, the Contour-Net model achieves the best h-mean.
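For readers unfamiliar with the metric: in text detection benchmarks, h-mean conventionally denotes the harmonic mean of precision and recall (equivalent to the F1 score). The sketch below assumes that convention applies to the UrduDoc benchmark. The function name and the input values are hypothetical and are not results from the dataset.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (equivalent to the F1 score)."""
    # Guard against division by zero when both inputs are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for illustration only, not UrduDoc results.
score = h_mean(0.5, 1.0)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot score well by maximizing recall at the expense of precision, or the reverse.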