The UTRSet-Real dataset is a comprehensive, manually annotated dataset curated for printed Urdu OCR research. It contains over 11,000 printed text line images, each of which has been meticulously annotated. A standout feature of the dataset is its diversity: it includes variations in fonts, text sizes, colours, orientations, lighting conditions, noise, styles, and backgrounds. This diversity closely mirrors real-world conditions, making the dataset well suited to training and evaluating models for real-world Urdu text recognition.

The UTRSet-Real dataset addresses the scarcity of comprehensive real-world printed Urdu OCR datasets. By giving researchers a resource for developing and benchmarking Urdu OCR models, it promotes standardized evaluation and reproducibility and fosters progress in Urdu OCR. To complement UTRSet-Real for training, we also present UTRSet-Synth, a high-quality synthetic dataset that closely resembles real-world Urdu text. For more details about the UTRSet-Real and UTRSet-Synth datasets, please refer to the paper "UTRNet: High-Resolution Urdu Text Recognition In Printed Documents".

License


  • CC BY-NC-ND
