The UrduDoc Dataset is a benchmark dataset for Urdu text line detection in scanned documents. It was created as a byproduct of the UTRSet-Real dataset generation process. It comprises 478 images collected from varied sources such as books, documents, manuscripts, and newspapers, split into 358 pages for training and 120 pages for validation, and it covers a wide range of styles, scales, and lighting conditions. It serves as a benchmark for evaluating printed Urdu text detection models, and benchmark results for state-of-the-art models are provided. Among these, the Contour-Net model achieves the best h-mean.
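In text detection benchmarks, h-mean conventionally denotes the harmonic mean of precision and recall (the F1 score). The following is a minimal sketch under the assumption that the UrduDoc benchmark follows this convention. The function name `h_mean` and the input values are illustrative only and are not taken from the dataset's evaluation code or reported results.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (equivalent to the F1 score)."""
    # Guard against division by zero when a detector finds nothing correct.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not results from the UrduDoc benchmark.
print(h_mean(0.5, 0.5))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot score well by trading away recall for precision or the reverse.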

The UrduDoc dataset is the first of its kind for printed Urdu text line detection and is intended to advance research in the field. It will be made publicly available for non-commercial, academic, and research purposes upon request and execution of a no-cost license agreement. To request the dataset, or for more information about the UrduDoc, UTRSet-Real, and UTRSet-Synth datasets, please refer to the Project Website of our paper "UTRNet: High-Resolution Urdu Text Recognition In Printed Documents".

License


  • CC BY-NC-ND
