UTRSet-Synth Dataset | Papers With Code

Name:*

Full name (optional):

Description (Markdown and $\LaTeX$ enabled):*

The **UTRSet-Synth** dataset is introduced as a complementary training resource to the [**UTRSet-Real** Dataset](https://paperswithcode.com/dataset/utrset-real), specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.

To generate the dataset, a custom-designed synthetic data generation module which offers precise control over variations in crucial factors such as font, text size, colour, resolution, orientation, noise, style, and background, was employed. Moreover, the UTRSet-Synth dataset tackles the limitations observed in existing datasets. It addresses the challenge of standardizing fonts by incorporating over 130 diverse Urdu fonts, which were thoroughly refined to ensure consistent rendering schemes. It overcomes the scarcity of Arabic words, numerals, and Urdu digits by incorporating a significant number of samples representing these elements. Additionally, the dataset is enriched by randomly selecting words from a vocabulary of 100,000 words during the text generation process. As a result, UTRSet-Synth contains a total of 28,187 unique words, with an average word length of 7 characters.

The availability of the UTRSet-Synth dataset, a synthetic dataset that closely emulates real-world variations, addresses the scarcity of comprehensive real-world printed Urdu OCR datasets. By providing researchers with a valuable resource for developing and benchmarking Urdu OCR models, this dataset promotes standardized evaluation, and reproducibility, and fosters advancements in the field of Urdu OCR. For more information and details about the [UTRSet-Real](https://paperswithcode.com/dataset/utrset-real) & [UTRSet-Synth](https://paperswithcode.com/dataset/utrset-synth) datasets, please refer to the paper ["UTRNet: High-Resolution Urdu Text Recognition In Printed Documents"](https://arxiv.org/abs/2306.15782)

Homepage URL (optional):

Paper where the dataset was introduced:

Introduction date:

Dataset license:

URL to full license terms:

Image

Currently

datasets/51e62a1b-e2a9-426c-8853-72fd534468e9.png Clear

Change

---

UTRSet-Synth

Benchmarks

Add a new result Link an existing benchmark

Papers

Dataset Loaders

Add Remove

Tasks

Similar Datasets

UrduDoc

UTRSet-Real

Usage

License

Modalities

Languages

UTRSet-Synth

Benchmarks Edit Add a new result Link an existing benchmark

Papers

Dataset Loaders Edit Add Remove

Tasks Edit