TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Optical Character Recognition (OCR)	SUT	Tesseract	Character Error Rate (CER)	0.083	# 1
Optical Character Recognition (OCR)	SUT	EasyOCR	Character Error Rate (CER)	0.072	# 2
Document Image Classification	SUT	CNN	Accuracy	86%	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/sut-a-new-multi-purpose-synthetic-dataset-for/optical-character-recognition-ocr-on-sut)](https://paperswithcode.com/sota/optical-character-recognition-ocr-on-sut?p=sut-a-new-multi-purpose-synthetic-dataset-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/sut-a-new-multi-purpose-synthetic-dataset-for/document-image-classification-on-sut)](https://paperswithcode.com/sota/document-image-classification-on-sut?p=sut-a-new-multi-purpose-synthetic-dataset-for)`

SUT: a new multi-purpose synthetic dataset for Farsi document image analysis

13th International Conference on Computer and Knowledge Engineering (ICCKE) 2023 · Elham Shabaninia, Fatemeh sadat Eslami, Ali Afkari Fahandari, Hossein Nezamabadi-pour

This paper introduces a new large-scale dataset for Farsi document images, named SUT, which aims to tackle the challenges associated with obtaining diverse and substantial ground-truth data for supervised models in document image analysis (DIA) tasks, such as document image classification, text detection and recognition, and information retrieval. The dataset comprises 62,453 images that have been categorized into 21 distinct classes, including identity documents featuring synthetically generated personal information superimposed on various backgrounds. The dataset also includes corresponding files with labeling information for the images. The ground-truth data is organized in CSV files containing compiled image file paths and associated information about the embedded data. To demonstrate the efficacy of the SUT dataset in DIA tasks, it was utilized for document classification (achieving an accuracy of 86% using a convolutional neural network) and OCR (achieving a CER of 0.083 and 0.072 using Tesseract and EasyOCR engines, respectively). The SUT dataset represents a valuable resource for researchers who are interested in developing and evaluating supervised models in Farsi document image analysis.
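The OCR results above are reported as Character Error Rate (CER). The abstract does not spell out how CER was computed, so the sketch below assumes the common definition: the Levenshtein (edit) distance between the recognized text and the ground-truth text, divided by the number of characters in the ground truth. The function names `edit_distance` and `cer` are illustrative and are not taken from the paper or its repository.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate under the assumed definition:
    edit distance divided by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character in a four-character Farsi word.
print(cer("سلام", "سلاح"))
```

Under this definition a lower CER is better, and a value such as 0.072 corresponds to roughly 7 character errors per 100 ground-truth characters. Python strings are compared per Unicode code point here; whether the paper normalizes Farsi text (for example, unifying character variants or stripping diacritics) before scoring is not stated, so results from this sketch may differ from the reported numbers.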

Code

aliiafkari/SUT_Dataset

Tasks

Document Classification

Document Image Classification

Image Classification

Information Retrieval

Optical Character Recognition (OCR)

Retrieval

Text Detection

Datasets

SUT

Results from the Paper

Ranked #1 on Optical Character Recognition (OCR) on SUT
