Belfort Dataset | Papers With Code

# The Belfort dataset

This dataset contains minutes of the Belfort municipal council drawn up between 1790 and 1946. Documents include deliberations, lists of councillors, convocations, and agendas. The dataset comprises 24,105 text-line images that were automatically detected from the pages. Up to four transcriptions are available for each line image: two from humans and two from automatic models.

Files are organized in three folders: `Images`, `Transcriptions`, and `Partitions`.

## Images

The dataset includes 24,105 text-line images that were automatically detected using a generic [Doc-UFCN](https://pypi.org/project/doc-ufcn/) model and resized to a fixed height of 128 pixels.
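To bring other line crops to the same scale, the target width for a 128-pixel height can be computed as in the minimal sketch below. It assumes the aspect ratio is preserved, which the description does not state explicitly; the function name `resized_width` is hypothetical.

```python
def resized_width(width: int, height: int, target_height: int = 128) -> int:
    """Width of an image after scaling it to `target_height`.

    Assumption: the aspect ratio is kept. The dataset description only
    states that the height is fixed at 128 pixels.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale the width by the same factor as the height, never below 1 pixel.
    return max(1, round(width * target_height / height))
```

The resulting `(resized_width(w, h), 128)` pair can then be passed to the resize routine of any image library.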

## Transcriptions

Up to 4 transcriptions are available for each image (two human, two automatic). The following table summarizes them, together with the outputs of two aggregation algorithms:

|   Folder   | N transcriptions | Description                 | Comments                                                                  |
|:----------:|-----------------:|-----------------------------|---------------------------------------------------------------------------|
| callico_1/ |           24,105 | Human annotation n°1        | All lines have at least one human annotation                              |
| callico_2/ |            8,878 | Human annotation n°2        | Only 33% of lines have two different human annotations                    |
| dan/       |           24,102 | DAN automatic model         | 3 images have empty transcriptions (no text was predicted by the model)   |
| pylaia/    |           23,536 | PyLaia automatic model      | 569 images have empty transcriptions (no text was predicted by the model) |
| rasa/      |           23,287 | RASA aggregation algorithm  | 818 images have empty transcriptions                                      |
| rover/     |           24,104 | ROVER aggregation algorithm | 1 image has an empty transcription                                        |
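A possible way to gather every available transcription for one line is sketched below. Only the folder names come from the table; the file layout (one UTF-8 `.txt` file per line image, named after the image identifier) and the helper name `load_transcriptions` are assumptions that should be adapted to the actual archive.

```python
from pathlib import Path

# Folder names are taken from the table above. The per-line ".txt" layout
# is an assumption, not something the dataset description specifies.
SOURCES = ["callico_1", "callico_2", "dan", "pylaia", "rasa", "rover"]


def load_transcriptions(root, line_id):
    """Return {source: text} for each source with a non-empty transcription."""
    found = {}
    for source in SOURCES:
        path = Path(root) / source / f"{line_id}.txt"
        if not path.is_file():
            continue  # e.g. no second human annotation for this line
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty predictions
            found[source] = text
    return found
```

Returning a dictionary keyed by source keeps missing and empty transcriptions out of the result, so callers can check `"callico_2" in result` to know whether a second human annotation exists.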

## Data partition

We provide two distinct splits, both of them containing 19,013 training images, 2,262 validation images and 2,830 test images.

* The *Agreement-based split* ensures the reliability of the test set:
    * The test set includes lines with perfect agreement between human annotators (Character Error Rate = 0%);
    * The validation set includes lines with good agreement between human annotators (0% < Character Error Rate < 5%);
    * The training set includes all the other lines.
* The *Random split* assigns lines to the training, validation, and test sets at random.
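The agreement thresholds above can be illustrated with a short sketch. It assumes the Character Error Rate is the Levenshtein distance between the two human annotations divided by the length of the first one; which annotation serves as the reference, and how the published split reaches its exact set sizes, are not specified here. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate of `hypothesis` against `reference`."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def assign_split(first: str, second=None) -> str:
    """Map a line to a split from the agreement of its human annotations."""
    if second is None:
        return "train"  # a single annotation: no agreement can be measured
    rate = cer(first, second)  # assumption: first annotation is the reference
    if rate == 0:
        return "test"
    if rate < 0.05:
        return "val"
    return "train"
```

For example, two identical annotations yield `"test"`, while a single differing character in a 40-character line (CER = 2.5%) yields `"val"`.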

## Evaluation

Evaluation results in the paper are computed by comparing predictions to human annotations.
Automatic and aggregated transcriptions are only used during model training.

---

Belfort (The Belfort dataset: Handwritten Text Recognition from Crowdsourced Annotations)
