Belfort (The Belfort dataset: Handwritten Text Recognition from Crowdsourced Annotations)

The Belfort dataset

This dataset includes minutes of the Belfort municipal council drawn up between 1790 and 1946. The documents include deliberations, lists of councillors, convocations, and agendas. The dataset contains 24,105 text-line images that were automatically detected from the pages. Up to 4 transcriptions are available for each line image: two from humans and two from automatic models.

Files are organized in three folders: Images, Transcriptions, and Partitions.

Images

The dataset includes 24,105 text-line images that were automatically detected using a generic Doc-UFCN model and resized to a fixed height of 128 pixels.

Transcriptions

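Up to 4 transcriptions (two human, two automatic) are available for each image, along with two aggregated transcriptions produced by the RASA and ROVER algorithms, as summarized in the following table: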

| Folder     | Number of transcriptions | Description                | Comments                                                                |
|------------|--------------------------|----------------------------|-------------------------------------------------------------------------|
| callico_1/ | 24,105                   | Human annotation no. 1     | All lines have at least one human annotation                            |
| callico_2/ | 8,878                    | Human annotation no. 2     | Only 33% of lines have two different human annotations                  |
| dan/       | 24,102                   | DAN automatic model        | 3 images have empty transcriptions (no text was predicted by the model) |
| pylaia/    | 23,536                   | PyLaia automatic model     | 569 images have empty transcriptions (no text was predicted by the model) |
| rasa/      | 23,287                   | RASA aggregation algorithm | 818 images have empty transcriptions                                    |
| rover/     | 24,104                   | ROVER aggregation algorithm | 1 image has an empty transcription                                     |
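
The empty-transcription counts in the table follow directly from the 24,105 line images. The short Python sketch below reproduces that arithmetic from the figures above. The dictionary layout and the function name `missing_counts` are illustrative choices, not part of the dataset or its tooling.

```python
# Number of line images in the dataset, as stated above.
TOTAL_LINES = 24_105

# Non-empty transcription counts per folder, copied from the table above.
NON_EMPTY = {
    "callico_1": 24_105,
    "callico_2": 8_878,
    "dan": 24_102,
    "pylaia": 23_536,
    "rasa": 23_287,
    "rover": 24_104,
}


def missing_counts(total, non_empty):
    """Return, per folder, how many line images lack a non-empty transcription."""
    return {folder: total - count for folder, count in non_empty.items()}


missing = missing_counts(TOTAL_LINES, NON_EMPTY)
for folder, n_missing in missing.items():
    print(f"{folder:<10} {n_missing:>6} lines without a transcription")
```

For `callico_2`, the difference is the number of lines that received only one human annotation rather than the number of empty files.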

Data partition

We provide two distinct splits, each containing 19,013 training images, 2,262 validation images and 2,830 test images.

  • The Agreement-based split ensures the reliability of the test set:
    • The test set includes lines with perfect agreement between human annotators (Character Error Rate = 0%);
    • The validation set includes lines with good agreement between human annotators (0% < Character Error Rate < 5%);
    • The training set includes all the other lines.
  • The Random split assigns lines to the training, validation and test sets at random.
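
The agreement-based rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' partitioning code: it treats the first human annotation as the reference when computing the Character Error Rate, applies no text normalization, and sends lines with a single human annotation to the training set. The function names are hypothetical, and the split files shipped in the Partitions folder remain the authoritative source.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate in percent, relative to the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 100.0
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def assign_subset(annotation_1: str, annotation_2: Optional[str]) -> str:
    """Assign a line to 'test', 'val' or 'train' from annotator agreement."""
    if annotation_2 is None:
        # Assumption: lines with one human annotation fall under "all the other lines".
        return "train"
    rate = cer(annotation_1, annotation_2)
    if rate == 0.0:
        return "test"   # perfect agreement
    if rate < 5.0:
        return "val"    # good agreement
    return "train"      # everything else
```

For example, two identical annotations yield `"test"`, while a single differing character in a 31-character line gives a rate of about 3.2% and therefore `"val"`.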

Evaluation

Evaluation results in the paper are computed by comparing predictions to human annotations. Automatic and aggregated transcriptions are only used during model training.
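
As an illustration of this kind of evaluation, the sketch below computes a corpus-level Character Error Rate of predictions against human annotations. It is an assumption-laden example rather than the paper's evaluation script: it micro-averages over all characters, counts a missing prediction as an empty string, and uses hypothetical line identifiers and function names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def corpus_cer(references: dict, predictions: dict) -> float:
    """Micro-averaged CER in percent: total edits divided by total reference characters."""
    total_edits = 0
    total_chars = 0
    for line_id, reference in references.items():
        # Assumption: a line with no prediction is scored as an empty string.
        prediction = predictions.get(line_id, "")
        total_edits += edit_distance(reference, prediction)
        total_chars += len(reference)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return 100.0 * total_edits / total_chars


# Hypothetical line identifiers and texts, for illustration only.
human = {"line_a": "abcd", "line_b": "efgh"}
model = {"line_a": "abcd", "line_b": "efgx"}
print(corpus_cer(human, model))
```

Micro-averaging weights long lines more heavily than short ones. Averaging per-line rates instead would be an equally plausible choice if the paper's protocol is not known.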
