This dataset contains minutes of the Belfort municipal council drawn up between 1790 and 1946. Documents include deliberations, lists of councillors, convocations, and agendas. The dataset comprises 24,105 text-line images that were automatically detected from pages. Up to 4 transcriptions are available for each line image: two from humans, and two from automatic models.
Files are organized in three folders: Images, Transcriptions, and Partitions.
The dataset includes 24,105 text-line images that were automatically detected using a generic Doc-UFCN model, then resized to a fixed height of 128 pixels.
Up to 4 transcriptions are available for each image, as summarized in the following table:
Folder | N transcriptions | Description | Comments |
---|---|---|---|
callico_1/ | 24,105 | Human annotation n°1 | All lines have at least one human annotation |
callico_2/ | 8,878 | Human annotation n°2 | Only about 37% of lines (8,878 of 24,105) have two different human annotations |
dan/ | 24,102 | DAN automatic model | 3 images have empty transcriptions (no text was predicted by the model) |
pylaia/ | 23,536 | PyLaia automatic model | 569 images have empty transcriptions (no text was predicted by the model) |
rasa/ | 23,287 | RASA aggregation algorithm | 818 images have empty transcriptions |
rover/ | 24,104 | ROVER aggregation algorithm | 1 image has an empty transcription |
We provide two distinct splits. Each contains 19,013 training images, 2,262 validation images, and 2,830 test images.
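
The counts above can be cross-checked with a few lines of Python. This is a minimal sketch that hard-codes the numbers from the table and the split sizes. The variable names are illustrative, and nothing here reads the dataset files.

```python
TOTAL_LINES = 24_105

# Number of transcriptions per folder, copied from the table above.
transcription_counts = {
    "callico_1": 24_105,
    "callico_2": 8_878,
    "dan": 24_102,
    "pylaia": 23_536,
    "rasa": 23_287,
    "rover": 24_104,
}

# Split sizes, identical for both provided splits.
split_sizes = {"train": 19_013, "val": 2_262, "test": 2_830}

# Share of line images covered by each transcription source.
coverage = {name: n / TOTAL_LINES for name, n in transcription_counts.items()}
# Lines with no (or an empty) transcription for each source.
missing = {name: TOTAL_LINES - n for name, n in transcription_counts.items()}
# Share of the dataset assigned to each partition.
split_share = {name: n / TOTAL_LINES for name, n in split_sizes.items()}

# The three partitions should account for every line image.
assert sum(split_sizes.values()) == TOTAL_LINES

for name in transcription_counts:
    print(f"{name:<10} {coverage[name]:6.1%}  missing or empty: {missing[name]}")
for name, share in split_share.items():
    print(f"{name:<6} {share:6.1%}")
```
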
Evaluation results in the paper are computed by comparing predictions to human annotations. Automatic and aggregated transcriptions are only used during model training.
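
As an illustration of comparing predictions to human annotations, the sketch below computes a character error rate (CER). The metric is an assumption: this page does not state which metric the paper reports, and CER is only a common choice for text-line recognition. The function names and the two example strings are made up for illustration and do not come from the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def corpus_cer(predictions: list[str], references: list[str]) -> float:
    """Total edit distance divided by total reference length (assumed metric)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    total_chars = sum(len(ref) for ref in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return total_edits / total_chars


# Hypothetical model outputs and human annotations, not taken from the dataset.
predictions = ["Conseil municipal", "Seance du 3 mai"]
human_references = ["Conseil municipal", "Séance du 3 mai"]
print(f"CER: {corpus_cer(predictions, human_references):.2%}")
```

Pooling edits over the whole test set, rather than averaging per-line rates, keeps short lines from dominating the score. That is a design choice of this sketch, not a documented property of the paper's evaluation.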