HierText

HierText is the first dataset featuring hierarchical annotations of text in natural scenes and documents. The dataset contains 11639 images selected from the Open Images dataset, providing high quality word (~1.2M), line, and paragraph level annotations. Text lines are defined as connected sequences of words that are aligned in spatial proximity and are logically connected. Text lines that belong to the same semantic topic and are geometrically coherent form paragraphs. Images in HierText are rich in text, with average of more than 100 words per image.

Papers


Paper Code Results Date Stars

Dataset Loaders


No data loaders found. You can submit your data loader here.

Tasks


Similar Datasets


License


  • Unknown

Modalities


Languages