TextCaps

Introduced by Sidorov et al. in TextCaps: a Dataset for Image Captioning with Reading Comprehension

TextCaps contains 145k captions for 28k images, roughly five captions per image. The dataset challenges a model to recognize text in an image, relate it to its visual context, and decide which parts of the text to copy or paraphrase. Doing so requires spatial, semantic, and visual reasoning over multiple text tokens and visual entities, such as objects.

Source: TextCaps: a Dataset for Image Captioning with Reading Comprehension
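
The "copy or paraphrase" decision described above can be illustrated with a minimal sketch. It assumes each image comes with a list of OCR tokens and a reference caption, and it checks which caption words were copied verbatim from the scene text. The function name `copied_tokens` and the sample data are hypothetical, chosen only for illustration; they are not taken from the dataset or its tooling.

```python
import re


def copied_tokens(caption: str, ocr_tokens: list[str]) -> list[str]:
    """Return the caption words that also occur among the OCR tokens.

    Matching is case-insensitive and exact per word; this is a
    simplification, not the matching used by any TextCaps baseline.
    """
    ocr_vocab = {token.lower() for token in ocr_tokens}
    words = re.findall(r"[A-Za-z0-9']+", caption)
    return [word for word in words if word.lower() in ocr_vocab]


# Hypothetical example (not a real dataset entry).
ocr_tokens = ["STOP", "Main", "St"]
caption = "A red stop sign stands at the corner of Main St"

print(copied_tokens(caption, ocr_tokens))  # ['stop', 'Main', 'St']
```

Words outside the returned list ("red", "sign", "corner") come from the visual context rather than from the scene text, which is the mix of reading and visual reasoning the dataset is meant to test.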
