``Caption'' as a Coherence Relation: Evidence and Implications
We study verbs in image{--}text corpora, contrasting \textit{caption} corpora, where texts are explicitly written to characterize image content, with \textit{depiction} corpora, where texts and images may stand in more general relations. Captions show a distinctively limited distribution of verbs, with strong preferences for specific tense, aspect, lexical aspect, and semantic field. These limitations, which appear in data elicited by a range of methods, restrict the utility of caption corpora for informing image retrieval, multimodal document generation, and perceptually grounded semantic models. We suggest that these limitations reflect the discourse constraints in play when subjects write texts to accompany imagery. We therefore argue that future development of image{--}text corpora should work to increase the diversity of event descriptions while looking explicitly at the different ways text and imagery can be coherently related.