8 dataset results for Cross-Modal Retrieval AND Texts

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators.

762 PAPERS • 9 BENCHMARKS

CUHK-PEDES

The CUHK-PEDES dataset is a caption-annotated pedestrian dataset. It contains 40,206 images of 13,003 persons. Images are collected from five existing person re-identification datasets (CUHK03, Market-1501, SSM, VIPER, and CUHK01), and each image is annotated with 2 text descriptions by crowd-sourcing workers. The sentences incorporate rich details about person appearance, actions, and poses.

80 PAPERS • 4 BENCHMARKS

Recipe1M+

Recipe1M+ is a dataset of one million structured cooking recipes with 13M associated images.

62 PAPERS • 3 BENCHMARKS

ChEBI-20

The ChEBI-20 dataset contains 33,010 molecule-description pairs, split 80%/10%/10% into train/val/test sets. The goal of the task is to retrieve the relevant molecule for a natural language description.
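
As an illustration of the 80%/10%/10% partition mentioned above, the following is a minimal sketch of how 33,010 pairs divide into train/val/test sets. The function name `split_pairs`, the seed, and the use of integer indices as stand-ins for molecule-description pairs are assumptions made for this example; it does not reproduce the membership of any published split files, which should take precedence where they exist.

```python
import random


def split_pairs(pairs, seed=0):
    """Shuffle the pairs and split them 80/10/10 into train/val/test lists.

    Hypothetical helper for illustration only; integer arithmetic is used
    so the split sizes do not depend on floating-point rounding.
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)  # seeded, so the split is repeatable
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Placeholder data: indices 0..33009 stand in for the 33,010 pairs.
train, val, test = split_pairs(range(33010))
print(len(train), len(val), len(test))  # 26408 3301 3301
```

With these sizes, a retrieval model would be fit on the training pairs and evaluated by ranking candidate molecules for each held-out description.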

25 PAPERS • 4 BENCHMARKS

SemArt

SemArt is a multi-modal dataset for semantic art understanding. It is a collection of fine-art painting images in which each image is associated with a number of attributes and a textual artistic comment, such as those that appear in art catalogues or museum collections. It contains 21,384 samples.

13 PAPERS • NO BENCHMARKS YET

Twitter100k

Twitter100k is a large-scale dataset for weakly supervised cross-media retrieval.

4 PAPERS • NO BENCHMARKS YET

IAPR TC-12 (IAPR TC-12 Benchmark)

The image collection of the IAPR TC-12 Benchmark consists of 20,000 still natural images taken from locations around the world. It includes pictures of different sports and actions, photographs of people, animals, cities, landscapes, and many other aspects of contemporary life. Each image is associated with a text caption in up to three different languages (English, German, and Spanish).

1 PAPER • NO BENCHMARKS YET

Song Describer Dataset

The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation and music-language retrieval.

1 PAPER • NO BENCHMARKS YET
