20 dataset results for Cross-Modal Retrieval

MS COCO (Microsoft Common Objects in Context)

The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.

10,098 PAPERS • 92 BENCHMARKS

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators.

727 PAPERS • 9 BENCHMARKS

NUS-WIDE

The NUS-WIDE dataset contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated with 81 concepts, including objects and scenes.

320 PAPERS • 3 BENCHMARKS

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition covering twenty selected object classes.

119 PAPERS • 14 BENCHMARKS

CUHK-PEDES

The CUHK-PEDES dataset is a caption-annotated pedestrian dataset. It contains 40,206 images of 13,003 persons. Images are collected from five existing person re-identification datasets (CUHK03, Market-1501, SSM, VIPeR, and CUHK01), and each image is annotated with 2 text descriptions by crowd-sourcing workers. Sentences incorporate rich details about person appearances, actions, and poses.

74 PAPERS • 3 BENCHMARKS

Recipe1M+

Recipe1M+ is a dataset which contains one million structured cooking recipes with 13M associated images.

62 PAPERS • 3 BENCHMARKS

RSICD (Remote Sensing Image Captioning Dataset)

The Remote Sensing Image Captioning Dataset (RSICD) is a dataset for the remote sensing image captioning task. It contains more than ten thousand remote sensing images collected from Google Earth, Baidu Map, MapABC and Tianditu. The images are fixed to 224×224 pixels with various resolutions. The total number of remote sensing images is 10,921, with five sentence descriptions per image.

40 PAPERS • 3 BENCHMARKS

ChEBI-20

The dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test splits. The goal of the task is to retrieve the relevant molecule for a natural language description.
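
As a rough illustration of the 80/10/10 proportions above, the following minimal sketch cuts 33,010 shuffled indices into train/val/test lists. The function name `split_indices`, the seed, and the rounding rule are assumptions made for this example; the published ChEBI-20 split files are authoritative and may not match these exact counts or assignments.

```python
import random


def split_indices(n_items, percents=(80, 10, 10), seed=0):
    """Shuffle range(n_items) and cut it into train/val/test index lists.

    Hypothetical helper: integer percentages avoid floating-point rounding,
    and any remainder from integer division goes to the test split.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible

    n_train = n_items * percents[0] // 100
    n_val = n_items * percents[1] // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(33010)
print(len(train), len(val), len(test))  # nominal sizes: 26408 3301 3301
```

These are nominal sizes only; they show what 80%/10%/10% of 33,010 pairs amounts to under this rounding rule.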

22 PAPERS • 4 BENCHMARKS

RSITMD

14 PAPERS • 1 BENCHMARK

SemArt

SemArt is a multi-modal dataset for semantic art understanding. SemArt is a collection of fine-art painting images in which each image is associated with a number of attributes and a textual artistic comment, such as those that appear in art catalogues or museum collections. It contains 21,384 samples that provide artistic comments along with fine-art paintings and their attributes for studying semantic art understanding.

13 PAPERS • NO BENCHMARKS YET

ChineseFoodNet

ChineseFoodNet aims at automatically recognizing pictured Chinese dishes. Most existing food image datasets collect food images either from recipe pictures or from selfies. In this dataset, the images of each food category consist not only of web recipe and menu pictures but also of photos taken of real dishes, recipes and menus. ChineseFoodNet contains over 180,000 food photos in 208 categories, with each category covering large variations in the presentation of the same Chinese food.

6 PAPERS • NO BENCHMARKS YET

SoundingEarth

SoundingEarth consists of co-located aerial imagery and audio samples from all around the world.

5 PAPERS • 1 BENCHMARK

Twitter100k

Twitter100k is a large-scale dataset for weakly supervised cross-media retrieval.

4 PAPERS • NO BENCHMARKS YET

CTC (COCO-Text Captioned)

A dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.

3 PAPERS • NO BENCHMARKS YET

Flickr-8k

Contains 8k Flickr images with captions.

3 PAPERS • 1 BENCHMARK

CiNAT-Birds-2021 (Cross-View iNaturalist Birds 2021)

The CiNAT Birds 2021 (Cross-View iNaturalist-2021 Birds) dataset contains ground-level images of bird species along with satellite images associated with the geolocation of the ground-level images. In total, there are 413,959 pairs for training and 14,831 pairs for validation and testing. The ground-level images are of varying sizes, while the satellite images are 256×256 pixels. Additionally, the dataset comes with rich metadata for each image: geolocation, date, observer ID, and taxonomy.

1 PAPER • NO BENCHMARKS YET

Earth on Canvas

A Zero-Shot Sketch-based Inter-Modal Object Retrieval Scheme for Remote Sensing Images

1 PAPER • NO BENCHMARKS YET

IAPR TC-12 (IAPR TC-12 Benchmark)

The image collection of the IAPR TC-12 Benchmark consists of 20,000 still natural images taken at locations around the world, forming an assorted cross-section of such images. This includes pictures of different sports and actions, photographs of people, animals, cities, landscapes, and many other aspects of contemporary life. Each image is associated with a text caption in up to three different languages (English, German and Spanish).

1 PAPER • NO BENCHMARKS YET

PoseScript

PoseScript is a dataset that pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. This dataset is designed for the retrieval of relevant poses from large-scale datasets and synthetic pose generation, both based on a textual pose description.

1 PAPER • NO BENCHMARKS YET

Song Describer Dataset

The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation and music-language retrieval.

1 PAPER • NO BENCHMARKS YET
