Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. To date, the most widely used resource for this task has been the MS-COCO dataset, which contains around 120,000 images, each annotated with five captions produced by paid annotators.
320 PAPERS • 2 BENCHMARKS
Fashion-Gen consists of 293,008 high-definition (1360 × 1360 pixels) fashion images paired with item descriptions provided by professional stylists. Each item is photographed from a variety of angles.
31 PAPERS • NO BENCHMARKS YET
Multi-Modal-CelebA-HQ is a large-scale face image dataset with 30,000 high-resolution face images selected from the CelebA dataset by following CelebA-HQ. Each image is paired with a high-quality segmentation mask, a sketch, descriptive text, and a version of the image with a transparent background.
27 PAPERS • 3 BENCHMARKS
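A minimal sketch of how one might represent a single Multi-Modal-CelebA-HQ record, assuming a local copy where each modality is stored as a separate file; the directory layout and file names below are illustrative assumptions, not the dataset's official structure.

from dataclasses import dataclass
from pathlib import Path

@dataclass
class CelebAHQMultiModalRecord:
    """One face with its paired modalities (hypothetical local layout)."""
    image: Path        # high-resolution face image
    mask: Path         # segmentation mask
    sketch: Path       # sketch rendering
    caption: str       # descriptive text
    transparent: Path  # image with transparent background

def load_record(root: Path, idx: int) -> CelebAHQMultiModalRecord:
    # Assumed layout: <root>/{images,masks,sketches,texts,transparent}/<idx>.*
    caption = (root / "texts" / f"{idx}.txt").read_text().strip()
    return CelebAHQMultiModalRecord(
        image=root / "images" / f"{idx}.jpg",
        mask=root / "masks" / f"{idx}.png",
        sketch=root / "sketches" / f"{idx}.png",
        caption=caption,
        transparent=root / "transparent" / f"{idx}.png",
    )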
The Pick-a-Pic dataset was created by logging user interactions with the Pick-a-Pic web application for text-to-image generation. Overall, the Pick-a-Pic dataset contains over 500,000 examples and 35,000 distinct prompts. Each example contains a prompt, two generated images, and a label indicating which image is preferred, or that there is a tie when neither image is significantly preferred over the other.
18 PAPERS • NO BENCHMARKS YET
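A minimal sketch of a Pick-a-Pic preference example as described above (prompt, two generated images, and a preference-or-tie label); the field names and the tie encoding are illustrative assumptions, not the released schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class PickAPicExample:
    """One logged comparison: a prompt, two generated images, and a preference."""
    prompt: str
    image_a: bytes            # first generated image (e.g. encoded JPEG)
    image_b: bytes            # second generated image
    preferred: Optional[str]  # "a", "b", or None when neither image is preferred (tie)

def preference_rate(examples: list[PickAPicExample]) -> float:
    """Fraction of comparisons with a clear (non-tie) preference."""
    decided = sum(1 for ex in examples if ex.preferred is not None)
    return decided / len(examples) if examples else 0.0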
T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional textual prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions).
17 PAPERS • NO BENCHMARKS YET
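Because T2I-CompBench organizes its 6,000 prompts by sub-category, a natural evaluation pattern is to average a per-prompt alignment score within each sub-category. A small sketch, assuming results are already available as (sub_category, prompt, score) tuples; the benchmark's own scoring functions are not shown.

from collections import defaultdict

def per_subcategory_means(results: list[tuple[str, str, float]]) -> dict[str, float]:
    """Average a per-prompt alignment score within each T2I-CompBench sub-category.

    results: (sub_category, prompt, score) tuples, e.g.
             ("color binding", "a red book and a yellow vase", 0.83)
    """
    buckets: dict[str, list[float]] = defaultdict(list)
    for sub_category, _prompt, score in results:
        buckets[sub_category].append(score)
    return {name: sum(scores) / len(scores) for name, scores in buckets.items()}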
HRS-Bench is a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. It measures 13 skills grouped into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes.
3 PAPERS • NO BENCHMARKS YET
Paper2Fig100k is a dataset with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision. Figures usually include text and discrete objects, e.g., boxes in a diagram, with lines and arrows that connect them.
2 PAPERS • NO BENCHMARKS YET
A large dataset of color names and their respective RGB values, stored in CSV format.
1 PAPER • 1 BENCHMARK
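A short sketch of parsing such a CSV into a name-to-RGB lookup table, assuming one row per color with name,r,g,b columns; the actual column names in the dataset may differ.

import csv

def load_color_table(path: str) -> dict[str, tuple[int, int, int]]:
    """Map color name -> (R, G, B), assuming columns: name,r,g,b."""
    table: dict[str, tuple[int, int, int]] = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            table[row["name"].strip().lower()] = (
                int(row["r"]), int(row["g"]), int(row["b"])
            )
    return table

# Example usage with a hypothetical file: load_color_table("colors.csv")["crimson"]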
ENTIGEN is a benchmark dataset for evaluating how image generations change under ethical interventions across three social axes: gender, skin color, and culture. It contains 246 prompts based on an attribute set covering diverse professions, objects, and cultural scenarios.
1 PAPER • NO BENCHMARKS YET
We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in one of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset poses a challenging fine-grained classification problem: the best-scoring EmbraceNet model, using both visual and textual features, achieves 69.7% accuracy. Experiments with a modified Imagen model show that the dataset is also suitable for image generation conditioned on text.
This dataset contains prompts designed to evaluate and challenge the safety mechanisms of generative text-to-image models, with a particular focus on identifying prompts that are likely to produce images containing nudity. Introduced in the 2024 ICML paper Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts, this dataset is not specific to any single approach or model but is intended to test various mitigating measures against inappropriate content generation in models like Stable Diffusion. This dataset is only for research purposes.
A Filipino multi-modal language dataset for text+visual tasks. It consists of 351,755 Filipino news articles gathered from Filipino news outlets.
0 PAPERS • NO BENCHMARKS YET
We construct a fine-grained video-text dataset with 12K annotated high-resolution videos (~400k clips). The annotation of this dataset is inspired by the video script: to make a video, one first writes a script that organizes how the scenes will be shot, and shooting a scene requires deciding its content, the shot type (medium shot, close-up, etc.), and how the camera moves (panning, tilting, etc.). We therefore extend video captioning to video scripting by annotating videos in the format of video scripts. Unlike previous video-text datasets, we densely annotate entire videos without discarding any scenes, and each scene has a caption of ~145 words. Beyond the visual modality, we transcribe the voice-over into text and provide it alongside the video title to give more background information for annotating the videos.
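A minimal sketch of a video-script-style scene annotation as described above, with one record per scene carrying the dense caption, shot type, camera movement, and the accompanying voice-over transcript; the field names are illustrative, not the released format.

from dataclasses import dataclass

@dataclass
class SceneAnnotation:
    """One densely annotated scene in video-script format (illustrative fields)."""
    video_id: str
    scene_index: int
    start_sec: float
    end_sec: float
    caption: str          # dense description, on the order of ~145 words
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "panning", "tilting"
    voice_over: str       # transcribed speech for this scene
    video_title: str      # background context shared across the video's scenes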