The Endomapper dataset is the first collection of complete endoscopy sequences acquired during regular medical practice, including slow and careful screening explorations, making secondary use of medical data. Its original purpose is to facilitate the development and evaluation of VSLAM (Visual Simultaneous Localization and Mapping) methods on real endoscopy data. The first release of the dataset comprises 50 sequences totaling more than 13 hours of video. It is also the first endoscopic dataset that includes both the computed geometric and photometric endoscope calibration and the original calibration videos. Metadata and annotations associated with the dataset range from anatomical-landmark and procedure-description labels, tool segmentation masks, COLMAP 3D reconstructions, and simulated sequences with ground truth, to metadata on special cases such as sequences from the same patient. This information is intended to foster research in endoscopic VSLAM.
12 PAPERS • NO BENCHMARKS YET
Multimodal C4 (MMC4) is an augmentation of the popular text-only C4 corpus with interleaved images. The corpus contains 103M documents with 585M images interleaved among 43B English tokens.
Existing hate speech datasets contain only textual data. We create a new, manually annotated multimodal hate speech dataset of 150,000 tweets, each containing text and an image. We call the dataset MMHS150K.
A dataset of color images corrupted by natural noise due to low-light conditions, together with spatially and intensity-aligned low noise images of the same scenes.
12 PAPERS • 1 BENCHMARK
Next generation task-oriented dialog systems need to understand conversational contexts together with their perceived surroundings, to effectively help users in the real-world multimodal environment. Existing task-oriented dialog datasets aimed at virtual assistance fall short, as they do not situate the dialog in the user's multimodal context. To overcome this, we present a new dataset for Situated and Interactive Multimodal Conversations, SIMMC 2.0, which includes 11K task-oriented user<->assistant dialogs (117K utterances) in the shopping domain, grounded in immersive and photo-realistic scenes. The dialogs are collected using a two-phase pipeline: (1) a novel multimodal dialog simulator generates simulated dialog flows, with an emphasis on diversity and richness of interactions; (2) manual paraphrasing of the generated utterances collects diverse referring expressions. We provide an in-depth analysis of the collected dataset and describe in detail the four main benchmark tasks we propose.
12 PAPERS • 2 BENCHMARKS
We construct the long-tailed version of VOC from its 2012 train-val set. It contains 1,142 images from 20 classes, with a maximum of 775 images per class and a minimum of 4 images per class. The ratio of head, medium, and tail classes after splitting is 6:6:8. We evaluate the performance on VOC2007 test set with 4952 images.
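As a minimal sketch of the kind of split described above, the snippet below groups classes into head, medium, and tail buckets from per-class image counts. The thresholds (100 and 20 images) and the helper name are illustrative assumptions, not the exact cutoffs used to build the VOC long-tailed split.

```python
# Illustrative sketch: bucketing classes of a long-tailed dataset into
# head / medium / tail groups from per-class image counts.
# The thresholds (100 and 20 images) are assumptions for illustration only.
from collections import Counter

def split_head_medium_tail(labels, head_min=100, tail_max=20):
    """labels: iterable of class ids, one per image."""
    counts = Counter(labels)
    head = [c for c, n in counts.items() if n >= head_min]
    medium = [c for c, n in counts.items() if tail_max < n < head_min]
    tail = [c for c, n in counts.items() if n <= tail_max]
    return head, medium, tail

# Toy example: one head, one medium, and one tail class.
# On the VOC long-tailed split, a suitable choice of thresholds
# yields the 6:6:8 head/medium/tail ratio described above.
toy_labels = [0] * 775 + [1] * 50 + [2] * 4
print(split_head_medium_tail(toy_labels))
```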
A large-scale database for Text-to-Image Person Re-identification, i.e., Text-based Person Retrieval.
11 PAPERS • 2 BENCHMARKS
Dataset Introduction
11 PAPERS • 1 BENCHMARK
P-DukeMTMC-reID is a modified version of the DukeMTMC-reID dataset. There are 12,927 images (665 identities) in the training set, 2,163 images (634 identities) for querying, and 9,053 images in the gallery set.
PIE-Bench comprises 700 images featuring 10 distinct editing types. Images are evenly distributed over natural and artificial scenes (e.g., paintings) across four categories: animal, human, indoor, and outdoor. Each image in PIE-Bench includes five annotations: source image prompt, target image prompt, editing instruction, main editing body, and editing mask. Notably, the editing mask annotation (indicating the anticipated editing region) is crucial for accurate metric computation, as we expect the editing to occur only within the designated area.
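One way such a mask can be used is to restrict metric computation to the region that should remain unchanged. The sketch below computes PSNR only over pixels outside the mask as a background-preservation check; this particular formulation is an illustrative assumption, not necessarily PIE-Bench's official evaluation protocol.

```python
# Sketch: using an editing mask to evaluate background preservation.
# PSNR is computed only over pixels *outside* the mask, i.e. the region
# that the edit is not supposed to touch. Illustrative assumption only.
import numpy as np

def psnr_outside_mask(source, edited, mask, max_val=255.0):
    """source, edited: HxWx3 arrays; mask: HxW bool, True = editable region."""
    keep = ~mask
    diff = source[keep].astype(np.float64) - edited[keep].astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(max_val ** 2 / mse)

# Usage: psnr_outside_mask(src_img, out_img, edit_mask) -> higher is better.
```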
Synbols is a dataset generator designed for probing the behavior of learning algorithms. By defining the distribution over latent factors, one can craft a dataset tailored to answer specific questions about a given algorithm.
11 PAPERS • NO BENCHMARKS YET
Talk The Walk is a large-scale dialogue dataset grounded in action and perception. The task involves two agents (a “guide” and a “tourist”) that communicate via natural language in order to achieve a common goal: having the tourist navigate to a given target location.
This dataset includes 4,500 fully annotated images (over 30,000 license plate characters) from 150 vehicles in real-world scenarios where both the vehicle and the camera (inside another vehicle) are moving.
The evaluation of human epidermal growth factor receptor 2 (HER2) expression is essential to formulate a precise treatment for breast cancer. The routine evaluation of HER2 is conducted with immunohistochemical techniques (IHC), which is very expensive. Therefore, we propose a breast cancer immunohistochemical (BCI) benchmark attempting to synthesize IHC data directly with the paired hematoxylin and eosin (HE) stained images. The dataset contains 4870 registered image pairs, covering a variety of HER2 expression levels (0, 1+, 2+, 3+).
10 PAPERS • 1 BENCHMARK
FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.
10 PAPERS • NO BENCHMARKS YET
The HyperKvasir dataset contains 110,079 images and 374 videos capturing anatomical landmarks as well as pathological and normal findings, amounting to around 1 million images and video frames in total.
10 PAPERS • 2 BENCHMARKS
The multimodal material segmentation (MCubeS) dataset contains 500 sets of images from 42 street scenes. Each scene has images in four modalities: RGB, angle of linear polarization (AoLP), degree of linear polarization (DoLP), and near-infrared (NIR). The dataset provides annotated ground-truth labels for both material and semantic segmentation for every pixel. It is divided into a training set with 302 image sets, a validation set with 96 image sets, and a test set with 102 image sets. Each image has 1224 x 1024 pixels, and every pixel is labeled with one of 20 classes.
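A minimal sketch of loading one four-modality sample is shown below. The directory layout and file names are assumptions made for illustration; adapt them to the actual release structure.

```python
# Sketch of a four-modality sample loader for an MCubeS-style dataset.
# Folder names and file naming are assumptions, not the official layout.
from pathlib import Path

import numpy as np
from PIL import Image

MODALITIES = ("rgb", "aolp", "dolp", "nir")

def load_sample(root, scene_id):
    root = Path(root)
    sample = {m: np.array(Image.open(root / m / f"{scene_id}.png")) for m in MODALITIES}
    # Per-pixel material labels (integer values in [0, 19] for the 20 classes).
    sample["material"] = np.array(Image.open(root / "labels_material" / f"{scene_id}.png"))
    return sample
```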
In this project, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model needs to link an image to a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets, grounding all labels onto a single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels.
A large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation.
SCICAP is a large-scale image captioning dataset that contains real-world scientific figures and captions. SCICAP was constructed using more than two million images from over 290,000 papers collected and released by arXiv.
We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems spanning Math, EE&CS, Physics, and Finance. The dataset was collected by human experts to ensure very high quality. We provide it as a new benchmark to test the ability of large language models to apply theorems to solve challenging university-level questions, together with a pipeline to prompt LLMs and evaluate their outputs with WolframAlpha.
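A hedged sketch of such a prompt-and-score loop is given below. The prompt wording, answer parsing, and numeric tolerance are illustrative assumptions, not the authors' exact pipeline (which additionally verifies outputs with WolframAlpha), and only numeric answers are handled here.

```python
# Hedged sketch of a prompt-and-score loop for theorem-driven QA.
# Prompt template, parsing, and tolerance are assumptions for illustration.
import re

PROMPT = (
    "You are given a university-level problem that requires applying a theorem.\n"
    "Problem: {question}\n"
    "State the relevant theorem, then give the final numeric answer after the tag 'Answer:'."
)

def parse_numeric_answer(text):
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

def is_correct(prediction_text, gold, rel_tol=1e-2):
    pred = parse_numeric_answer(prediction_text)
    if pred is None:
        return False
    return abs(pred - gold) <= rel_tol * max(1.0, abs(gold))

# Usage (llm is any text-generation callable supplied by the user):
# accuracy = sum(is_correct(llm(PROMPT.format(question=q)), a) for q, a in qa_pairs) / len(qa_pairs)
```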
ViP-Bench is a comprehensive benchmark designed to assess the capability of multimodal models in understanding visual prompts across multiple dimensions. It aims to evaluate how well these models interpret various visual prompts, including recognition, OCR, knowledge, math, relationship reasoning, and language generation. ViP-Bench includes a diverse set of 303 images and questions, providing a thorough assessment of visual understanding capabilities at the region level. This benchmark sets a foundation for future research into multimodal models with arbitrary visual prompts.
Our dataset consists of multiple indoor and outdoor experiments with gNB-UE links of up to 30 m. In each experiment, we fix the location of the gNB and move the UE in increments of roughly one degree. The table above specifies the direction of user movement with respect to the gNB-UE link, the distance resolution, and the number of user locations at which we conduct channel measurements. The outdoor 30 m data also contains blockage between 3.9 m and 4.8 m. At each location, we scan the transmission beam and collect data for each beam. By doing so, we obtain the full OFDM channels at all beam angles for the different locations along the moving trajectory. Moreover, we use 240 kHz subcarrier spacing, consistent with the 5G NR numerology at FR2, so the collected data is a true reflection of what a 5G UE would see.
9 PAPERS • NO BENCHMARKS YET
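As a worked example of the numerology mentioned in the measurement description above, the 240 kHz subcarrier spacing fixes the basic OFDM timing; the values below follow directly from SCS = 15 kHz × 2^μ in 5G NR.

```python
# Worked example: timing implied by a 240 kHz subcarrier spacing (5G NR FR2).
# SCS = 15 kHz * 2**mu, so 240 kHz corresponds to numerology mu = 4.
scs_hz = 240e3
mu = 4                                          # 15e3 * 2**4 == 240e3
symbol_duration_us = 1e6 / scs_hz               # useful OFDM symbol ≈ 4.17 µs (excl. cyclic prefix)
slots_per_subframe = 2 ** mu                    # 16 slots per 1 ms subframe
slot_duration_us = 1000 / slots_per_subframe    # 62.5 µs per slot (14 OFDM symbols)
print(symbol_duration_us, slots_per_subframe, slot_duration_us)
```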
ARCTIC is a dataset of free-form interactions of hands and articulated objects. ARCTIC has 1.2M images paired with accurate 3D meshes for both hands and for objects that move and deform over time. The dataset also provides hand-object contact information.
Adaptiope is a domain adaptation dataset with 123 classes in three domains: synthetic, product, and real life. One of the main goals of Adaptiope is to offer a clean and well-curated set of images for domain adaptation; this is necessary because many other common datasets in the area suffer from label noise and low-quality images. Additionally, Adaptiope's class set was chosen to minimize the overlap with the class set of the commonly used ImageNet pretraining, thereby preventing information leakage in a domain adaptation setup.
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Using BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal that: 1) LMMs generally suffer performance degradation when working with other styles; 2) an LMM that performs better than another model in the common style is not guaranteed to perform better in other styles; 3) LMMs' reasoning capability can be enhanced by prompting them to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; and 4) an intelligent LMM is expected to interpret the causes of its errors when facing stylistic variations.
9 PAPERS • 1 BENCHMARK
CIRCO (Composed Image Retrieval on Common Objects in context) is an open-domain benchmarking dataset for Composed Image Retrieval (CIR) based on real-world images from COCO 2017 unlabeled set. It is the first CIR dataset with multiple ground truths and aims to address the problem of false negatives in existing datasets. CIRCO comprises a total of 1020 queries, randomly divided into 220 and 800 for the validation and test set, respectively, with an average of 4.53 ground truths per query.
FMB contains 1500 well-registered infrared and visible image pairs with 14 annotated pixel-level categories. It also covers a wide range of pixel variations and various severe environments, e.g., dense fog, heavy rain, and low-light conditions. The FMB dataset includes rich scenes under different illumination conditions, enabling fusion and segmentation models to generalize much better. We labeled 98.16% of all pixels into 14 categories: Road, Sidewalk, Building, Traffic Light, Traffic Sign, Vegetation, Sky, Person, Car, Truck, Bus, Motorcycle, Bicycle, and Pole, which often appear in real-world autonomous driving and semantic understanding tasks.
PointQA is a set of datasets for Visual Question Answering (VQA) that require a pointer to an object in the image to be answered correctly. The datasets are: PointQA-Local, PointQA-LookTwice, and PointQA-General.
The first large demoiréing dataset. The dataset contains 135,000 image pairs, each consisting of an image contaminated with moiré patterns and its corresponding uncontaminated reference image.
UDIS-D is a large image dataset for image stitching or image registration. It contains different overlap rates, varying degrees of parallax, and variable scenes such as indoor, outdoor, night, dark, snow, and zooming.
V-D4RL provides pixel-based analogues of the popular D4RL benchmarking tasks, derived from the dm_control suite, along with natural extensions of two state-of-the-art online pixel-based continuous control algorithms, DrQ-v2 and DreamerV2, to the offline setting.
VideoLQ consists of videos with a Creative Commons license downloaded from various video hosting sites such as Flickr and YouTube.
4D-OR includes a total of 6734 scenes recorded by six calibrated RGB-D Kinect sensors mounted on the ceiling of the OR at one frame per second, providing synchronized RGB and depth images. We provide fused point cloud sequences of entire scenes, automatically annotated human 6D poses, and 3D bounding boxes for OR objects. Furthermore, we provide SSG annotations for each step of the surgery, together with the clinical roles of all the humans in the scenes, e.g., nurse, head surgeon, anesthesiologist.
8 PAPERS • 1 BENCHMARK
The BIMCV-COVID19+ dataset is a large dataset of chest X-ray (CR, DX) and computed tomography (CT) imaging of COVID-19 patients, along with their radiographic findings, pathologies, polymerase chain reaction (PCR), immunoglobulin G (IgG) and immunoglobulin M (IgM) diagnostic antibody tests, and radiographic reports from the Valencian Region Medical Image Bank (BIMCV). The findings are mapped onto standard Unified Medical Language System (UMLS) terminology and cover a wide spectrum of thoracic entities, in contrast with the much smaller number of entities annotated in previous datasets. Images are stored in high resolution, and entities are localized with anatomical labels in a Medical Imaging Data Structure (MIDS) format. In addition, 23 images were annotated by a team of expert radiologists to include semantic segmentation of radiographic findings. Moreover, extensive information is provided, including the patient's demographic information and the type of projection and acquisition parameters of the imaging study.
8 PAPERS • NO BENCHMARKS YET
Bongard-HOI tests to what extent a few-shot visual learner can quickly induce the true HOI concept from a handful of images and perform reasoning with it. Further, the learner is also expected to transfer the learned few-shot skills to novel HOI concepts compositionally.
Flare7K is the first nighttime flare removal dataset, generated based on observations and statistics of real-world nighttime lens flares. It offers 5,000 scattering flare images and 2,000 reflective flare images, comprising 25 types of scattering flares and 10 types of reflective flares. The 7,000 flare patterns can be randomly added to flare-free images, forming flare-corrupted and flare-free image pairs.
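A minimal sketch of forming such a pair is shown below: a randomly chosen flare pattern is added onto a clean image. Simple additive blending in approximately linear space is an illustrative assumption; the official Flare7K synthesis pipeline may apply additional processing.

```python
# Sketch: forming a (flare-corrupted, flare-free) training pair by adding a
# randomly chosen flare pattern to a clean image. Additive blending in
# roughly linear space is an assumption; it also assumes the flare pattern
# has already been resized to the clean image's resolution.
import random

import numpy as np

def add_flare(clean_img, flare_imgs, gamma=2.2):
    """clean_img: HxWx3 float in [0, 1]; flare_imgs: list of HxWx3 floats in [0, 1]."""
    flare = random.choice(flare_imgs)
    corrupted_lin = clean_img ** gamma + flare ** gamma          # blend in linear space
    corrupted = np.clip(corrupted_lin, 0.0, 1.0) ** (1.0 / gamma)  # back to sRGB-like space
    return corrupted, clean_img  # (flare-corrupted, flare-free) pair
```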
The Global WHEAT dataset is the first large-scale dataset for wheat head detection from field optical images. It includes a very large range of cultivars from different continents. Wheat is a staple crop grown all over the world, and interest in wheat phenotyping consequently spans the globe. Therefore, it is important that models developed for wheat phenotyping, such as wheat head detection networks, generalize between different growing environments around the world.
The High-Quality Wide Multi-Channel Attack (HQ-WMCA) database consists of 2904 short multi-modal video recordings of both bona-fide and presentation attacks. There are 555 bona-fide presentations from 51 participants, and the remaining 2349 are presentation attacks. The data is recorded from several channels including color, depth, thermal, infrared (spectra), and short-wave infrared (spectra).
Given 10 minimally contrastive (highly similar) images and a complex description of one of them, the task is to retrieve the correct image. Most images are sourced from videos, and both the descriptions and the retrievals come from humans.
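A simple zero-shot baseline for this retrieval setup is to score the 10 candidates against the description with CLIP and take the argmax, as sketched below. This is only a common starting point, not the benchmark's own method, and the model checkpoint is just an example; note that CLIP's 77-token limit may truncate long descriptions.

```python
# Illustrative zero-shot baseline: score 10 candidate images against one
# description with CLIP and retrieve the highest-scoring image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(description, image_paths):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[description], images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_text  # shape: (1, num_images)
    return int(logits.argmax(dim=-1))             # index of the predicted image
```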
The MMVP (Multimodal Visual Patterns) Benchmark focuses on identifying "CLIP-blind pairs" – images that appear similar to the CLIP model despite having clear visual differences. These patterns highlight the challenges these systems face in answering straightforward questions, often leading to incorrect responses and hallucinated explanations.
We build a large-scale, comprehensive, and high-quality synthetic dataset for city-scale neural rendering research. Leveraging the Unreal Engine 5 City Sample project, we developed a pipeline to easily collect aerial and street city views with ground-truth camera poses, as well as a series of additional data modalities. Flexible control over environmental factors such as lighting, weather, and human and car crowds is also available in our pipeline, supporting the needs of various tasks covering city-scale neural rendering and beyond. The resulting pilot dataset, MatrixCity, contains 67k aerial images and 452k street images from two city maps with a total area of 28 km^2.
The OPRA dataset was introduced in Demo2Vec: Reasoning Object Affordances From Online Videos (CVPR'18) for reasoning about object affordances from online demonstration videos. It contains 11,505 demonstration clips and 2,512 object images scraped from 6 popular YouTube product review channels, along with the corresponding affordance annotations. More details can be found at https://sites.google.com/view/demo2vec/.
8 PAPERS • 2 BENCHMARKS
Open-Platypus is the curated dataset behind Platypus, a family of fine-tuned and merged Large Language Models (LLMs) that achieved the strongest performance and stood at first place on HuggingFace's Open LLM Leaderboard at the time of release.
OpenLane-V2 is the world's first perception and reasoning benchmark for scene structure in autonomous driving. The primary task of the dataset is scene structure perception and reasoning, which requires the model to recognize the dynamic drivable states of lanes in the surrounding environment. The challenge of this dataset includes not only detecting lane centerlines and traffic elements but also recognizing the attributes of traffic elements and the topology relationships among detected objects.
TMED is a clinically-motivated benchmark dataset for computer vision and machine learning from limited labeled data.
A multi-task 4D radar-camera fusion dataset for autonomous driving on water surfaces.
e-ViL is a benchmark for explainable vision-language tasks. e-ViL spans across three datasets of human-written NLEs (natural language explanations), and provides a unified evaluation framework that is designed to be re-usable for future works.