The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation of two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption is correctly describing the image (True) or not (False).
41 PAPERS • 1 BENCHMARK
e-SNLI-VE is a large VL (vision-language) dataset with NLEs (natural language explanations) with over 430k instances for which the explanations rely on the image content. It has been built by merging the explanations from e-SNLI and the image-sentence pairs from SNLI-VE.
15 PAPERS • 2 BENCHMARKS