ViP-Bench (Making Large Multimodal Models Understand Arbitrary Visual Prompts)

Introduced by Cai et al. in ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

ViP-Bench is a comprehensive benchmark designed to assess how well multimodal models understand arbitrary visual prompts at the region level. It evaluates models across six dimensions: recognition, OCR, knowledge, math, relationship reasoning, and language generation. The benchmark comprises a diverse set of 303 image-question pairs, providing a thorough assessment of region-level visual understanding and a foundation for future research on multimodal models that accept arbitrary visual prompts.
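As a rough illustration of how the benchmark's per-dimension scores could be aggregated, the sketch below loads question/answer metadata and averages a grade for each capability dimension. The file name (vip-bench_meta_data.json), the record fields (id, answer, capability), and the exact-match grading are assumptions for illustration only; the benchmark's official scoring is model-graded rather than exact-match.

```python
# Minimal ViP-Bench-style scoring sketch. File name, field names, and the
# exact-match grade are illustrative assumptions, not the official harness.
import json
from collections import defaultdict


def load_vip_bench(meta_path: str):
    """Load benchmark metadata: one record per image-question pair."""
    with open(meta_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Accept either a dict keyed by example id or a plain list of records.
    return list(data.values()) if isinstance(data, dict) else data


def score_predictions(records, predictions):
    """Average a 0-10 grade per capability dimension (recognition, OCR, ...)."""
    per_dim = defaultdict(list)
    for rec in records:
        pred = predictions.get(rec["id"], "")
        # Exact-match stand-in for the benchmark's model-based grading.
        grade = 10.0 if pred.strip().lower() == rec["answer"].strip().lower() else 0.0
        for dim in rec.get("capability", "").split(","):
            per_dim[dim.strip()].append(grade)
    return {dim: sum(g) / len(g) for dim, g in per_dim.items() if g}


if __name__ == "__main__":
    records = load_vip_bench("vip-bench_meta_data.json")  # assumed file name
    preds = {r["id"]: "" for r in records}                # placeholder model outputs
    print(score_predictions(records, preds))
```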
