The MMVP-VLM (Multimodal Visual Patterns - Visual Language Models) Benchmark is designed to systematically evaluate how well recent CLIP-based models understand and process visual patterns.

  • Purpose: The benchmark assesses whether CLIP models can correctly match image-text combinations that exemplify distinct visual patterns. It distills a subset of questions from the original MMVP benchmark into simpler language descriptions and groups them by visual pattern.

  • Dataset Composition:

    • Text-Image Pairs: Each visual pattern is represented by 15 text-image pairs, so every pattern has the same number of questions. The pairs are a subset of the MMVP benchmark, supplemented with additional questions to keep the patterns balanced.
    • Visual Patterns: The questions span a range of visual patterns, allowing a per-pattern evaluation of how well CLIP models understand and process each one.
  • Insights and Limitations: By checking whether CLIP models can accurately match the provided image-text combinations, the benchmark exposes both the capabilities and the limitations of these models; a minimal evaluation sketch follows this list.
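
The matching protocol can be illustrated with a short sketch: given two images that differ only in the visual pattern of interest and the two corresponding captions, a CLIP model should assign each caption to the correct image. The snippet below is a minimal illustration using the Hugging Face transformers CLIP API; the file names, caption strings, and the "both assignments must be correct" scoring rule are illustrative assumptions rather than the benchmark's official evaluation code.

```python
# Minimal sketch of pairwise image-text matching with CLIP.
# Assumptions (not from the dataset release): local image paths,
# example captions, and an all-or-nothing per-pair scoring rule.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical pair: two images differing in a single visual pattern (orientation).
images = [Image.open("dog_facing_left.jpg"), Image.open("dog_facing_right.jpg")]
captions = ["a dog facing left", "a dog facing right"]

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image[i, j] is the similarity of image i to caption j.
sims = outputs.logits_per_image

# Count the pair as correct only if each image is matched to its own caption.
pred = sims.argmax(dim=1)
pair_correct = bool((pred == torch.arange(len(images))).all())
print("pair correct:", pair_correct)
```

Averaging this per-pair accuracy within each visual pattern would yield a per-pattern score of the kind the benchmark is intended to surface.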
