We study visually grounded VideoQA in response to the emerging trend of utilizing pretraining techniques for video-language understanding. Specifically, by forcing vision-language models (VLMs) to answer questions and simultaneously provide visual evidence, we seek to ascertain the extent to which their predictions are genuinely anchored in relevant video content, rather than in spurious correlations with language or irrelevant visual context. Towards this, we construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding (or location) labels tied to the original QA pairs. With NExT-GQA, we scrutinize a variety of state-of-the-art VLMs. Through post-hoc attention analysis, we find that these models are weak at substantiating their answers despite their strong QA performance. This exposes a severe limitation of such models in making reliable predictions.
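To make the grounded-QA setup concrete, below is a minimal sketch of how a model could be scored on NExT-GQA-style annotations: an answer counts only if it is correct and its predicted temporal window sufficiently overlaps the human-labelled evidence segment. The field names (`pred_segment`, `gt_segment`), the dataclass, and the 0.5 IoU threshold are illustrative assumptions, not the official NExT-GQA evaluation code, whose metrics may be defined differently.

```python
# Sketch of a grounded-QA score: a prediction counts only if the answer is
# correct AND the predicted temporal window overlaps the annotated evidence
# segment. Field names and threshold are assumptions, not the official metric.
from dataclasses import dataclass


@dataclass
class GroundedQAItem:
    question_id: str
    answer_correct: bool                 # did the model answer correctly?
    pred_segment: tuple[float, float]    # predicted (start, end) in seconds
    gt_segment: tuple[float, float]      # annotated (start, end) in seconds


def temporal_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Intersection-over-union of two temporal windows."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def grounded_qa_accuracy(items: list[GroundedQAItem], iou_thresh: float = 0.5) -> float:
    """Fraction of questions answered correctly AND grounded in the labelled segment."""
    if not items:
        return 0.0
    hits = sum(
        1
        for it in items
        if it.answer_correct and temporal_iou(it.pred_segment, it.gt_segment) >= iou_thresh
    )
    return hits / len(items)


if __name__ == "__main__":
    items = [
        GroundedQAItem("q1", True, (12.0, 18.0), (11.5, 17.0)),   # correct and grounded
        GroundedQAItem("q2", True, (0.0, 3.0), (40.0, 45.0)),     # correct but ungrounded
        GroundedQAItem("q3", False, (20.0, 25.0), (21.0, 26.0)),  # grounded but wrong answer
    ]
    print(f"Grounded QA accuracy: {grounded_qa_accuracy(items):.2f}")
```

A joint metric like this separates models that truly "look at" the relevant moment from those that answer correctly via language priors or irrelevant context, which is the distinction the benchmark is designed to expose.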
