TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Question Answering	How2QA	ATP	Accuracy	65.1	# 6
Video Question Answering	MSR-VTT-MC	ATP (1<-16)	Accuracy	93.2	# 1
Video Question Answering	NExT-QA	ATP	Accuracy	54.3	# 22
Video Question Answering	STAR Benchmark	Temp[ATP]	Average Accuracy	48.37	# 6

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/revisiting-the-video-in-video-language/video-question-answering-on-msr-vtt-mc)](https://paperswithcode.com/sota/video-question-answering-on-msr-vtt-mc?p=revisiting-the-video-in-video-language)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/revisiting-the-video-in-video-language/video-question-answering-on-how2qa)](https://paperswithcode.com/sota/video-question-answering-on-how2qa?p=revisiting-the-video-in-video-language)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/revisiting-the-video-in-video-language/video-question-answering-on-situated)](https://paperswithcode.com/sota/video-question-answering-on-situated?p=revisiting-the-video-in-video-language)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/revisiting-the-video-in-video-language/video-question-answering-on-next-qa)](https://paperswithcode.com/sota/video-question-answering-on-next-qa?p=revisiting-the-video-in-video-language)`

Revisiting the "Video" in Video-Language Understanding

CVPR 2022 · Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, Juan Carlos Niebles

What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
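The constraint the abstract describes, answering from image-level understanding alone, can be illustrated with a minimal sketch. It assumes frame and answer embeddings have already been produced by some frozen image-language encoder. The function names, the max-similarity selection rule, and the toy vectors are all hypothetical, and this is not the authors' ATP implementation: it replaces whatever selection ATP performs with a simple hand-written rule, purely to show what "atemporal" means in practice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_from_single_frame(frame_embs, answer_embs):
    """Answer a multiple-choice question using exactly one frame.

    Hypothetical selection rule (an assumption of this sketch): take the
    (frame, answer) pair with the highest cosine similarity. Frame order
    is never consulted, so the result is invariant to shuffling frames.
    """
    candidates = [
        (cosine(frame, answer), -idx, tuple(frame))
        for frame in frame_embs
        for idx, answer in enumerate(answer_embs)
    ]
    # max() over the whole set makes the choice independent of frame order;
    # ties fall to the lower answer index, then to the frame values.
    score, neg_idx, frame = max(candidates)
    return -neg_idx, frame, score


# Toy 2-D embeddings standing in for encoder outputs (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
answers = [[0.9, 0.1], [-1.0, 0.0]]

choice, frame, score = answer_from_single_frame(frames, answers)
reordered = answer_from_single_frame(list(reversed(frames)), answers)
print(choice, frame, reordered[0] == choice)
```

Because the prediction depends on one frame and ignores ordering, reversing or shuffling the frames leaves the chosen answer unchanged. A benchmark on which such a model scores well is, in the abstract's terms, one where understanding of event temporality is not required.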

Code

stanfordvl/atp-video-language

Tasks

Benchmarking

Question Answering

Retrieval

Text to Video Retrieval

Video Question Answering

Video Retrieval

Datasets

Visual Question Answering

MSR-VTT

NExT-QA

How2QA

MSRVTT-MC

STAR Benchmark

Results from the Paper

Ranked #1 on Video Question Answering on MSR-VTT-MC


