TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Retrieval	ActivityNet	BT-Adapter	text-to-video R@1	37.0	# 7
Zero-Shot Video Retrieval	ActivityNet	BT-Adapter	text-to-video R@5	66.7	# 6
Zero-Shot Video Retrieval	ActivityNet	BT-Adapter	text-to-video R@10	78.9	# 6
Zero-Shot Video Question Answer	ActivityNet-QA	BT-Adapter (zero-shot)	Confidence Score	3.2	# 11
Zero-Shot Video Question Answer	ActivityNet-QA	BT-Adapter (zero-shot)	Accuracy	46.1	# 9
Video Question Answering	ActivityNet-QA	BT-Adapter (zero-shot)	Accuracy	46.1	# 14
Video Question Answering	ActivityNet-QA	BT-Adapter (zero-shot)	Confidence score	3.6	# 1
Zero-Shot Video Retrieval	DiDeMo	BT-Adapter	text-to-video R@1	35.6	# 13
Zero-Shot Video Retrieval	DiDeMo	BT-Adapter	text-to-video R@5	61.9	# 10
Zero-Shot Video Retrieval	DiDeMo	BT-Adapter	text-to-video R@10	72.6	# 10
Zero-Shot Video Retrieval	LSMDC	BT-Adapter	text-to-video R@1	19.5	# 5
Zero-Shot Video Retrieval	LSMDC	BT-Adapter	text-to-video R@5	35.9	# 6
Zero-Shot Video Retrieval	LSMDC	BT-Adapter	text-to-video R@10	45.0	# 5
Zero-Shot Video Retrieval	MSR-VTT	BT-Adapter	text-to-video R@1	40.9	# 9
Zero-Shot Video Retrieval	MSR-VTT	BT-Adapter	text-to-video R@5	64.7	# 7
Zero-Shot Video Retrieval	MSR-VTT	BT-Adapter	text-to-video R@10	73.5	# 7
Zero-Shot Video Question Answer	MSRVTT-QA	BT-Adapter (zero-shot)	Accuracy	51.2	# 15
Zero-Shot Video Question Answer	MSRVTT-QA	BT-Adapter (zero-shot)	Confidence Score	2.9	# 13
Zero-Shot Video Question Answer	MSVD-QA	BT-Adapter (zero-shot)	Accuracy	67.0	# 11
Zero-Shot Video Question Answer	MSVD-QA	BT-Adapter (zero-shot)	Confidence Score	3.6	# 10
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	BT-Adapter (zero-shot)	gpt-score	2.16	# 10
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter	Correctness of Information	2.68	# 11
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter	Detail Orientation	2.69	# 11
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter	Contextual Understanding	3.27	# 11
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter	Temporal Understanding	2.34	# 11
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter	Consistency	2.46	# 11
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter	mean	2.69	# 11
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter (zero-shot)	Correctness of Information	2.16	# 14
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter (zero-shot)	Detail Orientation	2.46	# 14
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter (zero-shot)	Contextual Understanding	2.89	# 12
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter (zero-shot)	Temporal Understanding	2.13	# 12
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter (zero-shot)	Consistency	2.2	# 14
Video-based Generative Performance Benchmarking	VideoInstruct	BT-Adapter (zero-shot)	mean	2.46	# 12
Video-based Generative Performance Benchmarking (Temporal Understanding)	VideoInstruct	BT-Adapter	gpt-score	2.34	# 6
Video-based Generative Performance Benchmarking (Temporal Understanding)	VideoInstruct	BT-Adapter (zero-shot)	gpt-score	2.13	# 8
Video-based Generative Performance Benchmarking (Detail Orientation)	VideoInstruct	BT-Adapter (zero-shot)	gpt-score	2.46	# 10
Video-based Generative Performance Benchmarking (Detail Orientation)	VideoInstruct	BT-Adapter	gpt-score	2.69	# 7
Video-based Generative Performance Benchmarking (Contextual Understanding)	VideoInstruct	BT-Adapter	gpt-score	3.27	# 6
Video-based Generative Performance Benchmarking (Contextual Understanding)	VideoInstruct	BT-Adapter (zero-shot)	gpt-score	2.89	# 8
Video-based Generative Performance Benchmarking (Consistency)	VideoInstruct	BT-Adapter	gpt-score	2.46	# 6
Video-based Generative Performance Benchmarking (Consistency)	VideoInstruct	BT-Adapter (zero-shot)	gpt-score	2.2	# 10
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	BT-Adapter	gpt-score	2.68	# 7
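
The retrieval rows above report text-to-video R@1, R@5, and R@10. As a minimal sketch of how such a recall-at-K figure is commonly computed, the function below assumes a square similarity matrix in which query `i` has exactly one correct video at index `i`, and counts ties in favor of the correct video. The function name and these conventions are illustrative assumptions, not the evaluation code behind the numbers in the table.

```python
from typing import Dict, Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Text-to-video Recall@K, in percent.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    num_queries = len(similarity)
    hits = {k: 0 for k in ks}
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        for k in ks:
            if rank < k:
                hits[k] += 1
    return {k: 100.0 * hits[k] / num_queries for k in ks}


# Toy example: 3 text queries against 3 videos.
toy_similarity = [
    [0.9, 0.1, 0.2],  # correct video ranked first
    [0.8, 0.3, 0.1],  # correct video ranked second
    [0.2, 0.7, 0.4],  # correct video ranked second
]
toy_recall = recall_at_k(toy_similarity, ks=(1, 2))
```

Here one of the three queries retrieves its video at rank 1 and all three do so within the top 2, so R@1 is one third and R@2 is 100%.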

One For All: Video Conversation is Feasible Without Video Instruction Tuning

27 Sep 2023 · Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li

Recent progress in Large Language Models (LLMs) has spurred various advances in image-language conversation agents, but how to build a proficient video-based dialogue system remains under exploration. Given the extensive scale of the LLM and the visual backbone, little GPU memory is left for effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder; the branch is tuned while the backbone is kept frozen. Pretrained just once, BT-Adapter can be seamlessly integrated into all image conversation models that use the same version of CLIP, enabling video conversations without the need for video instructions. In addition, we develop a unique asymmetric token masking strategy inside the branch, with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art video chatting results with video instruction tuning, outperforming previous SOTA methods by a large margin.
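
The idea of a trainable temporal branch running beside a frozen per-frame encoder can be illustrated with a toy sketch. Everything below is a hypothetical stand-in and not BT-Adapter's architecture: the fixed `frozen_frame_encoder`, the frame-difference `TemporalBranch`, the zero-initialized scale, and the additive fusion are assumptions chosen only to show that the frozen path is left untouched while the branch contributes temporal information.

```python
from typing import List

Vector = List[float]


def frozen_frame_encoder(frame: Vector) -> Vector:
    """Stand-in for a frozen image encoder: a fixed map with no trainable state."""
    return [2.0 * x for x in frame]


class TemporalBranch:
    """Hypothetical trainable branch that summarizes frame-to-frame change."""

    def __init__(self, dim: int) -> None:
        # Assumption: zero-initialized scale, so the branch starts as a no-op.
        self.scale: Vector = [0.0] * dim

    def __call__(self, frame_feats: List[Vector]) -> Vector:
        dim = len(self.scale)
        steps = len(frame_feats) - 1
        motion = [0.0] * dim
        # Average difference between consecutive frame features.
        for t in range(1, len(frame_feats)):
            for d in range(dim):
                motion[d] += (frame_feats[t][d] - frame_feats[t - 1][d]) / steps
        return [s * m for s, m in zip(self.scale, motion)]


def encode_video(frames: List[Vector], branch: TemporalBranch) -> Vector:
    """Mean-pool frozen per-frame features, then add the branch output."""
    feats = [frozen_frame_encoder(f) for f in frames]
    pooled = [sum(column) / len(feats) for column in zip(*feats)]
    return [p + b for p, b in zip(pooled, branch(feats))]


frames = [[0.0, 1.0], [2.0, 3.0]]
branch = TemporalBranch(dim=2)
before = encode_video(frames, branch)  # identical to frozen mean pooling
branch.scale = [0.5, 0.5]              # pretend a training step updated only the branch
after = encode_video(frames, branch)
```

With the scale at zero the video feature equals the mean-pooled frozen features, so attaching the branch does not disturb the pretrained encoder; only the branch parameters change the output afterwards.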

Code

farewellthree/BT-Adapter official

Tasks

Video-based Generative Performance Benchmarking

Video-based Generative Performance Benchmarking (Consistency)

Video-based Generative Performance Benchmarking (Contextual Understanding)

Video-based Generative Performance Benchmarking (Correctness of Information)

Video-based Generative Performance Benchmarking (Detail Orientation)

Video-based Generative Performance Benchmarking (Temporal Understanding)

Video Question Answering

Video Understanding

Zero-Shot Video Question Answer

Zero-Shot Video Retrieval

Datasets

ActivityNet

MSR-VTT

DiDeMo

WebVid

LSMDC

ActivityNet-QA

MSRVTT-QA

MSVD-QA

VideoInstruct

Results from the Paper

Ranked #5 on Zero-Shot Video Retrieval on LSMDC

Methods

Adapter • CLIP
