One For All: Video Conversation is Feasible Without Video Instruction Tuning

27 Sep 2023 · Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li

Recent progress in Large Language Models (LLMs) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system remains under exploration. Given the extensive scale of the LLM and the visual backbone, minimal GPU memory is left for effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder; the branch is tuned while the backbone is kept frozen. Once pretrained, BT-Adapter can be seamlessly integrated into all image conversation models that use the same version of CLIP, enabling video conversations without the need for video instructions. In addition, we develop a unique asymmetric token masking strategy inside the branch, with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art video chatting results with video instruction tuning, outperforming previous SOTAs by a large margin.


Results from the Paper


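| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Zero-Shot Video Retrieval | ActivityNet | BT-Adapter | text-to-video R@1 | 37.0 | #7 |
| Zero-Shot Video Retrieval | ActivityNet | BT-Adapter | text-to-video R@5 | 66.7 | #6 |
| Zero-Shot Video Retrieval | ActivityNet | BT-Adapter | text-to-video R@10 | 78.9 | #6 |
| Zero-Shot Video Question Answer | ActivityNet-QA | BT-Adapter (zero-shot) | Confidence Score | 3.2 | #11 |
| Zero-Shot Video Question Answer | ActivityNet-QA | BT-Adapter (zero-shot) | Accuracy | 46.1 | #9 |
| Video Question Answering | ActivityNet-QA | BT-Adapter (zero-shot) | Accuracy | 46.1 | #14 |
| Video Question Answering | ActivityNet-QA | BT-Adapter (zero-shot) | Confidence Score | 3.6 | #1 |
| Zero-Shot Video Retrieval | DiDeMo | BT-Adapter | text-to-video R@1 | 35.6 | #13 |
| Zero-Shot Video Retrieval | DiDeMo | BT-Adapter | text-to-video R@5 | 61.9 | #10 |
| Zero-Shot Video Retrieval | DiDeMo | BT-Adapter | text-to-video R@10 | 72.6 | #10 |
| Zero-Shot Video Retrieval | LSMDC | BT-Adapter | text-to-video R@1 | 19.5 | #5 |
| Zero-Shot Video Retrieval | LSMDC | BT-Adapter | text-to-video R@5 | 35.9 | #6 |
| Zero-Shot Video Retrieval | LSMDC | BT-Adapter | text-to-video R@10 | 45.0 | #5 |
| Zero-Shot Video Retrieval | MSR-VTT | BT-Adapter | text-to-video R@1 | 40.9 | #9 |
| Zero-Shot Video Retrieval | MSR-VTT | BT-Adapter | text-to-video R@5 | 64.7 | #7 |
| Zero-Shot Video Retrieval | MSR-VTT | BT-Adapter | text-to-video R@10 | 73.5 | #7 |
| Zero-Shot Video Question Answer | MSRVTT-QA | BT-Adapter (zero-shot) | Accuracy | 51.2 | #15 |
| Zero-Shot Video Question Answer | MSRVTT-QA | BT-Adapter (zero-shot) | Confidence Score | 2.9 | #13 |
| Zero-Shot Video Question Answer | MSVD-QA | BT-Adapter (zero-shot) | Accuracy | 67.0 | #11 |
| Zero-Shot Video Question Answer | MSVD-QA | BT-Adapter (zero-shot) | Confidence Score | 3.6 | #10 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | BT-Adapter (zero-shot) | gpt-score | 2.16 | #10 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter | Correctness of Information | 2.68 | #11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter | Detail Orientation | 2.69 | #11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter | Contextual Understanding | 3.27 | #11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter | Temporal Understanding | 2.34 | #11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter | Consistency | 2.46 | #11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter | mean | 2.69 | #11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter (zero-shot) | Correctness of Information | 2.16 | #14 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter (zero-shot) | Detail Orientation | 2.46 | #14 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter (zero-shot) | Contextual Understanding | 2.89 | #12 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter (zero-shot) | Temporal Understanding | 2.13 | #12 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter (zero-shot) | Consistency | 2.2 | #14 |
| Video-based Generative Performance Benchmarking | VideoInstruct | BT-Adapter (zero-shot) | mean | 2.46 | #12 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | BT-Adapter | gpt-score | 2.34 | #6 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | BT-Adapter (zero-shot) | gpt-score | 2.13 | #8 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | BT-Adapter (zero-shot) | gpt-score | 2.46 | #10 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | BT-Adapter | gpt-score | 2.69 | #7 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | BT-Adapter | gpt-score | 3.27 | #6 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | BT-Adapter (zero-shot) | gpt-score | 2.89 | #8 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | BT-Adapter | gpt-score | 2.46 | #6 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | BT-Adapter (zero-shot) | gpt-score | 2.2 | #10 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | BT-Adapter | gpt-score | 2.68 | #7 |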
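
The text-to-video R@1, R@5 and R@10 entries above are recall-at-K retrieval metrics: the percentage of text queries whose matching video appears among the top K retrieved videos. The following is a minimal sketch of that standard definition, not the authors' evaluation code. It assumes a square similarity matrix in which video `i` is the single correct match for text `i`; the function name `recall_at_k` and the toy scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of text queries whose matching video ranks in the top k.

    Assumption: similarity[i][j] scores text i against video j, and
    video i is the only correct match for text i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(similarity) == 0:
        raise ValueError("similarity matrix must not be empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score
        # strictly higher. Ties are therefore resolved optimistically.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not taken from the paper).
toy = [
    [0.9, 0.1, 0.3],  # text 0: correct video is ranked 1st
    [0.8, 0.5, 0.2],  # text 1: correct video is ranked 2nd
    [0.7, 0.6, 0.1],  # text 2: correct video is ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy, k):.1f}")
```

On the toy matrix, one of three queries is matched at rank 1, two within the top 2, and all three within the top 3, so the recall rises with K in the same way the R@1, R@5 and R@10 values do in each retrieval row of the table.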

Methods