TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Question Answer	ActivityNet-QA	CAT-7B	Confidence Score	3.5	# 3
Zero-Shot Video Question Answer	ActivityNet-QA	CAT-7B	Accuracy	50.2	# 4
Zero-Shot Video Question Answer	MSRVTT-QA	CAT-7B	Accuracy	62.1	# 5
Zero-Shot Video Question Answer	MSRVTT-QA	CAT-7B	Confidence Score	3.5	# 2
Video-based Generative Performance Benchmarking	VideoInstruct	CAT-7B	Correctness of Information	3.08	# 4
Video-based Generative Performance Benchmarking	VideoInstruct	CAT-7B	Detail Orientation	3.1	# 2
Video-based Generative Performance Benchmarking	VideoInstruct	CAT-7B	Contextual Understanding	3.49	# 7
Video-based Generative Performance Benchmarking	VideoInstruct	CAT-7B	Temporal Understanding	2.81	# 3
Video-based Generative Performance Benchmarking	VideoInstruct	CAT-7B	Consistency	2.89	# 4
Video-based Generative Performance Benchmarking	VideoInstruct	CAT-7B	mean	3.07	# 4
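
The "mean" row above appears to be the unweighted average of the five VideoInstruct criteria. This is an assumption, since the page does not state how the mean is computed, but it agrees with the reported figures. A quick check, using the values copied from the table:

```python
# Per-criterion VideoInstruct scores for CAT-7B, copied from the table above.
scores = {
    "Correctness of Information": 3.08,
    "Detail Orientation": 3.10,
    "Contextual Understanding": 3.49,
    "Temporal Understanding": 2.81,
    "Consistency": 2.89,
}

# Assumption: the reported "mean" is a plain arithmetic mean of the five criteria.
mean_score = sum(scores.values()) / len(scores)
print(round(mean_score, 2))  # 3.07, matching the reported mean
```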

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cat-enhancing-multimodal-large-language-model/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=cat-enhancing-multimodal-large-language-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cat-enhancing-multimodal-large-language-model/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=cat-enhancing-multimodal-large-language-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cat-enhancing-multimodal-large-language-model/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=cat-enhancing-multimodal-large-language-model)`

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

7 Mar 2024 · Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, Xiaochun Cao

This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, their responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways: 1) Besides straightforwardly bridging audio and video, we design a clue aggregator that aggregates question-related clues in dynamic audio-visual scenarios to enrich the detailed knowledge required by large language models. 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct to further enhance the capacity of CAT to model cross-semantic correlations. 3) We propose AI-assisted ambiguity-aware direct preference optimization, a strategy specialized in retraining the model to favor non-ambiguous responses and improve its ability to localize specific audio-visual objects. Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, especially Audio-Visual Question Answering (AVQA). The code and the collected instructions are released at https://github.com/rikeilong/Bay-CAT.
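
The abstract does not spell out the ambiguity-aware objective, so the following is only a minimal sketch of the standard direct preference optimization (DPO) loss for a single preference pair, not the authors' implementation. It assumes the non-ambiguous response plays the role of the "chosen" answer and the ambiguous one the "rejected" answer. The function name, argument names, and the `beta` default are all illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model. Assumption for this
    sketch: "chosen" is the non-ambiguous response, "rejected" the ambiguous one.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# The loss is lower when the policy shifts probability toward the chosen response.
prefers_chosen = dpo_loss(-4.0, -6.0, -5.0, -5.0)
prefers_rejected = dpo_loss(-6.0, -4.0, -5.0, -5.0)
print(prefers_chosen < prefers_rejected)  # True
```

When the policy and reference assign identical log-probabilities, the loss equals log 2, which is the starting point before any preference has been learned.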

Code

rikeilong/bay-cat official

Tasks

Audio-visual Question Answering

Audio-Visual Question Answering (AVQA)

Language Modelling

Large Language Model

Question Answering

Video-based Generative Performance Benchmarking

Visual Question Answering

Zero-Shot Video Question Answer

Datasets

Visual Question Answering

ActivityNet-QA MSRVTT-QA

MUSIC-AVQA VideoInstruct

Results from the Paper

Ranked #4 on Video-based Generative Performance Benchmarking on VideoInstruct
