CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, their responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways: 1) Besides directly bridging audio and video, we design a clue aggregator that aggregates question-related clues in dynamic audio-visual scenarios to enrich the detailed knowledge required by large language models. 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct to further enhance CAT's capacity to model cross-semantic correlations. 3) We propose AI-assisted ambiguity-aware direct preference optimization, a strategy specialized in retraining the model to favor non-ambiguous responses and improve its ability to localize specific audio-visual objects. Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, especially Audio-Visual Question Answering (AVQA) tasks. The code and the collected instructions are released at https://github.com/rikeilong/Bay-CAT.
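
The abstract names direct preference optimization (DPO) as the basis of the third contribution but does not spell out the objective. As a rough illustration only, the sketch below implements the standard DPO loss for a single preference pair. It is not the paper's ambiguity-aware variant: the function name `dpo_loss`, the `beta` default, and the mapping of "chosen" to a non-ambiguous response and "rejected" to an ambiguous one are assumptions made here for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    the policy being trained or the frozen reference model. Here "chosen"
    is assumed to be the non-ambiguous response and "rejected" the
    ambiguous one.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy that prefers the non-ambiguous response gets a lower loss.
good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
print(good < bad)
```

When the policy and reference agree on both responses, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred response relative to the reference.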
Results from the Paper
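Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Zero-Shot Video Question Answer | ActivityNet-QA | CAT-7B | Confidence Score | 3.5 | # 3 |
Zero-Shot Video Question Answer | ActivityNet-QA | CAT-7B | Accuracy | 50.2 | # 4 |
Zero-Shot Video Question Answer | MSRVTT-QA | CAT-7B | Accuracy | 62.1 | # 5 |
Zero-Shot Video Question Answer | MSRVTT-QA | CAT-7B | Confidence Score | 3.5 | # 2 |
Video-based Generative Performance Benchmarking | VideoInstruct | CAT-7B | Correctness of Information | 3.08 | # 4 |
Video-based Generative Performance Benchmarking | VideoInstruct | CAT-7B | Detail Orientation | 3.1 | # 2 |
Video-based Generative Performance Benchmarking | VideoInstruct | CAT-7B | Contextual Understanding | 3.49 | # 7 |
Video-based Generative Performance Benchmarking | VideoInstruct | CAT-7B | Temporal Understanding | 2.81 | # 3 |
Video-based Generative Performance Benchmarking | VideoInstruct | CAT-7B | Consistency | 2.89 | # 4 |
Video-based Generative Performance Benchmarking | VideoInstruct | CAT-7B | mean | 3.07 | # 4 |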
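
The reported VideoInstruct mean appears to be the plain average of the five per-aspect scores listed for CAT-7B. The short check below recomputes it from the table values; the dictionary name `videoinstruct_scores` is just an illustrative choice, and the unweighted average is an assumption about how the mean was derived.

```python
# Per-aspect VideoInstruct scores for CAT-7B, copied from the results table.
videoinstruct_scores = {
    "Correctness of Information": 3.08,
    "Detail Orientation": 3.1,
    "Contextual Understanding": 3.49,
    "Temporal Understanding": 2.81,
    "Consistency": 2.89,
}

# Unweighted average across the five aspects (assumed aggregation).
mean_score = sum(videoinstruct_scores.values()) / len(videoinstruct_scores)
print(f"{mean_score:.2f}")  # prints 3.07
```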