Audio-visual Question Answering
11 papers with code • 1 benchmark • 1 dataset
Most implemented papers
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or the spherical spatial properties of surroundings.
Learning to Answer Questions in Dynamic Audio-Visual Scenarios
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Vision Transformers are Parameter-Efficient Audio-Visual Learners
To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT.
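The adapter idea described here — keeping the backbone frozen and training only small modules added at each layer — can be sketched as a bottleneck residual block. The sketch below is illustrative only (layer shapes, names, and the stand-in "frozen block" are assumptions for demonstration, not the LAVISH implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_block(x, W):
    # Stand-in for one frozen ViT block: W is pretrained and never updated.
    return np.tanh(x @ W)

def adapter(x, W_down, W_up):
    # Bottleneck adapter: project down to a small rank, apply a
    # nonlinearity, project back up, and add as a residual.
    # Only W_down and W_up would be trained.
    return x + np.maximum(0.0, x @ W_down) @ W_up

d, r = 8, 2                                # hidden size vs. tiny bottleneck
x = rng.normal(size=(1, d))                # a token's feature vector
W_frozen = rng.normal(size=(d, d))         # frozen backbone weights
W_down = rng.normal(size=(d, r)) * 0.01    # trainable adapter weights
W_up = rng.normal(size=(r, d)) * 0.01

y = adapter(frozen_block(x, W_frozen), W_down, W_up)
print(y.shape)  # (1, 8)
```

The trainable parameter count here is `d*r + r*d`, far smaller than the `d*d` of the frozen block, which is the point of parameter-efficient adaptation.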
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios
Recent works rely on elaborate target-agnostic parsing of audio-visual scenes for spatial grounding while mistreating audio and video as separate entities for temporal grounding.
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).
Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which could be unrelated to the given questions or even interfere with answering about the content of interest.
Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering
These selected pairs are constrained to have larger similarity values than the mismatched pairs.
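A constraint of this kind — matched pairs must score higher similarity than mismatched pairs — resembles a standard margin ranking objective. A minimal sketch, where the margin value, cosine similarity, and toy feature vectors are assumptions for illustration rather than the paper's exact loss:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranking_loss(sim_pos, sim_neg, margin=0.2):
    # Zero loss only when the matched (positive) pair beats the
    # mismatched (negative) pair by at least `margin`.
    return max(0.0, margin - (sim_pos - sim_neg))

anchor   = np.array([1.0, 0.0])   # e.g. a question/object embedding
positive = np.array([0.9, 0.1])   # matched audio-visual feature
negative = np.array([0.0, 1.0])   # mismatched feature

loss = ranking_loss(cosine_sim(anchor, positive),
                    cosine_sim(anchor, negative))
print(loss)  # 0.0 — the constraint is already satisfied here
```

When the constraint is violated, e.g. `ranking_loss(0.1, 0.3)`, the loss is positive (0.4), pushing selected pairs above mismatched ones during training.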
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components.