Audio-visual Question Answering

11 papers with code • 1 benchmark • 1 dataset

Audio-visual question answering (AVQA) aims to answer natural-language questions about videos by jointly reasoning over their visual objects, sounds, and the associations between them.

Most implemented papers

Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos

hs-yn/panoavqa ICCV 2021

However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or the spherical spatial properties of the surroundings.

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

GeWu-Lab/MUSIC-AVQA CVPR 2022

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
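
A minimal sketch of the general AVQA setup described above, assuming precomputed audio, visual, and question features and a classifier over a fixed answer vocabulary; all module names and dimensions below are illustrative assumptions, not the MUSIC-AVQA architecture.

```python
import torch
import torch.nn as nn

class SimpleAVQA(nn.Module):
    """Toy audio-visual question answering head: fuse audio, visual, and
    question features, then classify over a fixed answer vocabulary.
    Purely illustrative; not the paper's implementation."""

    def __init__(self, dim=512, num_answers=42):
        super().__init__()
        self.audio_proj = nn.Linear(128, dim)      # e.g. VGGish-style audio features
        self.video_proj = nn.Linear(2048, dim)     # e.g. ResNet frame features
        self.question_proj = nn.Linear(768, dim)   # e.g. a sentence embedding
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, audio_feats, video_feats, question_feat):
        # audio_feats: (B, Ta, 128), video_feats: (B, Tv, 2048), question_feat: (B, 768)
        a = self.audio_proj(audio_feats).mean(dim=1)   # temporal average pooling
        v = self.video_proj(video_feats).mean(dim=1)
        q = self.question_proj(question_feat)
        return self.classifier(self.fuse(torch.cat([a, v, q], dim=-1)))
```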

Vision Transformers are Parameter-Efficient Audio-Visual Learners

GenjiB/LAVISH CVPR 2023

To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT.
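
A rough sketch of the adapter idea in its generic form: a small trainable bottleneck module wrapped around each block of a frozen transformer, so only a few parameters are updated. The class and parameter names here are assumptions for illustration, not the LAVISH code.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable bottleneck added as a residual around frozen features."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start close to an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps one frozen ViT block with a trainable adapter."""
    def __init__(self, frozen_block, dim=768):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False      # keep the pretrained ViT frozen
        self.adapter = BottleneckAdapter(dim)

    def forward(self, tokens):
        return self.adapter(self.block(tokens))
```

During training, only the adapter weights receive gradients, which keeps the trainable-parameter budget small relative to the frozen backbone.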

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

TXH-mercury/VALOR 17 Apr 2023

Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.

Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios

Bravo5542/TJSTG 21 May 2023

Recent works rely on elaborate target-agnostic parsing of audio-visual scenes for spatial grounding while mistreating audio and video as separate entities for temporal grounding.

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

txh-mercury/vast NeurIPS 2023

Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).

Progressive Spatio-temporal Perception for Audio-Visual Question Answering

gewu-lab/pstp-net 10 Aug 2023

Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which could be unrelated to the given questions or even act as interference in answering the content of interest.

Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering

zhangbin-ai/apl 20 Dec 2023

These selected pairs are constrained to have larger similarity values than the mismatched pairs.
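
The constraint that selected (matched) pairs score higher than mismatched pairs can be expressed generically as a margin ranking loss over pairwise similarities; the snippet below is a hedged illustration of that idea under assumed inputs, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def positivity_margin_loss(pos_a, pos_b, neg_a, neg_b, margin=0.2):
    """Encourage matched (positive) pairs to have larger cosine similarity
    than mismatched (negative) pairs by at least `margin`.
    Illustrative only; the actual APL objective may differ."""
    pos_sim = F.cosine_similarity(pos_a, pos_b, dim=-1)   # (B,)
    neg_sim = F.cosine_similarity(neg_a, neg_b, dim=-1)   # (B,)
    return F.relu(margin + neg_sim - pos_sim).mean()
```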

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

rikeilong/bay-cat 7 Mar 2024

This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components.