Audio-visual Question Answering
11 papers with code • 1 benchmark • 1 dataset
Most implemented papers
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or the spherical spatial properties of surroundings.
Learning to Answer Questions in Dynamic Audio-Visual Scenarios
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Vision Transformers are Parameter-Efficient Audio-Visual Learners
To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT.
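The adapter idea described here — keeping the backbone frozen and training only small modules added at each layer — can be sketched as a bottleneck residual block. The sketch below is illustrative only (layer shapes, names, and the stand-in "frozen block" are assumptions for demonstration, not the LAVISH implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_block(x, W):
    # Stand-in for one frozen ViT block: W is pretrained and never updated.
    return np.tanh(x @ W)

def adapter(x, W_down, W_up):
    # Bottleneck adapter: project down to a small rank, apply a
    # nonlinearity, project back up, and add as a residual.
    # Only W_down and W_up would be trained.
    return x + np.maximum(0.0, x @ W_down) @ W_up

d, r = 8, 2                                # hidden size vs. tiny bottleneck
x = rng.normal(size=(1, d))                # a token's feature vector
W_frozen = rng.normal(size=(d, d))         # frozen backbone weights
W_down = rng.normal(size=(d, r)) * 0.01    # trainable adapter weights
W_up = rng.normal(size=(r, d)) * 0.01

y = adapter(frozen_block(x, W_frozen), W_down, W_up)
print(y.shape)  # (1, 8)
```

The trainable parameter count here is `d*r + r*d`, far smaller than the `d*d` of the frozen block, which is the point of parameter-efficient adaptation.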
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios
Recent works rely on elaborate target-agnostic parsing of audio-visual scenes for spatial grounding while mistreating audio and video as separate entities for temporal grounding.
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).
Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which could be unrelated to the given questions or even interfere with answering about the content of interest.
Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering
These selected pairs are constrained to have larger similarity values than the mismatched pairs.
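A constraint of this kind — matched pairs must score higher similarity than mismatched pairs — resembles a standard margin ranking objective. A minimal sketch, where the margin value, cosine similarity, and toy feature vectors are assumptions for illustration rather than the paper's exact loss:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranking_loss(sim_pos, sim_neg, margin=0.2):
    # Zero loss only when the matched (positive) pair beats the
    # mismatched (negative) pair by at least `margin`.
    return max(0.0, margin - (sim_pos - sim_neg))

anchor   = np.array([1.0, 0.0])   # e.g. a question/object embedding
positive = np.array([0.9, 0.1])   # matched audio-visual feature
negative = np.array([0.0, 1.0])   # mismatched feature

loss = ranking_loss(cosine_sim(anchor, positive),
                    cosine_sim(anchor, negative))
print(loss)  # 0.0 — the constraint is already satisfied here
```

When the constraint is violated, e.g. `ranking_loss(0.1, 0.3)`, the loss is positive (0.4), pushing selected pairs above mismatched ones during training.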
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components.