Audio-Visual Question Answering (AVQA)

9 papers with code • 1 benchmark • 1 dataset

Audio-Visual Question Answering (AVQA) aims to answer natural-language questions about the visual objects, sounds, and their associations in videos, which requires jointly reasoning over both the visual and auditory streams.
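
As a rough illustration of the task's inputs and outputs (all field names and paths below are hypothetical, not tied to any specific benchmark), an AVQA example pairs a video's frames and audio track with a question and its answer:

```python
from dataclasses import dataclass

@dataclass
class AVQASample:
    """One hypothetical AVQA example: a model must ground the
    question in both the visual frames and the audio track."""
    video_frames_path: str   # e.g. directory of sampled RGB frames
    audio_path: str          # e.g. the video's audio waveform
    question: str            # natural-language question about the scene
    answer: str              # typically classification over a fixed answer set

sample = AVQASample(
    video_frames_path="videos/00001/frames/",
    audio_path="videos/00001/audio.wav",
    question="Which instrument makes the sound first?",
    answer="piano",
)
```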

Datasets

MUSIC-AVQA (introduced in "Learning to Answer Questions in Dynamic Audio-Visual Scenarios", CVPR 2022)

Most implemented papers

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

OFA-Sys/ONE-PEACE 18 May 2023

In this work, we explore a scalable way for building a general representation model toward unlimited modalities.

Hierarchical Conditional Relation Networks for Video Question Answering

thaolmk54/hcrn-videoqa CVPR 2020

Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts.

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

GeWu-Lab/MUSIC-AVQA CVPR 2022

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

TXH-mercury/VALOR 17 Apr 2023

Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
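
To make "jointly models relationships of vision, audio and language" concrete, here is a minimal PyTorch sketch of tri-modal fusion (the module, feature dimensions, and attention-based fusion are illustrative assumptions, not VALOR's actual architecture):

```python
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    """Toy tri-modal encoder: project vision, audio, and text features
    into one shared space, then fuse them with self-attention."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.vision_proj = nn.Linear(2048, dim)  # e.g. CNN frame features
        self.audio_proj = nn.Linear(128, dim)    # e.g. spectrogram features
        self.text_proj = nn.Linear(768, dim)     # e.g. BERT token features
        self.fusion = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True
        )

    def forward(self, vision, audio, text):
        # Each input is (batch, seq_len, feat_dim); concatenate the projected
        # tokens of all three modalities along the sequence dimension.
        tokens = torch.cat([
            self.vision_proj(vision),
            self.audio_proj(audio),
            self.text_proj(text),
        ], dim=1)
        return self.fusion(tokens)  # (batch, total_seq_len, dim)

# Example: 2 videos, 8 frames, 4 audio windows, 16 question tokens.
fused = TriModalFusion()(
    torch.randn(2, 8, 2048), torch.randn(2, 4, 128), torch.randn(2, 16, 768)
)
```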

Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios

Bravo5542/TJSTG 21 May 2023

Recent works rely on elaborate, target-agnostic parsing of audio-visual scenes for spatial grounding, while incorrectly treating audio and video as separate entities for temporal grounding.

Progressive Spatio-temporal Perception for Audio-Visual Question Answering

gewu-lab/pstp-net 10 Aug 2023

Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which may be unrelated to the given question or may even interfere with answering the content of interest.
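
As a minimal sketch of this idea of filtering question-irrelevant content (a generic top-k selection under assumed shapes, not PSTP-Net's actual method), one can score each temporal segment against the question embedding and keep only the most relevant ones:

```python
import torch

def select_relevant_segments(segment_feats, question_feat, k=3):
    """Keep the k audio-visual segments most similar to the question.
    segment_feats: (num_segments, dim); question_feat: (dim,)."""
    sims = torch.nn.functional.cosine_similarity(
        segment_feats, question_feat.unsqueeze(0), dim=-1
    )                                    # (num_segments,)
    topk = sims.topk(k).indices          # indices of the most relevant segments
    return segment_feats[topk], topk

segments = torch.randn(10, 256)   # 10 segments of fused audio-visual features
question = torch.randn(256)       # question embedding in the same space
selected, idx = select_relevant_segments(segments, question)
```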

Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering

zhangbin-ai/apl 20 Dec 2023

These selected pairs are constrained to have larger similarity values than the mismatched pairs.
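
A common way to implement such a constraint is a margin ranking loss, sketched below (a generic formulation with assumed names, not necessarily the paper's exact objective): a selected positive pair's similarity must exceed a mismatched pair's by a margin:

```python
import torch
import torch.nn.functional as F

def positivity_margin_loss(pos_sim, neg_sim, margin=0.2):
    """Penalize cases where a mismatched pair scores within `margin`
    of a selected positive pair: loss = max(0, margin - (pos - neg))."""
    return F.relu(margin - (pos_sim - neg_sim)).mean()

pos = torch.tensor([0.8, 0.6])   # similarities of selected question-object pairs
neg = torch.tensor([0.3, 0.7])   # similarities of mismatched pairs
loss = positivity_margin_loss(pos, neg)   # only the second pair violates the margin
```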

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

rikeilong/bay-cat 7 Mar 2024

This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components.

Answering Diverse Questions via Text Attached with Key Audio-Visual Clues

rikeilong/mcd-foravqa 11 Mar 2024

Audio-visual question answering (AVQA) requires referencing both video content and auditory information, then correlating them with the question to predict the most precise answer.