PACS: A Dataset for Physical Audiovisual CommonSense Reasoning

21 Mar 2022  ·  Samuel Yu, Peter Wu, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency

In order for AI to be safely deployed in real-world scenarios such as hospitals, schools, and the workplace, it must be able to robustly reason about the physical world. Fundamental to this reasoning is physical common sense: understanding the physical properties and affordances of available objects, how they can be manipulated, and how they interact with other objects. Physical commonsense reasoning is fundamentally a multi-sensory task, since physical properties are manifested through multiple modalities - two of them being vision and acoustics. Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. Our dataset provides new opportunities to advance the research field of physical reasoning by bringing audio as a core component of this multimodal problem. Using PACS, we evaluate multiple state-of-the-art models on our new challenging task. While some models show promising results (70% accuracy), they all fall short of human performance (95% accuracy). We conclude the paper by demonstrating the importance of multimodal reasoning and providing possible avenues for future research.


Datasets


Introduced in the Paper:

Physical Audiovisual CommonSense

Task: Physical Commonsense Reasoning
Dataset: Physical Audiovisual CommonSense

Model                    Metric Name             Metric Value   Global Rank
Human                    With Audio (Acc %)      96.3 ± 2.1     # 1
Human                    Without Audio (Acc %)   90.5 ± 3.1     # 1
Late Fusion              With Audio (Acc %)      55.0 ± 1.1     # 4
Late Fusion              Without Audio (Acc %)   52.5 ± 1.6     # 5
CLIP/AudioCLIP           With Audio (Acc %)      60.0 ± 0.9     # 3
CLIP/AudioCLIP           Without Audio (Acc %)   56.3 ± 0.7     # 4
UNITER (Large)           Without Audio (Acc %)   60.6 ± 2.2     # 3
Merlot Reserve (Large)   With Audio (Acc %)      70.1 ± 1.0     # 2
Merlot Reserve (Large)   Without Audio (Acc %)   68.4 ± 0.7     # 2
Majority                 With Audio (Acc %)      50.4           # 5
Majority                 Without Audio (Acc %)   50.4           # 6
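
The reported accuracies can be compared directly to see how much each model gains from audio and how far it remains from human performance. The following is a minimal sketch, not part of the paper: the dictionary is transcribed from the results listed above, and the names `RESULTS`, `audio_gain`, and `gap_to_human` are hypothetical helpers introduced for illustration. It uses only the mean accuracies and does not propagate the ± values.

```python
# Mean accuracies (%) transcribed from the results listed above.
# None marks a setting for which no value is reported.
RESULTS = {
    "Human": {"with_audio": 96.3, "without_audio": 90.5},
    "Late Fusion": {"with_audio": 55.0, "without_audio": 52.5},
    "CLIP/AudioCLIP": {"with_audio": 60.0, "without_audio": 56.3},
    "UNITER (Large)": {"with_audio": None, "without_audio": 60.6},
    "Merlot Reserve (Large)": {"with_audio": 70.1, "without_audio": 68.4},
    "Majority": {"with_audio": 50.4, "without_audio": 50.4},
}


def audio_gain(model):
    """Accuracy gain (percentage points) from adding audio, or None if unreported."""
    row = RESULTS[model]
    if row["with_audio"] is None or row["without_audio"] is None:
        return None
    return round(row["with_audio"] - row["without_audio"], 1)


def gap_to_human(model, setting="with_audio"):
    """Difference (percentage points) between human and model accuracy in one setting."""
    value = RESULTS[model][setting]
    if value is None:
        return None
    return round(RESULTS["Human"][setting] - value, 1)


if __name__ == "__main__":
    for model in RESULTS:
        print(model, "| audio gain:", audio_gain(model),
              "| gap to human (with audio):", gap_to_human(model))
```

Under these figures, every model with both settings reported scores higher with audio than without, and the best-performing model still trails the human with-audio accuracy by more than 25 points. Because the standard deviations are ignored, small differences should be read with caution.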
