TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	Human	With Audio (Acc %)	96.3 ± 2.1	# 1
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	Human	Without Audio (Acc %)	90.5 ± 3.1	# 1
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	Late Fusion	With Audio (Acc %)	55.0 ± 1.1	# 4
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	Late Fusion	Without Audio (Acc %)	52.5 ± 1.6	# 5
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	CLIP/AudioCLIP	With Audio (Acc %)	60.0 ± 0.9	# 3
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	CLIP/AudioCLIP	Without Audio (Acc %)	56.3 ± 0.7	# 4
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	UNITER (Large)	Without Audio (Acc %)	60.6 ± 2.2	# 3
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	Merlot Reserve (Large)	With Audio (Acc %)	70.1 ± 1.0	# 2
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	Merlot Reserve (Large)	Without Audio (Acc %)	68.4 ± 0.7	# 2
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	Majority	With Audio (Acc %)	50.4	# 5
Physical Commonsense Reasoning	Physical Audiovisual CommonSense	Majority	Without Audio (Acc %)	50.4	# 6
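
The table reports each model with and without audio, so two derived quantities are easy to read off: how many accuracy points audio adds, and how far each model remains from human performance. The following is a minimal sketch of that arithmetic, not part of the paper or its code release. The names `RESULTS`, `audio_gain`, and `gap_to_human` are hypothetical, the values are copied from the rows above, and the ± spreads are ignored, so small differences should be read with caution.

```python
# Accuracy values (%) copied from the results table: (with audio, without audio).
# UNITER (Large) is left out because the table lists no with-audio result for it.
RESULTS = {
    "Human": (96.3, 90.5),
    "Late Fusion": (55.0, 52.5),
    "CLIP/AudioCLIP": (60.0, 56.3),
    "Merlot Reserve (Large)": (70.1, 68.4),
    "Majority": (50.4, 50.4),
}


def audio_gain(results):
    """Return {model: with-audio minus without-audio accuracy}, in points."""
    return {name: round(w - wo, 1) for name, (w, wo) in results.items()}


def gap_to_human(results, human="Human"):
    """Return {model: human with-audio accuracy minus model with-audio accuracy}."""
    human_acc = results[human][0]
    return {
        name: round(human_acc - w, 1)
        for name, (w, _) in results.items()
        if name != human
    }


gains = audio_gain(RESULTS)
gaps = gap_to_human(RESULTS)

for name in RESULTS:
    gap = gaps.get(name)  # None for the human row itself
    gap_text = "n/a" if gap is None else f"{gap:.1f}"
    print(f"{name:24s} audio gain: {gains[name]:4.1f}  gap to human: {gap_text}")
```

On these point estimates, the strongest listed model (Merlot Reserve, Large) gains 1.7 points from audio and trails humans by 26.2 points in the with-audio setting, while humans themselves gain 5.8 points from audio.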

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pacs-a-dataset-for-physical-audiovisual/physical-commonsense-reasoning-on-physical)](https://paperswithcode.com/sota/physical-commonsense-reasoning-on-physical?p=pacs-a-dataset-for-physical-audiovisual)`

PACS: A Dataset for Physical Audiovisual CommonSense Reasoning

21 Mar 2022 · Samuel Yu, Peter Wu, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency

For AI to be safely deployed in real-world settings such as hospitals, schools, and the workplace, it must be able to reason robustly about the physical world. Fundamental to this reasoning is physical common sense: understanding the physical properties and affordances of available objects, how they can be manipulated, and how they interact with other objects. Physical commonsense reasoning is fundamentally a multi-sensory task, since physical properties are manifested through multiple modalities, two of which are vision and acoustics. Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. Our dataset provides new opportunities to advance research on physical reasoning by making audio a core component of this multimodal problem. Using PACS, we evaluate multiple state-of-the-art models on our new, challenging task. While some models show promising results (70% accuracy), they all fall short of human performance (95% accuracy). We conclude by demonstrating the importance of multimodal reasoning and outlining possible avenues for future research.

Code

samuelyu2002/pacs official

Tasks

Common Sense Reasoning

Multimodal Reasoning

Physical Commonsense Reasoning

Datasets

Introduced in the Paper:

Physical Audiovisual CommonSense

Results from the Paper

Ranked #1 on Physical Commonsense Reasoning on Physical Audiovisual CommonSense

