Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA

24 Feb 2024 · Wentao Mo, Yang Liu

In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and the limited diversity of visual content hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are used in the ScanQA and SQA datasets). Current approaches resort to supplementing 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing fine-grained vision-language correlations. To overcome these limitations, our approach uses a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines the 2D and 3D modalities and captures fine-grained correlations between them, allowing each modality to augment the other. Integrating the mechanisms above, we present BridgeQA, which offers a fresh perspective on multi-modal transformer-based architectures for 3D-VQA. Experiments validate that BridgeQA achieves state-of-the-art results on 3D-VQA datasets and significantly outperforms existing solutions. Code is available at $\href{https://github.com/matthewdm0816/BridgeQA}{\text{this URL}}$.
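The abstract names two mechanisms: question-conditional 2D view selection and a two-branch (Twin-Transformer) fusion of 2D and 3D features. Below is a minimal, self-contained PyTorch sketch of how such a pipeline could be wired up; the module names, dimensionalities, similarity-based view scoring, and concatenation-based fusion are illustrative assumptions, not the paper's actual implementation (see the linked repository for that).

```python
# Hypothetical sketch of a BridgeQA-style pipeline. Not the authors' code:
# names, shapes, and the scoring/fusion details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def select_views(view_feats: torch.Tensor, question_feat: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Question-conditional 2D view selection (assumed form).

    view_feats:    (num_views, d) -- e.g. image embeddings of candidate 2D views
    question_feat: (d,)           -- e.g. text embedding of the question
    Returns the k views whose embeddings are most similar to the question.
    """
    scores = F.cosine_similarity(view_feats, question_feat.unsqueeze(0), dim=-1)
    top_idx = scores.topk(k).indices
    return view_feats[top_idx]


class TwinBranchFusion(nn.Module):
    """Two-branch Transformer: one branch over 3D object tokens, one over the
    selected 2D view tokens; both are conditioned on the question, and their
    pooled outputs are fused for answer classification (assumed design)."""

    def __init__(self, d_model: int = 256, num_answers: int = 100):
        super().__init__()
        self.branch_3d = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.branch_2d = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.classifier = nn.Linear(2 * d_model, num_answers)

    def forward(self, q_tokens, obj_tokens, view_tokens):
        # Prepend question tokens to each modality so self-attention can capture
        # fine-grained question-vision correlations within each branch.
        h3d = self.branch_3d(torch.cat([q_tokens, obj_tokens], dim=1))
        h2d = self.branch_2d(torch.cat([q_tokens, view_tokens], dim=1))
        fused = torch.cat([h3d.mean(dim=1), h2d.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    d = 256
    views = select_views(torch.randn(12, d), torch.randn(d), k=2)  # pick 2 of 12 candidate views
    model = TwinBranchFusion(d_model=d, num_answers=100)
    logits = model(torch.randn(1, 8, d),     # question tokens
                   torch.randn(1, 16, d),    # 3D object proposal tokens
                   views.unsqueeze(0))       # selected 2D view tokens
    print(logits.shape)  # torch.Size([1, 100])
```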


Datasets


ScanQA, SQA3D
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| 3D Question Answering (3D-QA) | ScanQA Test w/ objects | BridgeQA | Exact Match | 31.29 | #1 |
| | | | BLEU-1 | 34.49 | #4 |
| | | | BLEU-4 | 24.06 | #1 |
| | | | ROUGE | 43.26 | #1 |
| | | | METEOR | 16.51 | #2 |
| | | | CIDEr | 83.75 | #1 |
