TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Retrieval	OK-VQA	FLMR	Recall@5	89.32	# 1
Visual Question Answering (VQA)	OK-VQA	RA-VQA-v2 (T5-large)	Accuracy	54.85	# 13
Visual Question Answering (VQA)	OK-VQA	RA-VQA-v2 (BLIP 2)	Accuracy	62.08	# 6
Visual Question Answering (VQA)	OK-VQA	RA-VQA-v2 (BLIP 2)	Exact Match (EM)	62.01	# 1
Visual Question Answering (VQA)	OK-VQA	RA-VQA-v2 (BLIP 2)	Recall@5	89.32	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/fine-grained-late-interaction-multi-modal-1/retrieval-on-ok-vqa)](https://paperswithcode.com/sota/retrieval-on-ok-vqa?p=fine-grained-late-interaction-multi-modal-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/fine-grained-late-interaction-multi-modal-1/visual-question-answering-on-ok-vqa)](https://paperswithcode.com/sota/visual-question-answering-on-ok-vqa?p=fine-grained-late-interaction-multi-modal-1)`

Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering

NeurIPS 2023 · Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, Bill Byrne

Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transforms, using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained relevance between queries and documents. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve a VQA score of ~61% on the OK-VQA dataset.
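
The contrast the abstract draws between one-dimensional embeddings and multi-dimensional embeddings can be illustrated with a toy sketch. The abstract does not give the scoring function, so the late-interaction score below (sum over query token embeddings of the maximum inner product with any document token embedding) is an assumption based on the usual meaning of "late interaction". Mean pooling as the single-vector baseline, the function names, and the 2-d embeddings are all hypothetical choices made for illustration, not the paper's implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(tokens: List[Vector]) -> List[float]:
    """Collapse a list of token embeddings into one vector (illustrative baseline)."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def single_vector_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """One embedding per query and per document, compared with a single inner product."""
    return dot(mean_pool(query_tokens), mean_pool(doc_tokens))


def late_interaction_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Assumed late-interaction form: each query token embedding is matched to its
    most similar document token embedding, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Made-up embeddings: doc_a matches each query token exactly,
# doc_b only matches the query "on average".
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[0.5, 0.5], [0.5, 0.5]]

single_a = single_vector_score(query_tokens, doc_a)   # 0.5
single_b = single_vector_score(query_tokens, doc_b)   # 0.5
late_a = late_interaction_score(query_tokens, doc_a)  # 2.0
late_b = late_interaction_score(query_tokens, doc_b)  # 1.0
```

In this toy case the pooled single-vector scores cannot separate the two documents, while the token-level score ranks `doc_a` higher. That is the kind of finer-grained relevance the abstract refers to.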

Code

linweizhedragon/retrieval-augmented… official

106

Tasks

Passage Retrieval

Question Answering

Retrieval

Visual Question Answering

Visual Question Answering (VQA)

Datasets

OK-VQA

InfoSeek

Results from the Paper

Ranked #1 on Retrieval on OK-VQA
