Improving Image Recognition by Retrieving from Web-Scale Image-Text Data

CVPR 2023 · Ahmet Iscen, Alireza Fathi, Cordelia Schmid

Retrieval-augmented models are becoming increasingly popular for computer vision tasks following their recent success in NLP. The goal is to enhance the recognition capabilities of the model by retrieving examples similar to the visual input from an external memory set. In this work, we introduce an attention-based memory module that learns the importance of each retrieved example from the memory. Compared to existing approaches, our method removes the influence of irrelevant retrieved examples and retains those that are beneficial to the input query. We also thoroughly study various ways of constructing the memory dataset. Our experiments show the benefit of using a massive-scale memory dataset of 1B image-text pairs and demonstrate the performance of different memory representations. We evaluate our method on three different classification tasks, namely long-tailed recognition, learning with noisy labels, and fine-grained classification, and show that it achieves state-of-the-art accuracies on the ImageNet-LT, Places-LT, and WebVision datasets.
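To make the idea concrete, below is a minimal sketch of how an attention-based memory module could weight retrieved examples so that irrelevant neighbors contribute little to the refined representation. It assumes the k nearest memory embeddings have already been retrieved for each query; all names, dimensions, and the residual fusion step are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentionMemoryModule(nn.Module):
    """Hypothetical attention over retrieved memory embeddings (sketch)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)   # projects the input query embedding
        self.to_k = nn.Linear(dim, dim)   # projects retrieved memory embeddings (keys)
        self.to_v = nn.Linear(dim, dim)   # projects retrieved memory embeddings (values)
        self.scale = dim ** -0.5

    def forward(self, query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # query:  (batch, dim)     embedding of the input image
        # memory: (batch, k, dim)  embeddings of the k retrieved examples
        q = self.to_q(query).unsqueeze(1)               # (batch, 1, dim)
        k = self.to_k(memory)                           # (batch, k, dim)
        v = self.to_v(memory)                           # (batch, k, dim)
        attn = (q @ k.transpose(1, 2)) * self.scale     # (batch, 1, k)
        attn = attn.softmax(dim=-1)                     # learned importance per neighbor
        refined = (attn @ v).squeeze(1)                 # (batch, dim) weighted memory summary
        # Residual fusion with the query is an assumption; the paper may combine them differently.
        return query + refined

# Usage: refine a batch of 4 query embeddings with their 16 retrieved memory items.
module = AttentionMemoryModule(dim=768)
queries = torch.randn(4, 768)
retrieved = torch.randn(4, 16, 768)
out = module(queries, retrieved)   # (4, 768)
```

Because the attention weights are normalized over the retrieved set, a neighbor that matches the query poorly receives a near-zero weight, which is one way to realize the "remove the influence of irrelevant retrieved examples" behavior described above.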


Results from the Paper


Ranked #1 on Image Classification on WebVision-1000 (using extra training data).
Task                   Dataset          Model            Metric           Value   Global Rank
Long-tail Learning     ImageNet-LT      MAM (ViT-B/16)   Top-1 Accuracy   82.3    #2
Long-tail Learning     Places-LT        MAM (ViT-B/16)   Top-1 Accuracy   51.4    #2
Image Classification   WebVision-1000   MAM (ViT-B/16)   Top-1 Accuracy   83.6    #1
