Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Open Vocabulary Object Detection	LVIS v1.0	CLIM (RN50x64)	AP novel-LVIS base training	32.3	# 6
Open Vocabulary Object Detection	MSCOCO	CLIM (RN50)	AP 0.5	36.9	# 12

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/clim-contrastive-language-image-mosaic-for/open-vocabulary-object-detection-on-lvis-v1-0)](https://paperswithcode.com/sota/open-vocabulary-object-detection-on-lvis-v1-0?p=clim-contrastive-language-image-mosaic-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/clim-contrastive-language-image-mosaic-for/open-vocabulary-object-detection-on-mscoco)](https://paperswithcode.com/sota/open-vocabulary-object-detection-on-mscoco?p=clim-contrastive-language-image-mosaic-for)`

CLIM: Contrastive Language-Image Mosaic for Region Representation

18 Dec 2023 · Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Wentao Liu, Chen Change Loy

Detecting objects accurately from a large or open vocabulary necessitates the vision-language alignment on region representations. However, learning such a region-text alignment by obtaining high-quality box annotations with text labels or descriptions is expensive and infeasible. In contrast, collecting image-text pairs is simpler but lacks precise object location information to associate regions with texts. In this paper, we propose a novel approach called Contrastive Language-Image Mosaic (CLIM), which leverages large-scale image-text pairs effectively for aligning region and text representations. CLIM combines multiple images into a mosaicked image and treats each image as a 'pseudo region'. The feature of each pseudo region is extracted and trained to be similar to the corresponding text embedding while dissimilar from others by a contrastive loss, enabling the model to learn the region-text alignment without costly box annotations. As a generally applicable approach, CLIM consistently improves different open-vocabulary object detection methods that use caption supervision. Furthermore, CLIM can effectively enhance the region representation of vision-language models, thus providing stronger backbones for open-vocabulary object detectors. Our experimental results demonstrate that CLIM improves different baseline open-vocabulary object detectors by a large margin on both OV-COCO and OV-LVIS benchmarks. The code is available at https://github.com/wusize/CLIM.
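
The mosaic idea in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation (see the linked repository for that). It assumes a toy setting in which each "image" is already a grid of feature vectors, so the image encoder is skipped; mean pooling stands in for a detector's region feature extractor; and the loss is a symmetric InfoNCE with a temperature of 0.07. The function names `make_mosaic`, `pool_region`, and `contrastive_loss` are hypothetical and chosen only for this example.

```python
import math

def make_mosaic(images, grid):
    """Stitch grid*grid equally sized images into one mosaic.

    Each image is a list of rows; each row is a list of D-dim feature vectors.
    Returns the mosaic and one (x0, y0, x1, y1) pseudo-region box per image.
    """
    assert len(images) == grid * grid
    h, w = len(images[0]), len(images[0][0])
    mosaic, boxes = [], []
    for gy in range(grid):
        tiles = images[gy * grid:(gy + 1) * grid]
        for y in range(h):
            # Concatenate row y of every tile in this mosaic row.
            mosaic.append([px for tile in tiles for px in tile[y]])
        for gx in range(grid):
            boxes.append((gx * w, gy * h, (gx + 1) * w, (gy + 1) * h))
    return mosaic, boxes

def pool_region(feature_map, box):
    """Mean-pool the features inside a box (assumed stand-in for RoI pooling)."""
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    dim = len(cells[0])
    return [sum(c[d] for c in cells) / len(cells) for d in range(dim)]

def normalize(v):
    """L2-normalize a vector, leaving an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the target of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE between pseudo-region and text embeddings (assumed form)."""
    r = [normalize(v) for v in region_feats]
    t = [normalize(v) for v in text_feats]
    logits = [[sum(a * b for a, b in zip(ri, tj)) / temperature for tj in t]
              for ri in r]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy data: four 2x2 "images" whose pixels already carry 4-dim features,
# and four caption embeddings that match them one to one.
basis = [[1.0 if i == k else 0.0 for i in range(4)] for k in range(4)]
images = [[[basis[k][:] for _ in range(2)] for _ in range(2)] for k in range(4)]
texts = [b[:] for b in basis]

mosaic, boxes = make_mosaic(images, grid=2)
regions = [pool_region(mosaic, b) for b in boxes]
matched = contrastive_loss(regions, texts)
shuffled = contrastive_loss(regions, texts[1:] + texts[:1])
print(boxes)
print(matched < shuffled)
```

In this toy setup the loss is small when each pseudo region is paired with its own caption and large when the captions are rotated by one position, which is the signal the contrastive objective uses to align regions with text. The box coordinates come for free from the mosaic layout, so no box annotations are involved.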

Code

wusize/clim official

Tasks

Object Detection

Open Vocabulary Object Detection

Datasets

MS COCO

LVIS

MSCOCO

Results from the Paper

Ranked #6 on Open Vocabulary Object Detection on LVIS v1.0


