Image Segmentation Using Text and Image Prompts

CVPR 2022 · Timo Lüddecke, Alexander S. Ecker

Image segmentation is usually addressed by training a model for a fixed set of object classes. Incorporating additional classes or more complex queries later is expensive as it requires re-training the model on a dataset that encompasses these expressions. Here we propose a system that can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text or an image. This approach enables us to create a unified model (trained once) for three common segmentation tasks, which come with distinct challenges: referring expression segmentation, zero-shot segmentation and one-shot segmentation. We build upon the CLIP model as a backbone which we extend with a transformer-based decoder that enables dense prediction. After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query. We analyze different variants of the latter image-based prompts in detail. This novel hybrid input allows for dynamic adaptation not only to the three segmentation tasks mentioned above, but to any binary segmentation task where a text or image query can be formulated. Finally, we find our system to adapt well to generalized queries involving affordances or properties. Code is available at https://eckerlab.org/code/clipseg.
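The abstract describes a single model that turns either a text prompt or an image prompt into a binary segmentation map. The sketch below illustrates only that input/output contract, under a deliberately simplified assumption. Each pixel's feature vector is scored against one prompt embedding by cosine similarity, the score is passed through a sigmoid, and the result is thresholded. The function names, the `temperature` parameter and the toy feature values are hypothetical. CLIPSeg itself uses CLIP embeddings and a transformer-based decoder, not this dot-product rule.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
FeatureMap = List[List[Vector]]  # height x width grid of per-pixel feature vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def probability_map(features: FeatureMap, prompt: Vector,
                    temperature: float = 10.0) -> List[List[float]]:
    """Hypothetical per-pixel foreground probability for one prompt.

    `prompt` is a single embedding. In this toy it could stand for a text or
    an image prompt, since only the shared embedding space matters here.
    """
    return [[sigmoid(temperature * cosine(f, prompt)) for f in row]
            for row in features]


def binarize(probs: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Turn probabilities into a binary mask (1 = pixel matches the prompt)."""
    return [[1 if p > threshold else 0 for p in row] for row in probs]


# Toy 2x2 "image" with made-up 2-dimensional features.
features = [[(1.0, 0.0), (0.0, 1.0)],
            [(0.9, 0.1), (-1.0, 0.0)]]
prompt_embedding = (1.0, 0.0)  # hypothetical embedding of a text or image prompt
mask = binarize(probability_map(features, prompt_embedding))
print(mask)  # [[1, 0], [1, 0]]
```

Swapping `prompt_embedding` for a different vector changes the mask without retraining anything. That is the property the abstract emphasises, although the real model learns the decoder that produces the scores.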

Code

timojl/clipseg (official)

huggingface/transformers

casia-iva-lab/fastsam

openrobotlab/ov_parts

Tasks

Image Segmentation

Multi-modal image segmentation

One-Shot Segmentation

Referring Expression

Referring Expression Segmentation

Referring Image Matting (Expression-based)

Referring Image Matting (Keyword-based)

Referring Image Matting (RefMatte-RW100)

Segmentation

Semantic Segmentation

Zero Shot Segmentation

Datasets

ImageNet

LVIS

PASCAL-5i

PhraseCut

RefMatte

Results from the Paper

Ranked #3 on Referring Image Matting (Keyword-based) on RefMatte

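Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Referring Image Matting (Keyword-based)	RefMatte	CLIPSeg (ViT-B/16)	SAD	17.75	#3
Referring Image Matting (Keyword-based)	RefMatte	CLIPSeg (ViT-B/16)	MSE	0.0064	#3
Referring Image Matting (Keyword-based)	RefMatte	CLIPSeg (ViT-B/16)	MAD	0.0101	#3
Referring Image Matting (Keyword-based)	RefMatte	CLIPSeg (ViT-B/16)	SAD(E)	18.69	#3
Referring Image Matting (Keyword-based)	RefMatte	CLIPSeg (ViT-B/16)	MSE(E)	0.0067	#3
Referring Image Matting (Keyword-based)	RefMatte	CLIPSeg (ViT-B/16)	MAD(E)	0.0106	#3
Referring Image Matting (RefMatte-RW100)	RefMatte	CLIPSeg (ViT-B/16)	SAD	211.86	#4
Referring Image Matting (RefMatte-RW100)	RefMatte	CLIPSeg (ViT-B/16)	MSE	0.1178	#4
Referring Image Matting (RefMatte-RW100)	RefMatte	CLIPSeg (ViT-B/16)	MAD	0.1222	#4
Referring Image Matting (RefMatte-RW100)	RefMatte	CLIPSeg (ViT-B/16)	SAD(E)	222.37	#4
Referring Image Matting (RefMatte-RW100)	RefMatte	CLIPSeg (ViT-B/16)	MSE(E)	0.1236	#4
Referring Image Matting (RefMatte-RW100)	RefMatte	CLIPSeg (ViT-B/16)	MAD(E)	0.1282	#4
Referring Image Matting (Expression-based)	RefMatte	CLIPSeg (ViT-B/16)	SAD	69.13	#3
Referring Image Matting (Expression-based)	RefMatte	CLIPSeg (ViT-B/16)	MSE	0.0358	#3
Referring Image Matting (Expression-based)	RefMatte	CLIPSeg (ViT-B/16)	MAD	0.0394	#3
Referring Image Matting (Expression-based)	RefMatte	CLIPSeg (ViT-B/16)	SAD(E)	73.53	#3
Referring Image Matting (Expression-based)	RefMatte	CLIPSeg (ViT-B/16)	MSE(E)	0.0381	#3
Referring Image Matting (Expression-based)	RefMatte	CLIPSeg (ViT-B/16)	MAD(E)	0.0419	#3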
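SAD, MSE and MAD in the table are error measures for predicted alpha mattes, so lower is better. The sketch below assumes common alpha-matting conventions:

- Alpha values lie in [0, 1].
- SAD is the sum of absolute differences divided by 1000.
- MSE and MAD are per-pixel means over the whole image.

These conventions are an assumption. The exact normalisation used for the RefMatte numbers is not given on this page, and the sketch does not attempt the (E) variants.

```python
from typing import Iterator, Sequence, Tuple

Matte = Sequence[Sequence[float]]  # height x width alpha values in [0, 1]


def _pairs(pred: Matte, gt: Matte) -> Iterator[Tuple[float, float]]:
    """Yield (predicted, ground-truth) alpha pairs; shapes must match."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("prediction and ground truth must have the same shape")
    for p_row, g_row in zip(pred, gt):
        yield from zip(p_row, g_row)


def sad(pred: Matte, gt: Matte, scale: float = 1000.0) -> float:
    """Sum of absolute differences, divided by `scale` (assumed to be 1000)."""
    return sum(abs(p - g) for p, g in _pairs(pred, gt)) / scale


def mse(pred: Matte, gt: Matte) -> float:
    """Mean squared error over all pixels."""
    sq = [(p - g) ** 2 for p, g in _pairs(pred, gt)]
    return sum(sq) / len(sq)


def mad(pred: Matte, gt: Matte) -> float:
    """Mean absolute difference over all pixels."""
    ab = [abs(p - g) for p, g in _pairs(pred, gt)]
    return sum(ab) / len(ab)


pred = [[1.0, 0.5], [0.0, 0.25]]  # made-up predicted matte
gt = [[1.0, 1.0], [0.0, 0.0]]     # made-up ground-truth matte
print(sad(pred, gt, scale=1.0), mse(pred, gt), mad(pred, gt))  # 0.75 0.078125 0.1875
```

MSE and MAD are per-pixel averages, so they stay comparable across image sizes. SAD grows with the number of pixels, and dividing by 1000 only rescales it.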
