TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Generalized Referring Expression Segmentation	gRefCOCO	VLT	gIoU	52.00	# 6
Generalized Referring Expression Segmentation	gRefCOCO	VLT	cIoU	52.51	# 4
Generalized Referring Expression Comprehension	gRefCOCO	VLT	Precision@(F1=1, IoU≥0.5)	36.6	# 3
Generalized Referring Expression Comprehension	gRefCOCO	VLT	N-acc.	35.2	# 3
Referring Expression Segmentation	RefCOCOg-test	VLT (Darknet53)	Overall IoU	56.65	# 8
Referring Expression Segmentation	RefCOCOg-val	VLT (Darknet53)	Overall IoU	52.99	# 11
Referring Expression Segmentation	RefCOCO testA	VLT	Overall IoU	68.29	# 16
Referring Expression Segmentation	RefCOCO+ testA	VLT	Overall IoU	59.20	# 13
Referring Expression Segmentation	RefCOCO testB	VLT	Overall IoU	62.73	# 12
Referring Expression Segmentation	RefCOCO+ testB	VLT	Overall IoU	49.36	# 13
Referring Expression Segmentation	RefCOCO val	VLT	Overall IoU	65.65	# 16
Referring Expression Segmentation	RefCOCO+ val	VLT	Overall IoU	55.50	# 15
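
The table reports three IoU-style metrics without defining them. The sketch below shows how they are commonly computed, under the assumption that "Overall IoU" and cIoU both mean cumulative intersection over cumulative union across all samples, and that gIoU means the average of per-sample IoU with a correctly predicted empty (no-target) mask scored as 1. These definitions and all function names are illustrative assumptions, not taken from this page or from the authors' evaluation code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 values


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and united foreground pixels of two same-sized masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def cumulative_iou(preds: List[Mask], gts: List[Mask]) -> float:
    """Assumed 'Overall IoU' / cIoU: total intersection over total union."""
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def mean_iou(preds: List[Mask], gts: List[Mask]) -> float:
    """Assumed gIoU: mean of per-sample IoU values."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        # Assumption: empty prediction on an empty target counts as IoU = 1.
        scores.append(1.0 if union == 0 else inter / union)
    return sum(scores) / len(scores)


# Two toy 2x2 samples: per-sample IoU is 1/2 and 1/4.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
print(cumulative_iou(preds, gts))  # (1 + 1) / (2 + 4)
print(mean_iou(preds, gts))        # (0.5 + 0.25) / 2
```

Because the cumulative variant pools pixels across samples, large objects dominate it, while the per-sample mean weights every sample equally. That is one reason the two gRefCOCO rows can differ for the same model.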

Vision-Language Transformer and Query Generation for Referring Segmentation

ICCV 2021 · Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang

In this work, we address the challenging task of referring segmentation. The query expression in referring segmentation typically indicates the target object by describing its relationship with others. Therefore, to find the target among all instances in the image, the model must have a holistic understanding of the whole image. To achieve this, we reformulate referring segmentation as a direct attention problem: finding the region in the image where the query language expression is most attended to. We use a transformer with multi-head attention to build an encoder-decoder network that "queries" the given image with the language expression. Furthermore, we propose a Query Generation Module, which produces multiple sets of queries with different attention weights that represent diversified comprehensions of the language expression from different aspects. At the same time, to choose the best among these diversified comprehensions based on visual clues, we further propose a Query Balance Module that adaptively selects the output features of these queries for better mask generation. Without bells and whistles, our approach is lightweight and consistently achieves new state-of-the-art performance on three referring segmentation datasets: RefCOCO, RefCOCO+, and G-Ref. Our code is available at https://github.com/henghuiding/Vision-Language-Transformer.
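
To make the "querying the image with language" idea concrete, here is a minimal single-head sketch in plain Python. It is not the authors' implementation: the real model uses multi-head transformer layers, and the way queries and confidences are produced is not described on this page. In this sketch `queries` stands in for the output of the Query Generation Module, `confidences` stands in for scores from the Query Balance Module, and a softmax-weighted sum is an assumed stand-in for "adaptively selecting" query outputs. All names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_weights(query: Vector, keys: List[Vector]) -> Vector:
    """Scaled dot-product attention of one language query over pixel features."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Attention-weighted average of the value vectors."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def query_image(queries: List[Vector],
                pixel_feats: List[Vector],
                confidences: Vector) -> Vector:
    """Run several queries over the image, then blend their outputs.

    Assumption: blending uses softmax(confidences) as mixing weights.
    """
    outputs = [attend(q, pixel_feats, pixel_feats) for q in queries]
    balance = softmax(confidences)
    dim = len(outputs[0])
    return [sum(b * o[d] for b, o in zip(balance, outputs)) for d in range(dim)]


# Toy image with two "pixels" and two candidate queries for one expression.
pixel_feats = [[1.0, 0.0], [0.0, 1.0]]
queries = [[10.0, 0.0], [0.0, 10.0]]
print(attention_weights(queries[0], pixel_feats))  # mass concentrates on pixel 0
print(query_image(queries, pixel_feats, confidences=[2.0, -2.0]))
```

In the toy call, the first query attends almost entirely to the first pixel, and the higher confidence assigned to that query pulls the blended output toward it.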

Code

henghuiding/Vision-Language-Transfo… (official, 334 stars)

Tasks

Generalized Referring Expression Comprehension

Generalized Referring Expression Segmentation

Referring Expression Segmentation

Segmentation

Datasets

RefCOCO

Google Refexp

gRefCOCO

Results from the Paper

Ranked #3 on Generalized Referring Expression Comprehension on gRefCOCO

Methods

Linear Layer • Softmax
