TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Referring Expression Segmentation	A2D Sentences	MANET	Precision@0.5	0.734	# 6
Referring Expression Segmentation	A2D Sentences	MANET	Precision@0.9	0.132	# 9
Referring Expression Segmentation	A2D Sentences	MANET	IoU overall	0.726	# 5
Referring Expression Segmentation	A2D Sentences	MANET	IoU mean	0.632	# 7
Referring Expression Segmentation	A2D Sentences	MANET	Precision@0.6	0.682	# 7
Referring Expression Segmentation	A2D Sentences	MANET	Precision@0.7	0.579	# 9
Referring Expression Segmentation	A2D Sentences	MANET	Precision@0.8	0.389	# 9
Referring Expression Segmentation	A2D Sentences	MANET	AP	0.471	# 5
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	MANET	J&F	55.63	# 19
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	MANET	J	54.75	# 19
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	MANET	F	56.51	# 20
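
For readers unfamiliar with the A2D Sentences metrics listed above, the following is a minimal sketch of how they are commonly computed. It assumes the usual definitions: IoU overall is the total intersection divided by the total union across all samples, IoU mean is the average of per-sample IoU, and Precision@K is the fraction of samples whose IoU exceeds K. The function names, the nested-list mask format, the strict `>` comparison, and the handling of empty masks are assumptions made for illustration, not taken from the MANet evaluation code. AP is not covered here.

```python
from typing import Dict, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 (hypothetical format)


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def evaluate(
    preds: Sequence[Mask],
    gts: Sequence[Mask],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[str, float]:
    """Compute IoU overall, IoU mean, and Precision@K over a set of samples."""
    ious = []
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)

    n = len(ious)
    results = {
        "IoU overall": total_inter / total_union if total_union else 1.0,
        "IoU mean": sum(ious) / n,
    }
    for t in thresholds:
        # Assumption: a sample counts as correct when its IoU is strictly above t.
        results[f"Precision@{t}"] = sum(iou > t for iou in ious) / n
    return results
```

For Refer-YouTube-VOS, J is the region similarity (IoU), F is the contour accuracy, and J&F is their average, which matches the values in the table: (54.75 + 56.51) / 2 = 55.63.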

Multi-Attention Network for Compressed Video Referring Object Segmentation

26 Jul 2022 · Weidong Chen, Dexiang Hong, Yuankai Qi, Zhenjun Han, Shuhui Wang, Laiyun Qing, Qingming Huang, Guorong Li

Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation, which increases computation and storage requirements and ultimately slows down inference. This may hamper application in real-world scenarios with limited computing resources, such as autonomous cars and drones. To alleviate this problem, in this paper we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Besides the inherent difficulty of the video referring object segmentation task itself, obtaining a discriminative representation from compressed video is also rather challenging. To address this problem, we propose a multi-attention network which consists of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract an effective representation from compressed data in three modalities, i.e., I-frame, Motion Vector and Residual. The query-based cross-modal Transformer first models the correlation between the linguistic and visual modalities, and then the fused multi-modality features are used to guide object queries to generate a content-aware dynamic kernel and to predict the final segmentation masks. Unlike previous works, we propose to learn just one kernel, which removes the complicated post mask-matching procedure of existing methods. Promising results from extensive experiments on three challenging datasets show the effectiveness of our method compared against several state-of-the-art methods proposed for processing RGB data. Source code is available at: https://github.com/DexiangHong/MANet.
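
The single-kernel idea in the abstract can be illustrated with a minimal sketch. It assumes the dynamic kernel acts like a 1x1 convolution over a per-pixel feature map, that is, a dot product with each pixel's feature vector followed by a sigmoid. The function names, the nested-list feature layout, the bias term, and the 0.5 threshold are all hypothetical choices for illustration and are not taken from the MANet source code.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def apply_dynamic_kernel(
    features: Sequence[Sequence[Sequence[float]]],  # H x W x C feature map (assumed layout)
    kernel: Sequence[float],                        # one length-C kernel predicted per expression
    bias: float = 0.0,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Turn a single content-aware kernel into one binary mask."""
    mask = []
    for row in features:
        mask_row = []
        for pixel in row:
            # 1x1 convolution at this pixel: dot product of kernel and features.
            logit = sum(w * x for w, x in zip(kernel, pixel)) + bias
            mask_row.append(1 if sigmoid(logit) > threshold else 0)
        mask.append(mask_row)
    return mask
```

Because only one kernel is applied, the sketch yields exactly one mask per expression, so there is no set of candidate masks to match or select from afterwards.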

Code

dexianghong/manet official

Tasks

Object

Referring Expression Segmentation

Referring Video Object Segmentation

Segmentation

Semantic Segmentation

Video Object Segmentation

Video Semantic Segmentation

Datasets

A2D

Refer-YouTube-VOS

A2D Sentences

Results from the Paper

Ranked #5 on Referring Expression Segmentation on A2D Sentences

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Attention Network • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
