TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Referring Expression Segmentation	A2D Sentences	mmmmtbvs	Precision@0.5	0.645	# 14
Referring Expression Segmentation	A2D Sentences	mmmmtbvs	Precision@0.9	0.13	# 10
Referring Expression Segmentation	A2D Sentences	mmmmtbvs	IoU overall	0.673	# 11
Referring Expression Segmentation	A2D Sentences	mmmmtbvs	IoU mean	0.558	# 14
Referring Expression Segmentation	A2D Sentences	mmmmtbvs	Precision@0.6	0.597	# 12
Referring Expression Segmentation	A2D Sentences	mmmmtbvs	Precision@0.7	0.523	# 11
Referring Expression Segmentation	A2D Sentences	mmmmtbvs	Precision@0.8	0.375	# 10
Referring Expression Segmentation	A2D Sentences	mmmmtbvs	AP	0.419	# 10

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/modeling-motion-with-multi-modal-features-for/referring-expression-segmentation-on-a2d)](https://paperswithcode.com/sota/referring-expression-segmentation-on-a2d?p=modeling-motion-with-multi-modal-features-for)`

Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

CVPR 2022 · Wangbo Zhao, Kai Wang, Xiangxiang Chu, Fuzhao Xue, Xinchao Wang, Yang You

Text-based video segmentation aims to segment the target object in a video based on a describing sentence. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial yet has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation. Specifically, we propose a multi-modal video transformer, which can fuse and aggregate multi-modal and temporal features between frames. Furthermore, we design a language-guided feature fusion module to progressively fuse appearance and motion features in each feature level with guidance from linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences verify the performance and the generalization ability of our method compared to the state-of-the-art methods.

Code

wangbo-zhao/2022cvpr-mmmmtbvs official

Tasks

Optical Flow Estimation

Referring Expression Segmentation

Segmentation

Sentence

Video Segmentation

Video Semantic Segmentation

Datasets

JHMDB

A2D

A2D Sentences

Results from the Paper

Ranked #10 on Referring Expression Segmentation on A2D Sentences
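
The metrics reported for this benchmark can be illustrated with a minimal sketch. It assumes the definitions commonly used for A2D Sentences: "IoU overall" is total intersection divided by total union over all test samples, "IoU mean" is the average of per-sample IoU, Precision@K is the fraction of samples whose IoU exceeds K, and AP is the mean precision over thresholds 0.50 to 0.95 in steps of 0.05. The function names (`intersection_and_union`, `evaluate`) and the handling of empty masks are hypothetical choices for illustration; the evaluation script in the official repository may differ in details.

```python
from typing import Dict, Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask stored as rows of 0/1 values


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def evaluate(
    pairs: Iterable[Tuple[Mask, Mask]],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[str, float]:
    """Compute IoU overall, IoU mean, Precision@K and AP over (pred, target) pairs."""
    total_inter = total_union = 0
    ious = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
        # Assumption: a sample where both masks are empty scores IoU 0.
        ious.append(inter / union if union else 0.0)

    n = len(ious)

    def precision_at(k: float) -> float:
        # Fraction of samples whose IoU is strictly above the threshold.
        return sum(iou > k for iou in ious) / n if n else 0.0

    results = {
        "IoU overall": total_inter / total_union if total_union else 0.0,
        "IoU mean": sum(ious) / n if n else 0.0,
    }
    for k in thresholds:
        results[f"Precision@{k}"] = precision_at(k)

    # Assumption: AP averages precision over thresholds 0.50, 0.55, ..., 0.95.
    ap_thresholds = [0.5 + 0.05 * i for i in range(10)]
    results["AP"] = sum(precision_at(k) for k in ap_thresholds) / len(ap_thresholds)
    return results
```

Because "IoU overall" pools pixels across samples, large objects weigh more heavily in it than in "IoU mean", which is one reason the two values can differ noticeably for the same model.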

Methods

ALIGN
