TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Referring Video Object Segmentation	MeViS	DsHmp	J&F	46.4	# 1
Referring Video Object Segmentation	MeViS	DsHmp	J	43	# 1
Referring Video Object Segmentation	MeViS	DsHmp	F	49.8	# 1
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	DsHmp (Video-Swin-Base)	J&F	67.1	# 6
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	DsHmp (Video-Swin-Base)	J	65	# 6
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	DsHmp (Video-Swin-Base)	F	69.1	# 6
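The three metrics in the table are related: J is region similarity (intersection-over-union between predicted and ground-truth masks), F is contour accuracy, and J&F is their mean. A minimal sketch of that arithmetic follows, assuming masks are given as nested lists of 0/1 values. The function names are illustrative and are not taken from any official evaluation toolkit, and the contour-based F computation is omitted.

```python
def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    if union == 0:
        # Both masks empty: treated as a perfect match (an assumed convention).
        return 1.0
    return inter / union


def j_and_f(j, f):
    """J&F: the arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# MeViS row from the table above: J = 43, F = 49.8.
print(round(j_and_f(43.0, 49.8), 1))  # 46.4
```

The same mean applied to the Refer-YouTube-VOS row, (65 + 69.1) / 2 = 67.05, agrees with the reported 67.1 up to rounding of the underlying scores.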

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/decoupling-static-and-hierarchical-motion/referring-video-object-segmentation-on-mevis)](https://paperswithcode.com/sota/referring-video-object-segmentation-on-mevis?p=decoupling-static-and-hierarchical-motion)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/decoupling-static-and-hierarchical-motion/referring-expression-segmentation-on-refer-1)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refer-1?p=decoupling-static-and-hierarchical-motion)`

Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

4 Apr 2024 · Shuting He, Henghui Ding

Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a whole and directly perform identification at the video level, mixing up static image-level cues with temporal motion cues. However, image-level features cannot adequately capture motion cues in sentences, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. Firstly, we introduce an expression-decoupling module that lets static cues and motion cues play their distinct roles, alleviating the issue of sentence embeddings overlooking motion cues. Secondly, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable $\textbf{9.2\%}$ $\mathcal{J\&F}$ improvement on the challenging $\textbf{MeViS}$ dataset. Code is available at https://github.com/heshuting555/DsHmp.
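The abstract does not specify the form of the contrastive objective. As a rough illustration of the general idea only, the sketch below implements a generic InfoNCE-style loss in plain Python: an anchor embedding is pulled toward one positive and pushed away from several negatives. Everything here is an assumption for illustration, including the function names, the use of cosine similarity, and the temperature value; it is not the loss used in DsHmp.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: -log softmax score of the positive pair.

    The temperature default is an arbitrary illustrative choice.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[0]


# Hypothetical 2-D "motion embeddings" for two visually similar objects.
anchor = [1.0, 0.0]
same_motion = [0.9, 0.1]
other_motion = [0.0, 1.0]

low = info_nce(anchor, same_motion, [other_motion])
high = info_nce(anchor, other_motion, [same_motion])
print(low < high)  # True
```

The loss is small when the anchor is closest to its positive and grows when a negative is the nearer neighbor, which is the property a contrastive term relies on to separate embeddings of different motions.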


Code


heshuting555/dshmp official

Tasks


Contrastive Learning

Referring Expression

Referring Expression Segmentation

Referring Video Object Segmentation

Sentence

Sentence Embeddings

Video Segmentation

Video Semantic Segmentation

Datasets

Refer-YouTube-VOS

MeViS

Results from the Paper


Ranked #1 on Referring Video Object Segmentation on MeViS


Methods


Contrastive Learning
