SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct a well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all of them by a remarkable margin. In addition, the emphasis on temporal coherence enhances the segmentation stability of our method and its adaptability to text expressions that describe temporal variations. Code will be available.

Published at NeurIPS 2023.
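The abstract describes two ingredients: aggregating frame-level object embeddings together with language tokens into video-level object representations, and multi-modal contrastive supervision for video-level alignment. Below is a minimal PyTorch sketch of these two ideas, not the authors' implementation; the module names, the number of video-level slots, the pooling, and the temperature are illustrative assumptions.

```python
# Sketch only: (1) video-level object queries attend jointly to frame-level
# object embeddings and language tokens; (2) a symmetric InfoNCE-style loss
# aligns pooled video object embeddings with sentence embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectClusterSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, num_slots: int = 5):
        super().__init__()
        # num_slots video-level object queries (an assumption, not the paper's value).
        self.video_queries = nn.Parameter(torch.randn(1, num_slots, dim))
        self.cluster_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_obj: torch.Tensor, text_tok: torch.Tensor) -> torch.Tensor:
        """frame_obj: (B, T*N, C) frame-level object embeddings,
        text_tok:  (B, L, C) language token features."""
        memory = torch.cat([frame_obj, text_tok], dim=1)        # unified temporal + textual memory
        q = self.video_queries.expand(frame_obj.size(0), -1, -1)
        video_obj, _ = self.cluster_attn(q, memory, memory)     # video-level object embeddings
        return video_obj


def video_text_contrastive(video_obj: torch.Tensor, sent_emb: torch.Tensor, tau: float = 0.07):
    """Contrastive loss between pooled video object embeddings (B, S, C) and
    sentence embeddings (B, C) across a batch; a stand-in for the paper's
    multi-modal contrastive supervision."""
    v = F.normalize(video_obj.mean(dim=1), dim=-1)
    t = F.normalize(sent_emb, dim=-1)
    logits = v @ t.t() / tau
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```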

Results from the Paper


Values in parentheses give the global rank on the corresponding leaderboard.

Referring Expression Segmentation on A2D Sentences

| Model | Precision@0.5 | Precision@0.6 | Precision@0.7 | Precision@0.8 | Precision@0.9 | AP | IoU overall | IoU mean |
|---|---|---|---|---|---|---|---|---|
| SOC (Video-Swin-T) | 0.79 (#4) | 0.756 (#4) | 0.687 (#4) | 0.535 (#4) | 0.195 (#4) | 0.504 (#4) | 0.747 (#4) | 0.669 (#4) |
| SOC (Video-Swin-B) | 0.851 (#1) | 0.827 (#1) | 0.765 (#2) | 0.607 (#2) | 0.252 (#2) | 0.573 (#2) | 0.807 (#1) | 0.725 (#1) |

Referring Expression Segmentation on J-HMDB

| Model | Precision@0.5 | Precision@0.6 | Precision@0.7 | Precision@0.8 | Precision@0.9 | AP | IoU overall | IoU mean |
|---|---|---|---|---|---|---|---|---|
| SOC (Video-Swin-B) | 0.969 (#2) | 0.914 (#2) | 0.711 (#2) | 0.213 (#2) | 0.001 (#5) | 0.446 (#2) | 0.736 (#2) | 0.723 (#2) |
| SOC (Video-Swin-T) | 0.947 (#3) | 0.864 (#3) | 0.627 (#3) | 0.179 (#4) | 0.001 (#5) | 0.397 (#4) | 0.707 (#3) | 0.701 (#3) |
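The A2D-Sentences and J-HMDB rows above report Precision@K, AP, and overall/mean IoU. The sketch below shows how these metrics are conventionally computed from per-sample mask IoUs, with AP averaged over IoU thresholds 0.50:0.05:0.95; it follows the standard definitions and is not taken from the paper's evaluation code.

```python
# Conventional mask-level metrics for referring segmentation benchmarks.
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0


def segmentation_metrics(preds, gts):
    """preds, gts: lists of boolean masks, one pair per annotated sample."""
    ious = np.array([mask_iou(p, g) for p, g in zip(preds, gts)])
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    metrics = {
        "IoU overall": inter / union,   # dataset-level intersection over union
        "IoU mean": ious.mean(),        # average of per-sample IoUs
        "AP": np.mean([(ious > t).mean() for t in np.arange(0.50, 0.96, 0.05)]),
    }
    for k in (0.5, 0.6, 0.7, 0.8, 0.9):
        metrics[f"Precision@{k}"] = (ious > k).mean()  # fraction of samples above threshold
    return metrics
```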
Referring Video Object Segmentation on Refer-YouTube-VOS

| Model | J&F | J | F |
|---|---|---|---|
| SOC | 66.0 (#4) | 64.1 (#4) | 67.9 (#4) |

Referring Expression Segmentation on Refer-YouTube-VOS (2021 public validation)

| Model | J&F | J | F |
|---|---|---|---|
| SOC (Joint training, Video-Swin-B) | 67.3±0.5 (#5) | 65.3 (#5) | 69.3 (#4) |
| SOC (Video-Swin-T) | 59.2 (#16) | 57.8 (#15) | 60.5 (#15) |
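The Refer-YouTube-VOS rows report region similarity J (mask IoU), contour accuracy F (boundary F-measure), and their average J&F. The following is a simplified sketch of how J and F are typically computed per object and frame; the boundary matching here uses a small dilation tolerance rather than the official DAVIS-style evaluation code.

```python
# Simplified J (region similarity) and F (boundary F-measure) for one mask pair.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion


def boundary(mask: np.ndarray) -> np.ndarray:
    """One-pixel-wide boundary of a binary mask."""
    return np.logical_xor(mask, binary_erosion(mask, border_value=0))


def j_and_f(pred: np.ndarray, gt: np.ndarray, tol: int = 1):
    # J: intersection over union of the two masks.
    union = np.logical_or(pred, gt).sum()
    j = np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

    # F: precision/recall of boundary pixels within a small tolerance band.
    pb, gb = boundary(pred), boundary(gt)
    precision = np.logical_and(pb, binary_dilation(gb, iterations=tol)).sum() / max(pb.sum(), 1)
    recall = np.logical_and(gb, binary_dilation(pb, iterations=tol)).sum() / max(gb.sum(), 1)
    f = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return j, f, (j + f) / 2
```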
