Video Instance Segmentation

85 papers with code • 8 benchmarks • 8 datasets

The goal of video instance segmentation is the simultaneous detection, segmentation and tracking of object instances in videos. In other words, the task extends the image instance segmentation problem to the video domain for the first time.

To facilitate research on this new task, a large-scale benchmark called YouTube-VIS was built. It consists of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks.
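
As a rough illustration of the task definition above, the output for one video can be pictured as a set of instance tracks, each carrying a category and score (detection), one mask per frame (segmentation), and a single identity that persists across frames (tracking). The sketch below is a toy data structure, not the YouTube-VIS annotation format; the class name `InstanceTrack`, its fields, and the pixel-set mask representation are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Pixel = Tuple[int, int]  # (row, col) coordinate of a foreground pixel


@dataclass
class InstanceTrack:
    """Hypothetical container for one object instance followed through a video."""
    category: str                      # detection: what the object is
    score: float                       # detection: confidence for the whole track
    masks: List[Optional[Set[Pixel]]]  # segmentation: one entry per frame, None if absent

    def visible_frames(self) -> List[int]:
        """Frame indices in which this instance has a mask (tracking over time)."""
        return [t for t, mask in enumerate(self.masks) if mask is not None]


# Toy 3-frame video: one instance that is hidden in frame 1.
track = InstanceTrack(
    category="person",
    score=0.9,
    masks=[{(0, 0), (0, 1)}, None, {(1, 1)}],
)
visible = track.visible_frames()
```

Because the list index is the frame number and the track object is the identity, an instance that disappears and reappears keeps the same track rather than starting a new one.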

Benchmarks

These leaderboards are used to track progress in Video Instance Segmentation.

Dataset	Best Model
YouTube-VIS validation	DVIS++(VIT-L, Offline)
OVIS validation	DVIS++(VIT-L, Offline)
YouTube-VIS 2021	DVIS++(VIT-L, Offline)
YouTube-VIS 2022 Validation	DVIS++(VIT-L)
BDD100K val	PCAN
HQ-YTVIS	VMT (Swin-L)
YouTube-VIS	STC
YouTube-VIS (trained with no video masks)	MaskFreeVIS

Libraries

Use these libraries to find Video Instance Segmentation models and implementations

hustvl/QueryInst • 3 papers • 400 stars

open-mmlab/mmdetection • 2 papers • 27,866 stars

open-mmlab/mmtracking • 2 papers • 3,384 stars

wjf5203/vnext • 2 papers • 593 stars

See all 7 libraries.

Most implemented papers

SeqFormer: Sequential Transformer for Video Instance Segmentation

wjf5203/SeqFormer • 15 Dec 2021

Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms shall be done with each frame independently.

RankSeg: Adaptive Pixel Classification with Image Category Ranking for Segmentation

openseg-group/rankseg • 8 Mar 2022

Given an input image or video, our framework first conducts multi-label classification over the complete label, then sorts the complete label and selects a small subset according to their class confidence scores.
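
The rank-then-select step described in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the RankSeg implementation: it assumes the per-class confidence scores have already been produced by some multi-label classifier, and the function name `select_labels`, the parameter `k`, and the example scores are hypothetical.

```python
from typing import Dict, List


def select_labels(class_scores: Dict[str, float], k: int) -> List[str]:
    """Sort the complete label set by confidence and keep the top-k labels.

    `class_scores` maps every label to a confidence score (assumed given);
    `k` is an assumed subset size, since the excerpt only says "small subset".
    """
    ranked = sorted(class_scores, key=class_scores.get, reverse=True)
    return ranked[:k]


# Made-up scores for a four-label set.
scores = {"person": 0.92, "dog": 0.75, "car": 0.10, "boat": 0.03}
selected = select_labels(scores, k=2)
```

Under this reading, the later pixel classification only has to choose among the selected labels instead of the complete label set.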

MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training

nvlabs/minvis • 3 Aug 2022

By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP.

Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation

lxtgh/tube-link • ICCV 2023

Our framework is a near-online approach that takes a short subclip as input and outputs the corresponding spatial-temporal tube masks.
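
To make the "short subclip as input" idea concrete, the sketch below cuts a video's frame indices into consecutive subclips. It is an assumption-laden illustration and not Tube-Link's code: the function name `split_into_subclips`, the non-overlapping windows, and the clip length are hypothetical, and the step that links tube masks across subclips is not shown.

```python
from typing import List


def split_into_subclips(num_frames: int, clip_len: int) -> List[List[int]]:
    """Partition frame indices 0..num_frames-1 into consecutive subclips.

    Non-overlapping windows are an assumption made for this sketch;
    the final subclip may be shorter than `clip_len`.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [
        list(range(start, min(start + clip_len, num_frames)))
        for start in range(0, num_frames, clip_len)
    ]


# A 7-frame video processed three frames at a time.
clips = split_into_subclips(num_frames=7, clip_len=3)
```

Each subclip would be processed as one unit, which is what places such a scheme between per-frame (online) and whole-video (offline) processing.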

Spatio-temporal Prompting Network for Robust Video Feature Extraction

guanxiongsun/vfe.pytorch • ICCV 2023

Then, these video prompts are prepended to the patch embeddings of the current frame as the updated input for video feature extraction.
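
The prepending step in this excerpt amounts to concatenating two token sequences along the token axis. The sketch below shows only that operation with plain Python lists; it is not the authors' code, and the function name `prepend_prompts`, the token counts, and the embedding dimension are assumptions chosen for illustration.

```python
from typing import List

Token = List[float]  # one embedding vector


def prepend_prompts(prompts: List[Token], patches: List[Token]) -> List[Token]:
    """Return the updated input sequence: prompt tokens first, then patch tokens."""
    dims = {len(token) for token in prompts + patches}
    if len(dims) > 1:
        raise ValueError("all tokens must share one embedding dimension")
    return prompts + patches


# Two hypothetical video prompts and three patch embeddings, each of dimension 4.
video_prompts = [[0.1] * 4, [0.2] * 4]
patch_embeddings = [[1.0] * 4, [2.0] * 4, [3.0] * 4]
updated_input = prepend_prompts(video_prompts, patch_embeddings)
```

The sequence grows by the number of prompts while the patch tokens keep their relative order.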

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

opengvlab/internvideo2 • 22 Mar 2024

We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

Efficient Video Object Segmentation via Network Modulation

linjieyangsc/video_seg • CVPR 2018

Video object segmentation targets at segmenting a specific object throughout a video sequence, given only an annotated first frame.

Instance-wise Depth and Motion Learning from Monocular Videos

SeokjuLee/Insta-DM • 19 Dec 2019

We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
