Video Instance Segmentation
85 papers with code • 8 benchmarks • 8 datasets
The goal of video instance segmentation is simultaneous detection, segmentation and tracking of instances in videos. In other words, it extends the image instance segmentation problem to the video domain for the first time.
To facilitate research on this new task, a large-scale benchmark called YouTube-VIS was built; it consists of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks.
Most implemented papers
SeqFormer: Sequential Transformer for Video Instance Segmentation
We observe that a stand-alone instance query suffices to capture a time sequence of instances in a video, but attention should be computed on each frame independently.
RankSeg: Adaptive Pixel Classification with Image Category Ranking for Segmentation
Given an input image or video, our framework first conducts multi-label classification over the complete label set, then sorts the labels and selects a small subset according to their class confidence scores.
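The ranking-and-selection step described above can be illustrated with a minimal sketch. It assumes the multi-label classifier has already produced one confidence score per category; the function name `select_top_k_labels`, the example scores and the choice of `k` are hypothetical and not taken from the paper.

```python
def select_top_k_labels(scores, k):
    """Return the k label names with the highest confidence, best first.

    `scores` maps each label name to a confidence score (assumed to come
    from a multi-label classifier run over the complete label set).
    """
    # Sort (label, score) pairs by descending score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Keep only the label names of the k best-scoring categories.
    return [label for label, _ in ranked[:k]]


# Hypothetical confidence scores for a single input.
scores = {"person": 0.92, "dog": 0.75, "car": 0.10, "boat": 0.03}
selected = select_top_k_labels(scores, k=2)
print(selected)
```

Pixel classification would then be restricted to the selected subset instead of the complete label set.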
MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training
By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP.
Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation
Our framework is a near-online approach that takes a short subclip as input and outputs the corresponding spatial-temporal tube masks.
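As a rough illustration of subclip-based processing, the sketch below splits a video's frame indices into short consecutive subclips. Non-overlapping subclips and the helper name `split_into_subclips` are assumptions made for this example; how tubes are linked across subclips is not shown.

```python
def split_into_subclips(num_frames, clip_len):
    """Split frame indices 0..num_frames-1 into consecutive subclips.

    The last subclip may be shorter when num_frames is not a multiple
    of clip_len. Non-overlapping subclips are an assumption of this sketch.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    clips = []
    for start in range(0, num_frames, clip_len):
        end = min(start + clip_len, num_frames)
        clips.append(list(range(start, end)))
    return clips


subclips = split_into_subclips(num_frames=7, clip_len=3)
print(subclips)
```

A near-online model would consume these subclips one at a time, predicting a spatio-temporal mask tube for each before moving to the next.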
Spatio-temporal Prompting Network for Robust Video Feature Extraction
Then, these video prompts are prepended to the patch embeddings of the current frame as the updated input for video feature extraction.
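Prepending prompts amounts to concatenating token sequences along the sequence axis. The sketch below shows this with plain Python lists standing in for embedding tensors; the function name `prepend_prompts` and the toy 2-dimensional embeddings are hypothetical, and how the prompts themselves are produced is outside its scope.

```python
def prepend_prompts(prompts, patch_embeddings):
    """Return a new token sequence with prompts placed before patch tokens.

    Both arguments are lists of equal-length embedding vectors
    (lists of floats). The inputs are not modified.
    """
    tokens = [list(vec) for vec in prompts] + [list(vec) for vec in patch_embeddings]
    # All tokens must share one embedding dimension to form a valid input.
    if len({len(vec) for vec in tokens}) > 1:
        raise ValueError("prompts and patch embeddings must have the same dimension")
    return tokens


# Toy example: 2 prompt tokens and 3 patch tokens, embedding dimension 2.
prompts = [[0.1, 0.2], [0.3, 0.4]]
patches = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
tokens = prepend_prompts(prompts, patches)
print(len(tokens))
```

The resulting sequence is longer than the original patch sequence by the number of prompts, so a backbone that consumes it must accept variable-length token input.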
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Efficient Video Object Segmentation via Network Modulation
Video object segmentation aims to segment a specific object throughout a video sequence, given only an annotated first frame.
Instance-wise Depth and Motion Learning from Monocular Videos
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Learning a Spatio-Temporal Embedding for Video Instance Segmentation
We present a novel embedding approach for video instance segmentation.
STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos.