Many of the recent successful methods for video object segmentation (VOS) are overly complicated, heavily rely on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use.
Ranked #1 on Semi-Supervised Video Object Segmentation on YouTube
Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11x less GPU memory usage.
Ranked #22 on Semantic Segmentation on Cityscapes test
Fourth, in order to shed light on the potential of self-supervised learning on the task of video correspondence flow, we probe the upper bound by training on additional data, \ie more diverse videos, further demonstrating significant improvements on video segmentation.
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view.
Ranked #1 on Unsupervised Video Object Segmentation on YouTube
Specifically, to integrate the insights of matching based and propagation based methods, we employ an encoder-decoder framework to learn pixel-level similarity and segmentation in an end-to-end manner.
In our framework, the past frames with object masks form an external memory, and the current frame as the query is segmented using the mask information in the memory.