Putting the Object Back into Video Object Segmentation

19 Oct 2023 · Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, Alexander Schwing

We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles due to matching noise, especially in the presence of distractors, resulting in lower performance on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Via those, it interacts with the bottom-up pixel features iteratively with a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster. Code is available at: https://hkchengrex.github.io/Cutie
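To make the idea of foreground-background masked attention more concrete, here is a minimal pure-Python sketch of a single masked cross-attention step between object queries and pixel features. It is an illustration under stated assumptions, not the authors' implementation: the function name `masked_cross_attention`, the rule that the first half of the queries sees only foreground pixels and the second half only background pixels, and the absence of learned projections are all assumptions made for this example.

```python
import math


def masked_cross_attention(queries, pixel_feats, fg_mask):
    """One masked cross-attention step (illustrative sketch only).

    queries:     list of N query vectors, each a list of D floats
    pixel_feats: list of P pixel feature vectors, each a list of D floats
    fg_mask:     list of P booleans, True where the pixel is foreground

    Assumption: the first N // 2 queries attend only to foreground pixels,
    the remaining queries attend only to background pixels.
    """
    n_fg_queries = len(queries) // 2
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    readouts = []
    for qi, q in enumerate(queries):
        want_fg = qi < n_fg_queries
        allowed = [p for p, is_fg in enumerate(fg_mask) if bool(is_fg) == want_fg]
        if not allowed:
            # Assumed fallback: if the mask leaves nothing, attend to all pixels.
            allowed = list(range(len(pixel_feats)))
        # Scaled dot-product logits over the allowed pixels only.
        logits = [scale * sum(a * b for a, b in zip(q, pixel_feats[p]))
                  for p in allowed]
        # Numerically stable softmax.
        peak = max(logits)
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        # Attention-weighted sum of the allowed pixel features.
        readouts.append([
            sum(w / total * pixel_feats[p][d] for w, p in zip(weights, allowed))
            for d in range(dim)
        ])
    return readouts


# Two foreground pixels and one background pixel, two object queries.
pixels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
mask = [True, True, False]
qs = [[0.5, 0.5], [0.5, 0.5]]
result = masked_cross_attention(qs, pixels, mask)
```

In this toy input both queries are identical, yet the first one reads out only foreground features and the second only background features, which is the separation the abstract attributes to the masking.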

Task (all rows): Semi-Supervised Video Object Segmentation

| Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|
| BURST-test | Cutie (base, MEGA, 600 pixels) | HOTA (all) | 66.0 | #1 |
| | | HOTA (common) | 66.5 | #1 |
| | | HOTA (uncommon) | 65.9 | #1 |
| BURST-test | Cutie (base, with mose, 600 pixels) | HOTA (all) | 62.6 | #2 |
| | | HOTA (common) | 63.8 | #2 |
| | | HOTA (uncommon) | 62.3 | #2 |
| BURST-val | Cutie (base, MEGA, 600 pixels) | HOTA (all) | 61.2 | #1 |
| | | HOTA (common) | 65.0 | #1 |
| | | HOTA (uncommon) | 60.3 | #1 |
| BURST-val | Cutie (base, with mose, 600 pixels) | HOTA (all) | 58.4 | #2 |
| | | HOTA (common) | 61.8 | #2 |
| | | HOTA (uncommon) | 57.5 | #2 |
| DAVIS 2017 (test-dev) | Cutie+ (base) | J&F | 85.9 | #3 |
| | | Jaccard (Mean) | 82.6 | #2 |
| | | F-measure (Mean) | 89.2 | #3 |
| | | FPS | 17.9 | #14 |
| DAVIS 2017 (test-dev) | Cutie (base, MEGA) | J&F | 86.1 | #2 |
| | | Jaccard (Mean) | 82.4 | #3 |
| | | F-measure (Mean) | 89.9 | #2 |
| | | FPS | 36.4 | #6 |
| DAVIS 2017 (test-dev) | Cutie+ (base, MEGA) | J&F | 88.1 | #1 |
| | | Jaccard (Mean) | 84.7 | #1 |
| | | F-measure (Mean) | 91.4 | #1 |
| | | FPS | 17.9 | #14 |
| DAVIS 2017 (val) | Cutie+ (base, MEGA) | Jaccard (Mean) | 85.5 | #5 |
| | | F-measure (Mean) | 90.8 | #9 |
| | | J&F | 88.1 | #7 |
| | | Speed (FPS) | 17.9 | #23 |
| DAVIS 2017 (val) | Cutie (base) | Jaccard (Mean) | 84.6 | #7 |
| | | F-measure (Mean) | 91.1 | #6 |
| | | J&F | 87.9 | #8 |
| | | Params(M) | 36.4 | #17 |
| DAVIS 2017 (val) | Cutie+ (base) | Jaccard (Mean) | 87.5 | #1 |
| | | F-measure (Mean) | 93.4 | #1 |
| | | J&F | 90.5 | #1 |
| | | Params(M) | 17.9 | #15 |
| MOSE | Cutie (small) | J&F | 62.2 | #9 |
| | | J | 58.2 | #9 |
| | | F | 66.2 | #9 |
| | | FPS | 45.5 | #1 |
| MOSE | Cutie (small, MEGA) | J&F | 68.6 | #4 |
| | | J | 64.3 | #4 |
| | | F | 72.9 | #4 |
| | | FPS | 45.5 | #1 |
| MOSE | Cutie (base, MEGA) | J&F | 69.9 | #3 |
| | | J | 65.8 | #3 |
| | | F | 74.1 | #3 |
| | | FPS | 36.4 | #4 |
| MOSE | Cutie+ (small, MEGA) | J&F | 70.3 | #2 |
| | | J | 66.0 | #2 |
| | | F | 74.5 | #2 |
| | | FPS | 20.6 | #9 |
| MOSE | Cutie (base) | J&F | 64.0 | #8 |
| | | J | 60.0 | #8 |
| | | F | 67.9 | #8 |
| | | FPS | 36.4 | #4 |
| MOSE | Cutie+ (base, MEGA) | J&F | 71.7 | #1 |
| | | J | 67.6 | #1 |
| | | F | 75.8 | #1 |
| | | FPS | 17.9 | #10 |
| MOSE | Cutie (base, with mose) | J&F | 68.3 | #5 |
| | | J | 64.2 | #5 |
| | | F | 72.3 | #5 |
| | | FPS | 36.4 | #4 |
| MOSE | Cutie (small, with mose) | J&F | 67.4 | #6 |
| | | J | 63.1 | #6 |
| | | F | 71.7 | #6 |
| | | FPS | 45.5 | #1 |
| YouTube-VOS 2018 | Cutie+ (base, MEGA) | F-Measure (Seen) | 91.0 | #1 |
| | | F-Measure (Unseen) | 90.1 | #2 |
| | | Overall | 87.5 | #1 |
| | | Jaccard (Seen) | 86.6 | #1 |
| | | Jaccard (Unseen) | 82.2 | #1 |
| | | Speed (FPS) | 17.9 | #9 |
| YouTube-VOS 2019 | Cutie+ (base, MEGA) | Overall | 87.5 | #1 |
| | | Jaccard (Seen) | 86.3 | #1 |
| | | Jaccard (Unseen) | 82.7 | #3 |
| | | F-Measure (Seen) | 90.6 | #1 |
| | | F-Measure (Unseen) | 90.5 | #1 |
| | | J&F | 17.9 | #3 |
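
The summary scores above are averages of the per-metric scores: J&F is the mean of the Jaccard (J) and F-measure (F) values, and the YouTube-VOS "Overall" score is the mean of the four seen/unseen J and F values. The short sketch below recomputes a few of them from the listed numbers. The helper names `j_and_f`, `youtube_vos_overall`, and `matches_reported` are hypothetical, and because the listed values are already rounded to one decimal, a recomputed score can differ from the reported one by up to about 0.1.

```python
def j_and_f(jaccard_mean, f_mean):
    """J&F is the arithmetic mean of the Jaccard and F-measure scores."""
    return (jaccard_mean + f_mean) / 2.0


def youtube_vos_overall(j_seen, j_unseen, f_seen, f_unseen):
    """YouTube-VOS overall score: mean of the four seen/unseen J and F scores."""
    return (j_seen + j_unseen + f_seen + f_unseen) / 4.0


def matches_reported(recomputed, reported, tol=0.1):
    """True if a recomputed score agrees with the reported one up to rounding."""
    return abs(recomputed - reported) <= tol + 1e-9


# DAVIS 2017 (test-dev), Cutie+ (base): J 82.6, F 89.2, reported J&F 85.9
davis_testdev = j_and_f(82.6, 89.2)

# MOSE, Cutie+ (base, MEGA): J 67.6, F 75.8, reported J&F 71.7
mose_best = j_and_f(67.6, 75.8)

# YouTube-VOS 2018, Cutie+ (base, MEGA): reported Overall 87.5
ytvos_2018 = youtube_vos_overall(86.6, 82.2, 91.0, 90.1)

checks = [
    matches_reported(davis_testdev, 85.9),
    matches_reported(mose_best, 71.7),
    matches_reported(ytvos_2018, 87.5),
]
```

A recomputation like this is also a cheap consistency check when copying leaderboard rows by hand.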

Methods