General Object Foundation Model for Images and Videos at Scale

14 Dec 2023 · Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai

In this work, we present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, enabling the model to solve various object-centric downstream tasks simultaneously while maintaining state-of-the-art performance. Trained on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE can be integrated into Large Language Models, serving as a foundational model that provides universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io.

Results from the Paper


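| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | HOTA (all) | 22.6 | #1 |
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | mAP (all) | 12.6 | #1 |
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | HOTA (com) | 36.4 | #1 |
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | mAP (com) | 18.9 | #1 |
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | HOTA (unc) | 19.1 | #1 |
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | mAP (unc) | 11.0 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | HOTA (all) | 22.6 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | mAP (all) | 12.6 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | HOTA (com) | 36.4 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | mAP (com) | 18.9 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | HOTA (unc) | 19.1 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | mAP (unc) | 11.0 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | HOTA (all) | 26.9 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | mAP (all) | 17.2 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | HOTA (com) | 38.8 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | mAP (com) | 23.7 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | HOTA (unc) | 23.9 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | mAP (unc) | 15.5 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | HOTA (all) | 31.2 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | mAP (all) | 19.2 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | HOTA (com) | 48.7 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | mAP (com) | 24.8 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | HOTA (unc) | 26.9 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | mAP (unc) | 17.7 | #1 |
| Instance Segmentation | COCO minival | GLEE-Lite | mask AP | 48.4 | #35 |
| Object Detection | COCO minival | GLEE-Lite | box AP | 55.0 | #50 |
| Object Detection | COCO minival | GLEE-Plus | box AP | 60.4 | #21 |
| Object Detection | COCO minival | GLEE-Pro | box AP | 62.0 | #14 |
| Instance Segmentation | COCO minival | GLEE-Pro | mask AP | 54.2 | #5 |
| Instance Segmentation | COCO minival | GLEE-Plus | mask AP | 53.0 | #9 |
| Instance Segmentation | COCO test-dev | GLEE-Pro | mask AP | 54.5 | #6 |
| Object Detection | COCO test-dev | GLEE-Plus | box mAP | 60.6 | #24 |
| Object Detection | COCO test-dev | GLEE-Lite | box mAP | 54.7 | #49 |
| Object Detection | COCO test-dev | GLEE-Pro | box mAP | 62.3 | #20 |
| Instance Segmentation | COCO test-dev | GLEE-Lite | mask AP | 48.3 | #27 |
| Instance Segmentation | COCO test-dev | GLEE-Plus | mask AP | 53.3 | #9 |
| Object Detection | LVIS v1.0 val | GLEE-Pro | box AP | 55.7 | #5 |
| Instance Segmentation | LVIS v1.0 val | GLEE-Pro | mask AP | 49.9 | #4 |
| Video Instance Segmentation | OVIS validation | GLEE-Pro | mask AP | 50.4 | #2 |
| Video Instance Segmentation | OVIS validation | GLEE-Pro | AP75 | 55.5 | #2 |
| Referring Expression Comprehension | RefCoco+ | GLEE-Pro | Val | 82.6 | #6 |
| Referring Expression Comprehension | RefCOCO | GLEE-Pro | Val | 91.0 | #4 |
| Referring Expression Segmentation | RefCOCO | GLEE-Pro | IoU | 80.0 | #1 |
| Referring Expression Segmentation | RefCOCOg-val | GLEE-Pro | Overall IoU | 72.9 | #4 |
| Referring Expression Comprehension | RefCOCOg-val | GLEE-Pro | Accuracy | 86.4 | #5 |
| Referring Expression Segmentation | RefCoCo val | GLEE-Pro | Overall IoU | 80.0 | #4 |
| Referring Expression Segmentation | RefCOCO+ val | GLEE-Pro | Overall IoU | 69.6 | #6 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Plus | J&F | 67.7 | #2 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Plus | J | 65.6 | #2 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Plus | F | 69.7 | #2 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Pro | J&F | 70.6 | #1 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Pro | J | 68.2 | #1 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Pro | F | 72.9 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | GLEE-Pro | J&F | 70.6 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | GLEE-Pro | J | 68.2 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | GLEE-Pro | F | 72.9 | #1 |
| Multi-Object Tracking | TAO | GLEE-Pro | TETA | 47.2 | #1 |
| Multi-Object Tracking | TAO | GLEE-Pro | LocA | 66.2 | #1 |
| Multi-Object Tracking | TAO | GLEE-Pro | AssocA | 46.2 | #1 |
| Multi-Object Tracking | TAO | GLEE-Pro | ClsA | 29.1 | #2 |
| Multi-Object Tracking | TAO | GLEE-Plus | TETA | 41.5 | #2 |
| Multi-Object Tracking | TAO | GLEE-Plus | LocA | 52.9 | #4 |
| Multi-Object Tracking | TAO | GLEE-Plus | AssocA | 40.9 | #2 |
| Multi-Object Tracking | TAO | GLEE-Plus | ClsA | 30.8 | #1 |
| Multi-Object Tracking | TAO | GLEE-Lite | TETA | 40.1 | #3 |
| Multi-Object Tracking | TAO | GLEE-Lite | LocA | 56.3 | #3 |
| Multi-Object Tracking | TAO | GLEE-Lite | AssocA | 39.9 | #3 |
| Multi-Object Tracking | TAO | GLEE-Lite | ClsA | 24.1 | #3 |
| Open-World Instance Segmentation | UVO | GLEE-Pro | ARmask | 72.6 | #1 |
| Video Instance Segmentation | YouTube-VIS validation | GLEE-Pro | mask AP | 67.4 | #3 |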
