General Object Foundation Model for Images and Videos at Scale
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE performs detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios across a range of object perception tasks. Using a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying levels of supervision to form general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, enabling the model to solve various object-centric downstream tasks simultaneously while maintaining state-of-the-art performance. Trained on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization, tackling downstream tasks efficiently without task-specific adaptation. By incorporating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. GLEE can also be integrated into Large Language Models, serving as a foundation model that provides universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io .
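To make the described architecture concrete, the sketch below mirrors the interface the abstract outlines: an image encoder, a text encoder, and a visual prompter feed one query-based object decoder that emits boxes, masks, and open-vocabulary class scores in a single forward pass. This is a minimal PyTorch illustration under assumed shapes and module choices, not the released implementation; every name here (`GLEELikeModel`, `visual_prompter`, the toy backbone) is hypothetical.

```python
# A minimal sketch (assumptions, not the authors' code) of the unified
# interface described in the abstract: an image encoder, a text encoder,
# and a visual prompter feed one query-based object decoder that outputs
# boxes, masks, and open-vocabulary scores. All names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLEELikeModel(nn.Module):
    def __init__(self, dim=256, num_queries=100, vocab_size=1000):
        super().__init__()
        # Toy patchify conv stands in for a real backbone (ResNet/ViT).
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Embedding + mean pooling stands in for a CLIP-style text tower.
        self.text_encoder = nn.Embedding(vocab_size, dim)
        # Projects normalized box prompts (x1, y1, x2, y2) into feature space.
        self.visual_prompter = nn.Linear(4, dim)
        # Learned object queries decoded against image features yield the
        # general object-level representations shared by all tasks.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.box_head = nn.Linear(dim, 4)     # per-object box (cx, cy, w, h)
        self.mask_head = nn.Linear(dim, dim)  # dotted with pixel features -> mask logits

    def forward(self, image, text_ids=None, box_prompts=None):
        feats = self.image_encoder(image)                  # (B, C, H, W)
        memory = feats.flatten(2).transpose(1, 2)          # (B, H*W, C)
        B = image.shape[0]
        obj = self.decoder(self.queries.expand(B, -1, -1), memory)  # (B, Q, C)
        out = {
            "boxes": self.box_head(obj).sigmoid(),         # detection
            "masks": torch.einsum(                         # segmentation
                "bqc,bchw->bqhw", self.mask_head(obj), feats),
            "embeddings": obj,                             # reusable for tracking
        }
        if text_ids is not None:  # category names or referring expressions
            txt = F.normalize(self.text_encoder(text_ids).mean(dim=1), dim=-1)
            out["class_logits"] = F.normalize(obj, dim=-1) @ txt.T  # (B, Q, K)
        if box_prompts is not None:  # visual prompting: score queries vs. prompts
            out["prompt_scores"] = torch.einsum(
                "bqc,bpc->bqp", obj, self.visual_prompter(box_prompts))
        return out

# Example: one forward pass serves detection, open-vocabulary naming,
# and prompt-based identification simultaneously.
model = GLEELikeModel()
preds = model(
    torch.randn(1, 3, 224, 224),
    text_ids=torch.randint(0, 1000, (80, 4)),   # 80 tokenized class names
    box_prompts=torch.rand(1, 2, 4),            # 2 box prompts
)
print(preds["boxes"].shape, preds["masks"].shape, preds["class_logits"].shape)
```

In this unified view, detection and instance segmentation read off the box and mask heads, grounding ranks the class logits computed against encoded expressions, and tracking can associate objects across frames through the same per-object embeddings.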
Results from the Paper
Ranked #1 on Referring Expression Segmentation on Refer-YouTube-VOS (2021 public validation) (using extra training data)
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | HOTA (all) | 22.6 | #1 |
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | mAP (all) | 12.6 | #1 |
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | HOTA (com) | 36.4 | #1 |
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | mAP (com) | 18.9 | #1 |
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | HOTA (unc) | 19.1 | #1 |
| Long-tail Video Object Segmentation | BURST | GLEE-Lite | mAP (unc) | 11.0 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | HOTA (all) | 22.6 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | mAP (all) | 12.6 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | HOTA (com) | 36.4 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | mAP (com) | 18.9 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | HOTA (unc) | 19.1 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Lite | mAP (unc) | 11.0 | #3 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | HOTA (all) | 26.9 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | mAP (all) | 17.2 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | HOTA (com) | 38.8 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | mAP (com) | 23.7 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | HOTA (unc) | 23.9 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Plus | mAP (unc) | 15.5 | #2 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | HOTA (all) | 31.2 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | mAP (all) | 19.2 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | HOTA (com) | 48.7 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | mAP (com) | 24.8 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | HOTA (unc) | 26.9 | #1 |
| Long-tail Video Object Segmentation | BURST-val | GLEE-Pro | mAP (unc) | 17.7 | #1 |
| Instance Segmentation | COCO minival | GLEE-Lite | mask AP | 48.4 | #35 |
| Object Detection | COCO minival | GLEE-Lite | box AP | 55.0 | #50 |
| Object Detection | COCO minival | GLEE-Plus | box AP | 60.4 | #21 |
| Object Detection | COCO minival | GLEE-Pro | box AP | 62.0 | #14 |
| Instance Segmentation | COCO minival | GLEE-Pro | mask AP | 54.2 | #5 |
| Instance Segmentation | COCO minival | GLEE-Plus | mask AP | 53.0 | #9 |
| Instance Segmentation | COCO test-dev | GLEE-Pro | mask AP | 54.5 | #6 |
| Object Detection | COCO test-dev | GLEE-Plus | box mAP | 60.6 | #24 |
| Object Detection | COCO test-dev | GLEE-Lite | box mAP | 54.7 | #49 |
| Object Detection | COCO test-dev | GLEE-Pro | box mAP | 62.3 | #20 |
| Instance Segmentation | COCO test-dev | GLEE-Lite | mask AP | 48.3 | #27 |
| Instance Segmentation | COCO test-dev | GLEE-Plus | mask AP | 53.3 | #9 |
| Object Detection | LVIS v1.0 val | GLEE-Pro | box AP | 55.7 | #5 |
| Instance Segmentation | LVIS v1.0 val | GLEE-Pro | mask AP | 49.9 | #4 |
| Video Instance Segmentation | OVIS validation | GLEE-Pro | mask AP | 50.4 | #2 |
| Video Instance Segmentation | OVIS validation | GLEE-Pro | AP75 | 55.5 | #2 |
| Referring Expression Comprehension | RefCOCO+ | GLEE-Pro | Val | 82.6 | #6 |
| Referring Expression Comprehension | RefCOCO | GLEE-Pro | Val | 91.0 | #4 |
| Referring Expression Segmentation | RefCOCO | GLEE-Pro | IoU | 80.0 | #1 |
| Referring Expression Segmentation | RefCOCOg-val | GLEE-Pro | Overall IoU | 72.9 | #4 |
| Referring Expression Comprehension | RefCOCOg-val | GLEE-Pro | Accuracy | 86.4 | #5 |
| Referring Expression Segmentation | RefCOCO val | GLEE-Pro | Overall IoU | 80.0 | #4 |
| Referring Expression Segmentation | RefCOCO+ val | GLEE-Pro | Overall IoU | 69.6 | #6 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Plus | J&F | 67.7 | #2 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Plus | J | 65.6 | #2 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Plus | F | 69.7 | #2 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Pro | J&F | 70.6 | #1 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Pro | J | 68.2 | #1 |
| Referring Video Object Segmentation | Refer-YouTube-VOS | GLEE-Pro | F | 72.9 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | GLEE-Pro | J&F | 70.6 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | GLEE-Pro | J | 68.2 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | GLEE-Pro | F | 72.9 | #1 |
| Multi-Object Tracking | TAO | GLEE-Pro | TETA | 47.2 | #1 |
| Multi-Object Tracking | TAO | GLEE-Pro | LocA | 66.2 | #1 |
| Multi-Object Tracking | TAO | GLEE-Pro | AssocA | 46.2 | #1 |
| Multi-Object Tracking | TAO | GLEE-Pro | ClsA | 29.1 | #2 |
| Multi-Object Tracking | TAO | GLEE-Plus | TETA | 41.5 | #2 |
| Multi-Object Tracking | TAO | GLEE-Plus | LocA | 52.9 | #4 |
| Multi-Object Tracking | TAO | GLEE-Plus | AssocA | 40.9 | #2 |
| Multi-Object Tracking | TAO | GLEE-Plus | ClsA | 30.8 | #1 |
| Multi-Object Tracking | TAO | GLEE-Lite | TETA | 40.1 | #3 |
| Multi-Object Tracking | TAO | GLEE-Lite | LocA | 56.3 | #3 |
| Multi-Object Tracking | TAO | GLEE-Lite | AssocA | 39.9 | #3 |
| Multi-Object Tracking | TAO | GLEE-Lite | ClsA | 24.1 | #3 |
| Open-World Instance Segmentation | UVO | GLEE-Pro | ARmask | 72.6 | #1 |
| Video Instance Segmentation | YouTube-VIS validation | GLEE-Pro | mask AP | 67.4 | #3 |