Described Object Detection

8 papers with code • 1 benchmark • 1 dataset

Described Object Detection (DOD) detects all object instances in an image that match a flexible language description. It is a superset of Open-Vocabulary Object Detection (OVD) and Referring Expression Comprehension (REC): it expands OVD's category names to flexible language expressions, and it removes REC's assumption that the described object always exists in the image. Works related to DOD are tracked in the awesome-DOD list on GitHub.
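As a sketch of how the three query types differ, the hypothetical Python snippet below contrasts them. `detect` is a placeholder standing in for any DOD-capable model, not a real API:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def detect(image_path: str, expression: str) -> List[Box]:
    """Stand-in for a DOD-capable model: return every box whose instance
    matches `expression`. A real model would run inference here."""
    return []  # placeholder result

image = "street_scene.jpg"  # hypothetical input

# OVD: the query is a bare category name; many instances may match.
ovd_boxes = detect(image, "dog")

# REC: a free-form expression, but exactly one referred object is
# assumed to exist, so evaluation expects a single box.
rec_boxes = detect(image, "the dog chasing the frisbee")

# DOD: free-form expressions evaluated over all instances, where an
# empty result is valid (the described object may be absent).
dod_boxes = detect(image, "a dog that is not chasing anything")
```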

Most implemented papers

Grounded Language-Image Pre-training

microsoft/GLIP CVPR 2022

Unifying object detection and phrase grounding for pre-training brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantically rich.
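The self-training step can be pictured as a pseudo-labeling loop: a teacher grounding model labels web image-text pairs, and its confident boxes become extra supervision. The sketch below is illustrative only; `extract_phrases` and the teacher's `ground` method are hypothetical stand-ins for GLIP's actual pipeline:

```python
from typing import List, Tuple

def extract_phrases(caption: str) -> List[str]:
    # Crude stand-in: a real pipeline would parse noun phrases with NLP tools.
    return [p.strip() for p in caption.split(",") if p.strip()]

def self_train_round(teacher, image_text_pairs, score_thresh: float = 0.5):
    """Pseudo-label image-text pairs with a teacher grounding model.

    `teacher.ground(image, phrase)` is assumed to yield (box, score) pairs.
    """
    pseudo_labeled = []
    for image, caption in image_text_pairs:
        for phrase in extract_phrases(caption):
            for box, score in teacher.ground(image, phrase):
                if score >= score_thresh:
                    # Confident boxes become grounding labels, later merged
                    # with gold detection/grounding data for the next round.
                    pseudo_labeled.append((image, phrase, box))
    return pseudo_labeled
```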

Simple Open-Vocabulary Object Detection with Vision Transformers

google-research/scenic 12 May 2022

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification.

Described Object Detection: Liberating Object Detection with Flexible Expressions

charles-xie/awesome-described-object-detection NeurIPS 2023

In this paper, we advance OVD and REC to a more practical setting called Described Object Detection (DOD) by expanding category names to flexible language expressions for OVD and overcoming REC's limitation of only grounding objects that are assumed to exist in the image.

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

microsoft/fiber NeurIPS 2022

Vision-language (VL) pre-training has recently received considerable attention.

Universal Instance Perception as Object Discovery and Retrieval

MasterBin-IIAU/UNINEXT CVPR 2023

All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field has been split into multiple independent subtasks.

CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

tgxs002/cora CVPR 2023

We propose CORA, a DETR-style framework that adapts CLIP for Open-Vocabulary detection through Region prompting and Anchor pre-matching.

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

alpha-vllm/llama2-accessory 13 Nov 2023

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

open-mmlab/mmdetection 4 Jan 2024

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).
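For a sense of how this pipeline is used in practice, here is a minimal inference sketch with MMDetection's `DetInferencer`, assuming an MMDetection 3.x installation. The model alias is an assumption; check the mmdetection repo for the exact config and weight names:

```python
from mmdet.apis import DetInferencer

# Assumed config alias for an MM-Grounding-DINO checkpoint; verify the
# actual name in the open-mmlab/mmdetection model zoo before running.
inferencer = DetInferencer(
    model='grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det',
)

# OVD-style prompt: categories separated by ' . ' (the convention used by
# Grounding-DINO-style models); REC-style free text also works.
results = inferencer(
    'demo/demo.jpg',
    texts='bench . person .',
    out_dir='outputs',
)
```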