Generalized Decoding for Pixel, Image, and Language

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance compared with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
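The two-query design described above can be illustrated with a toy sketch. This is not the authors' implementation: the function `decode`, the 2-dimensional embeddings, and the use of a plain dot product followed by a sigmoid are all assumptions made for illustration, and the decoder that would refine the queries against image features is omitted entirely. The sketch only shows how one query vector could yield both a pixel-level output (a mask) and a token-level output (a concept match) when pixels, queries, and concepts share one embedding space.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode(queries, pixel_features, concept_embeddings):
    """Toy stand-in for a decoding step (hypothetical, not the paper's code).

    Each query is treated as an already-decoded output vector. It is compared
    with every pixel feature to form a mask, and with every concept embedding
    to pick the best-matching concept.
    """
    results = []
    for q in queries:
        # Pixel-level output: one foreground probability per pixel.
        mask = [sigmoid(dot(q, p)) for p in pixel_features]
        # Token-level output: similarity to each concept in the same space.
        scores = [dot(q, c) for c in concept_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append({"mask": mask, "concept": best})
    return results


# Hypothetical inputs: three "pixels" and two concepts in a 2-D space.
pixel_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
concept_embeddings = [[1.0, 0.0], [0.0, 1.0]]

generic_queries = [[2.0, -2.0]]  # stands in for a generic non-semantic query
text_queries = [[-2.0, 2.0]]     # stands in for a query induced from text

out = decode(generic_queries + text_queries, pixel_features, concept_embeddings)
for item in out:
    print(item["concept"], [round(m, 3) for m in item["mask"]])
```

Both query types pass through the same function here, which is the point of the sketch: the output type depends on what the query vector is compared against, not on a task-specific head.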

Venue: CVPR 2023

Results from the Paper


Ranked #4 on Instance Segmentation on ADE20K val (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Instance Segmentation | ADE20K val | X-Decoder (Davit-d5, Deform, single-scale, 1280x1280) | AP | 38.7 | #4 |
| | | | APS | 18.9 | #3 |
| | | | APM | 43.3 | #3 |
| | | | APL | 59.6 | #3 |
| Instance Segmentation | ADE20K val | X-Decoder (L) | AP | 35.8 | #7 |
| Panoptic Segmentation | ADE20K val | X-Decoder (L) | PQ | 49.6 | #12 |
| | | | AP | 35.8 | #9 |
| | | | mIoU | 58.1 | #6 |
| Panoptic Segmentation | ADE20K val | X-Decoder (Davit-d5, Deform, single-scale, 1280x1280) | PQ | 52.4 | #4 |
| | | | AP | 38.7 | #2 |
| | | | mIoU | 59.1 | #2 |
| Referring Expression Segmentation | RefCOCOg-val | X-Decoder (Davit-d5) | Overall IoU | 64.6 | #8 |
| Zero Shot Segmentation | Segmentation in the Wild | SGinW_Team (X-Decoder-B) | Mean AP | 27.7 | #10 |
| Zero Shot Segmentation | Segmentation in the Wild | SGinW_Team (X-Decoder-L) | Mean AP | 32.2 | #9 |
| Zero Shot Segmentation | Segmentation in the Wild | SGinW_Team (X-Decoder-T) | Mean AP | 22.6 | #12 |
| Zero Shot Segmentation | Segmentation in the Wild | SGinW_Team (X-Decoder-L-IN21K) | Mean AP | 26.6 | #11 |
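For readers unfamiliar with the IoU-based metrics in the table, the following is a minimal sketch of how mean IoU and overall IoU differ. It assumes the common definitions: mean IoU averages the per-item intersection-over-union ratios, while overall IoU divides the total intersection by the total union across all items. The benchmark's exact evaluation protocol is not given on this page, and the function names and toy masks below are hypothetical.

```python
def iou_counts(pred, target):
    """Return (intersection, union) pixel counts for two binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def mean_iou(pairs):
    """Average of per-item IoU ratios, skipping items with an empty union."""
    ious = []
    for pred, target in pairs:
        inter, union = iou_counts(pred, target)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


def overall_iou(pairs):
    """Total intersection divided by total union over all items."""
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = iou_counts(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union


# Toy flattened masks (1 = foreground), invented for illustration.
pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # intersection 1, union 2
    ([0, 0, 1, 1], [0, 1, 1, 1]),  # intersection 2, union 3
]

print(mean_iou(pairs))
print(overall_iou(pairs))
```

The two numbers differ because overall IoU weights each item by its area, so large regions dominate, whereas mean IoU treats every item equally.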
