PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

Instead of directly predicting pixel-level segmentation masks, this work formulates referring image segmentation as sequential polygon generation; the predicted polygons can later be converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input and autoregressively outputs a sequence of polygon vertices. For more accurate geometric localization, we propose a regression-based decoder that predicts precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets, respectively. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving a competitive 61.5% J&F on the Ref-DAVIS17 dataset.
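The two ideas in the abstract, converting a predicted polygon into a mask and the error introduced by coordinate quantization, can be illustrated with a small sketch. This is not the authors' implementation: the even-odd rasterizer, the uniform binning in `quantize`, and the toy 8x8 polygon are all assumptions chosen for illustration, and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def polygon_to_mask(vertices: Sequence[Point], height: int, width: int) -> List[List[int]]:
    """Rasterize one polygon with the even-odd rule, sampling pixel centers."""
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        y = row + 0.5  # pixel center
        for col in range(width):
            x = col + 0.5
            inside = False
            for i in range(n):
                x0, y0 = vertices[i]
                x1, y1 = vertices[(i + 1) % n]
                # Does this edge cross the horizontal line through the pixel center?
                if (y0 > y) != (y1 > y):
                    x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
                    if x < x_cross:
                        inside = not inside
            mask[row][col] = int(inside)
    return mask


def quantize(vertices: Sequence[Point], bins: int, size: float) -> List[Point]:
    """Snap each coordinate to the center of one of `bins` equal-width bins over [0, size).

    Assumption: generic uniform binning, used only to show the rounding error
    that a classification-style coordinate decoder would incur.
    """
    step = size / bins

    def snap(v: float) -> float:
        index = min(int(v / step), bins - 1)
        return (index + 0.5) * step

    return [(snap(x), snap(y)) for x, y in vertices]


# Toy polygon with floating-point vertices on an 8x8 image (illustrative values).
polygon = [(1.3, 1.2), (6.7, 1.6), (6.2, 6.8), (1.8, 6.4)]
mask = polygon_to_mask(polygon, height=8, width=8)

coarse = quantize(polygon, bins=4, size=8.0)
max_shift = max(
    max(abs(x - qx), abs(y - qy)) for (x, y), (qx, qy) in zip(polygon, coarse)
)
print("mask area:", sum(map(sum, mask)))
print("largest vertex shift caused by quantization:", max_shift)
```

With uniform binning, each coordinate can move by up to half a bin width; predicting floating-point coordinates directly, as the regression-based decoder does, avoids that rounding step altogether.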

CVPR 2023

Results from the Paper


 Ranked #1 on Referring Expression Segmentation on ReferIt (using extra training data)

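| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Segmentation | DAVIS 2017 (val) | PolyFormer | J&F 1st frame | 61.5 | #4 |
| Referring Expression Comprehension | RefCOCO+ | PolyFormer-L | Val | 84.98 | #4 |
| Referring Expression Comprehension | RefCOCO+ | PolyFormer-L | Test A | 89.77 | #3 |
| Referring Expression Comprehension | RefCOCO+ | PolyFormer-L | Test B | 77.97 | #4 |
| Referring Expression Comprehension | RefCOCO+ | PolyFormer-B | Val | 83.73 | #5 |
| Referring Expression Comprehension | RefCOCO+ | PolyFormer-B | Test A | 88.6 | #5 |
| Referring Expression Comprehension | RefCOCO+ | PolyFormer-B | Test B | 76.38 | #5 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-L | Val | 90.38 | #6 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-L | Test A | 92.89 | #4 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-L | Test B | 87.16 | #4 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-B | Val | 89.73 | #7 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-B | Test A | 91.73 | #6 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-B | Test B | 86.03 | #5 |
| Referring Expression Comprehension | RefCOCOg-test | PolyFormer-B | Accuracy | 84.96 | #6 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-B | Overall IoU | 69.05 | #5 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-B | Mean IoU | 69.88 | #2 |
| Referring Expression Comprehension | RefCOCOg-test | PolyFormer-L | Accuracy | 85.91 | #5 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-L | Overall IoU | 70.19 | #4 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-L | Mean IoU | 71.17 | #1 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-L | Overall IoU | 69.2 | #5 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-L | Mean IoU | 71.15 | #1 |
| Referring Expression Comprehension | RefCOCOg-val | PolyFormer-B | Accuracy | 84.46 | #7 |
| Referring Expression Comprehension | RefCOCOg-val | PolyFormer-L | Accuracy | 85.83 | #6 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-B | Overall IoU | 67.76 | #6 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-B | Mean IoU | 69.36 | #2 |
| Referring Expression Segmentation | RefCOCO testA | PolyFormer-B | Overall IoU | 76.64 | #7 |
| Referring Expression Segmentation | RefCOCO testA | PolyFormer-B | Mean IoU | 77.09 | #2 |
| Referring Expression Segmentation | RefCOCO testA | PolyFormer-L | Overall IoU | 78.29 | #5 |
| Referring Expression Segmentation | RefCOCO testA | PolyFormer-L | Mean IoU | 78.49 | #1 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-B | Overall IoU | 72.89 | #6 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-B | Mean IoU | 74.51 | #2 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-L | Overall IoU | 74.56 | #5 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-L | Mean IoU | 75.71 | #1 |
| Referring Expression Segmentation | RefCOCO testB | PolyFormer-B | Overall IoU | 71.06 | #4 |
| Referring Expression Segmentation | RefCOCO testB | PolyFormer-B | Mean IoU | 73.22 | #2 |
| Referring Expression Segmentation | RefCOCO testB | PolyFormer-L | Overall IoU | 73.25 | #3 |
| Referring Expression Segmentation | RefCOCO testB | PolyFormer-L | Mean IoU | 74.83 | #1 |
| Referring Expression Segmentation | RefCOCO+ testB | PolyFormer-B | Overall IoU | 59.33 | #6 |
| Referring Expression Segmentation | RefCOCO+ testB | PolyFormer-B | Mean IoU | 64.64 | #2 |
| Referring Expression Segmentation | RefCOCO+ testB | PolyFormer-L | Overall IoU | 61.87 | #5 |
| Referring Expression Segmentation | RefCOCO+ testB | PolyFormer-L | Mean IoU | 66.73 | #1 |
| Referring Expression Segmentation | RefCOCO val | PolyFormer-L | Overall IoU | 75.96 | #3 |
| Referring Expression Segmentation | RefCOCO val | PolyFormer-L | Overall IoU | 75.96 | #6 |
| Referring Expression Segmentation | RefCOCO val | PolyFormer-L | Mean IoU | 76.94 | #1 |
| Referring Expression Segmentation | RefCOCO val | PolyFormer-B | Overall IoU | 74.82 | #5 |
| Referring Expression Segmentation | RefCOCO val | PolyFormer-B | Overall IoU | 74.82 | #8 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-B | Overall IoU | 67.64 | #8 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-B | Mean IoU | 70.65 | #2 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-L | Overall IoU | 69.33 | #7 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-L | Mean IoU | 72.15 | #1 |
| Referring Expression Segmentation | ReferIt | PolyFormer-B | Overall IoU | 71.91 | #2 |
| Referring Expression Segmentation | ReferIt | PolyFormer-B | Mean IoU | 65.98 | #2 |
| Referring Expression Segmentation | ReferIt | PolyFormer-L | Overall IoU | 72.6 | #1 |
| Referring Expression Segmentation | ReferIt | PolyFormer-L | Mean IoU | 67.22 | #1 |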
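
The results above report both Overall IoU and Mean IoU, and the two can differ noticeably for the same model. As a rough sketch, assuming the definitions commonly used for referring segmentation (Overall IoU as total intersection over total union across all samples, Mean IoU as the average of per-sample IoUs), the difference can be reproduced on toy masks. The function names, the handling of empty unions, and the example masks are illustrative assumptions, not taken from the paper's evaluation code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count intersecting and united foreground pixels of two binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    return inter, union


def overall_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Cumulative intersection divided by cumulative union over all samples."""
    total_inter = total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Assumption: return 0.0 when every mask is empty.
    return total_inter / total_union if total_union else 0.0


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Average of per-sample IoU values."""
    scores: List[float] = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumption: a sample with an empty union scores 0.0.
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores)


# Toy data: a large object segmented perfectly, a small object segmented poorly.
large_target = [[1] * 4 for _ in range(4)]
large_pred = [[1] * 4 for _ in range(4)]
small_target = [[1, 0, 0, 0]] + [[0] * 4 for _ in range(3)]
small_pred = [[1, 1, 0, 0]] + [[0] * 4 for _ in range(3)]

pairs = [(large_pred, large_target), (small_pred, small_target)]
o_iou = overall_iou(pairs)  # (16 + 1) / (16 + 2)
m_iou = mean_iou(pairs)     # (1.0 + 0.5) / 2
print(f"Overall IoU: {o_iou:.4f}, Mean IoU: {m_iou:.4f}")
```

Under these definitions Overall IoU is dominated by large objects, while Mean IoU weights every sample equally, which is one reason a model's rank can differ between the two metrics.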

Methods