no code implementations • 19 Dec 2023 • Yong Xien Chng, Henry Zheng, Yizeng Han, Xuchong Qiu, Gao Huang
To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model fine-grained correspondence between masked textual tokens and their matching visual objects.
Ranked #2 on Referring Expression Segmentation on RefCOCO testB
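The text side of a masked-token auxiliary task like the Mask Grounding task described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `[MASK]` token, the function names `mask_tokens` and `masked_cross_entropy`, and the masking probability are all hypothetical. In a real model the logits would come from a network that also sees the image, which is what would tie masked words to visual objects; that network is not shown here.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def mask_tokens(tokens, mask_prob=0.3, seed=0):
    """Randomly replace tokens with MASK.

    Returns (masked, targets), where targets[i] holds the original
    token at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


def masked_cross_entropy(logits, targets, vocab):
    """Mean cross-entropy over masked positions only.

    logits[i] is a list of scores over vocab for position i. In an
    image-conditioned model these scores would depend on visual
    features (assumption; the model itself is omitted).
    """
    losses = []
    for scores, target in zip(logits, targets):
        if target is None:
            continue  # unmasked positions do not contribute to the loss
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[vocab.index(target)])
    return sum(losses) / len(losses) if losses else 0.0
```

For a referring expression such as `["the", "red", "cup", "on", "the", "left"]`, `mask_tokens` hides a random subset of words, and the loss is computed only where a word was hidden, so the model is rewarded solely for recovering the masked words.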