2 Jul 2023 • Meng Lan, Fu Rong, Zuchao Li, Wei Yu, Lefei Zhang
Moreover, a bidirectional vision-language interaction module is placed before the multimodal Transformer to strengthen the correlation between the visual and linguistic features, which helps the language queries decode more precise object information from the visual features and ultimately improves segmentation performance.
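
The excerpt does not specify how the interaction module is built, so the following is only a minimal sketch of one common way to realize bidirectional vision-language interaction: two single-head cross-attention passes, one in each direction, each followed by a residual connection. The function names (`cross_attention`, `bidirectional_interaction`), the single-head design, the residual connections, the random projection weights, and the toy feature sizes are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from `queries` to `context`."""
    q = queries @ w_q  # (n_q, d)
    k = context @ w_k  # (n_c, d)
    v = context @ w_v  # (n_c, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_q, n_c)
    return weights @ v, weights


def bidirectional_interaction(visual, linguistic, params):
    """Hypothetical interaction step: both directions read the original inputs."""
    # Language -> vision: each visual token gathers linguistic context.
    v_update, _ = cross_attention(visual, linguistic, *params["l2v"])
    # Vision -> language: each word token gathers visual context.
    l_update, _ = cross_attention(linguistic, visual, *params["v2l"])
    # Residual connections are an assumption, not a detail from the paper.
    return visual + v_update, linguistic + l_update


rng = np.random.default_rng(0)
d = 8                                  # toy feature dimension
visual = rng.normal(size=(16, d))      # e.g. a flattened 4x4 feature map
linguistic = rng.normal(size=(5, d))   # e.g. 5 word tokens

# Random projections stand in for learned weights (q, k, v per direction).
params = {
    name: [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]
    for name in ("l2v", "v2l")
}

visual_out, linguistic_out = bidirectional_interaction(visual, linguistic, params)
print(visual_out.shape, linguistic_out.shape)  # (16, 8) (5, 8)
```

In a setup like this, the enriched visual and linguistic features would then be passed to the multimodal Transformer, where the language queries attend to visual features that already carry linguistic context.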