no code implementations • 29 Dec 2023 • Jiaxi Wang, Wenhui Hu, Xueyang Liu, Beihu Wu, Yuting Qiu, YingYing Cai
EpmVG is based on a novel cross-modal distillation mechanism, which can effectively introduce the consistency information of images and texts in the pre-trained model, to reduce the domain gap existing in the backbone networks, thereby improving the performance of the model in the visual grounding task.