no code implementations • 23 Dec 2023 • Ning Wang, Jiajun Deng, Mingbo Jia
The proposed framework (1) allows the semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; (3) yields a general captioning model that can describe arbitrary image regions.
no code implementations • 18 Dec 2022 • Ning Wang, Jiangrong Xie, Hang Luo, Qinglin Cheng, Jihao Wu, Mingbo Jia, Linlin Li
On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator.
no code implementations • 4 Dec 2022 • Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li
Despite the remarkable progress of image captioning, existing captioners typically lack the ability to generate controllable image captions, e.g., describing the image in a rough or detailed manner, from a factual or emotional perspective, etc.