1 code implementation • 5 Dec 2023 • Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, LiMin Wang
Text-to-image foundation models are now widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks remain less explored for several reasons.
no code implementations • 26 Oct 2023 • Fengyuan Shi, LiMin Wang
Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost.
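The quadratic cost mentioned above can be illustrated directly: self-attention forms an n × n score matrix over n tokens, so its memory and compute grow quadratically with sequence length. A minimal NumPy sketch (sizes here are illustrative, not from the paper):

```python
import numpy as np

d = 32  # feature dimension per token
for n in (128, 256, 512):      # doubling the number of tokens...
    q = np.ones((n, d))        # queries
    k = np.ones((n, d))        # keys
    scores = q @ k.T           # attention score matrix is n x n
    print(n, scores.size)      # ...quadruples the score-matrix size
```

Each doubling of n multiplies `scores.size` by four, which is why reducing the number of tokens the transformer attends over saves both memory and computation.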
no code implementations • 17 Apr 2023 • Chen Xu, Haocheng Shen, Fengyuan Shi, Boheng Chen, Yixuan Liao, Xiaoxin Chen, LiMin Wang
To the best of our knowledge, we are the first to demonstrate that visual prompts in V-L models outperform previous prompt-based methods on downstream tasks.
no code implementations • 28 Sep 2022 • Fengyuan Shi, Ruopeng Gao, Weilin Huang, LiMin Wang
The sampling module selects these informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross-attention between image features and text features.
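The two modules above can be sketched in a few lines of NumPy. This is a hedged illustration, not the paper's implementation: the offsets are random here (they are learned in the real model), nearest-neighbour gathering stands in for interpolation, and the cross-attention is a single head with no learned projections. All sizes and variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8       # spatial size of the image feature map (illustrative)
d = 16          # feature dimension
n_samples = 4   # patches sampled per reference point
n_text = 5      # number of text tokens

img_feat = rng.standard_normal((H, W, d))    # image feature map
txt_feat = rng.standard_normal((n_text, d))  # text token features

# Sampling module: predict offsets relative to a reference point and
# gather image features at the offset locations.
ref = np.array([H // 2, W // 2])                    # reference point (row, col)
offsets = rng.integers(-2, 3, size=(n_samples, 2))  # stand-in for predicted offsets
locs = np.clip(ref + offsets, 0, [H - 1, W - 1])
sampled = img_feat[locs[:, 0], locs[:, 1]]          # (n_samples, d) informative patches

# Decoding module: cross-attention with sampled image patches as queries
# and text tokens as keys/values.
scores = sampled @ txt_feat.T / np.sqrt(d)          # (n_samples, n_text)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)             # softmax over text tokens
grounded = attn @ txt_feat                          # (n_samples, d) grounded info

print(grounded.shape)
```

The key point the sketch captures is that only `n_samples` patches (not the full H × W map) enter the cross-attention, which is what makes the sampling step pay off.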
no code implementations • 23 Sep 2021 • Fengyuan Shi, Weilin Huang, LiMin Wang
In this paper, we tackle the new problem of dense video grounding, simultaneously localizing multiple moments given a paragraph as input.