1 code implementation • 16 Aug 2023 • Qi Chen, Chaorui Deng, Zixiong Huang, BoWen Zhang, Mingkui Tan, Qi Wu
In this paper, we propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images using a pre-trained likelihood-based text-to-image generative model, i.e., a higher likelihood indicates better perceptual quality and better text-image alignment.
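The idea in this snippet — score each generated image by its likelihood under a pre-trained model and prefer higher-scoring images — can be illustrated with a minimal sketch. The `toy_log_likelihood` function below is a stand-in for the pre-trained likelihood-based text-to-image model, not the paper's actual scorer; the images are flattened pixel lists and the text conditioning is fake.

```python
# Illustrative sketch: ranking text-to-image outputs by model likelihood.
# A higher log-likelihood under the (stand-in) pre-trained model is taken
# to mean better perceptual quality and text-image alignment.

def toy_log_likelihood(image, text):
    """Stand-in density: rewards pixels close to a fake text-derived target."""
    target = (len(text) % 7) / 7.0  # fake text conditioning, not a real encoder
    return -sum((p - target) ** 2 for p in image) / len(image)

def rank_by_likelihood(candidates, text):
    """Sort generated images so the most likely (best-scoring) come first."""
    return sorted(candidates, key=lambda img: toy_log_likelihood(img, text),
                  reverse=True)

text = "a red apple on a table"
candidates = [
    [0.1, 0.1, 0.1],   # candidate image A (flattened pixels)
    [0.9, 0.8, 0.9],   # candidate image B
]
ranked = rank_by_likelihood(candidates, text)
```

In practice the scorer would be an actual likelihood-based generative model (e.g., an autoregressive model or a diffusion model's variational bound); only the ranking-by-likelihood logic carries over from this sketch.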
1 code implementation • ICCV 2023 • Chaorui Deng, Da Chen, Qi Wu
In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame.
Ranked #1 on Video Object Detection on ImageNet VID (MAP metric)
1 code implementation • ICCV 2023 • Chaorui Deng, Qi Chen, Pengda Qin, Da Chen, Qi Wu
In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain.
1 code implementation • 17 Sep 2022 • Qi Chen, Chaorui Deng, Qi Wu
Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and further use them to control the mode of the generated captions for existing image captioning models.
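The snippet above describes conditioning an existing captioner on learned "mode embeddings" to control caption style. The sketch below shows only the conditioning step, with made-up modes, dimensions, and a toy token embedding; it is not the paper's architecture.

```python
# Illustrative sketch: a set of "mode embeddings" is added to the decoder's
# token embeddings so a single captioning model can generate in different
# modes. The modes and vectors here are toy stand-ins (the paper learns them
# from the training caption corpus).

MODES = {"factual": [1.0, 0.0], "detailed": [0.0, 1.0]}

def embed(token):
    """Toy 2-d token embedding (a real model would use a learned table)."""
    return [float(len(token)), 1.0]

def decoder_input(tokens, mode):
    """Add the chosen mode embedding to every token embedding."""
    m = MODES[mode]
    return [[t + mm for t, mm in zip(embed(tok), m)] for tok in tokens]

x_factual = decoder_input(["a", "dog"], "factual")
x_detail = decoder_input(["a", "dog"], "detailed")
```

Because the two modes yield different decoder inputs for the same tokens, the downstream decoder can be steered toward different caption styles at inference time.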
no code implementations • CVPR 2021 • Chaorui Deng, ShiZhe Chen, Da Chen, Yuan He, Qi Wu
The dense video captioning task aims to detect and describe a sequence of events in a video for detailed and coherent storytelling.
no code implementations • 10 Oct 2020 • Yong Guo, Qingyao Wu, Chaorui Deng, Jian Chen, Mingkui Tan
Although the standard BN can significantly accelerate the training of DNNs and improve the generalization performance, it has several underlying limitations which may hamper the performance in both training and inference.
no code implementations • 19 Jul 2020 • Yanyuan Qiao, Chaorui Deng, Qi Wu
In this survey, we first examine the state of the art by comparing modern approaches to the problem.
1 code implementation • ECCV 2020 • Chaorui Deng, Ning Ding, Mingkui Tan, Qi Wu
We verify the merit of the proposed length level embedding on three models: two state-of-the-art (SOTA) autoregressive models with different types of decoder, as well as our proposed non-autoregressive model, to show its generalization ability.
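The "length level embedding" mentioned above can be sketched as follows: bucket the desired caption length into a discrete level, then add that level's embedding to the decoder input. The bucket boundaries and embedding values below are made up for illustration; only the bucketing-plus-embedding pattern reflects the snippet.

```python
# Illustrative sketch of a length level embedding for length-controllable
# captioning. Buckets and embedding dims are hypothetical stand-ins.

LEVEL_BOUNDS = [9, 14, 19]  # hypothetical max lengths per level

def length_level(target_len):
    """Map a target caption length to a discrete level index."""
    for level, bound in enumerate(LEVEL_BOUNDS):
        if target_len <= bound:
            return level
    return len(LEVEL_BOUNDS)  # longest bucket

# One (toy) embedding vector per level.
LEVEL_EMB = [[0.1 * i, 1.0 - 0.1 * i] for i in range(len(LEVEL_BOUNDS) + 1)]

def condition(token_emb, target_len):
    """Add the length-level embedding to a token embedding."""
    lvl = LEVEL_EMB[length_level(target_len)]
    return [t + l for t, l in zip(token_emb, lvl)]
```

Because the level is an input rather than a model change, the same embedding can be bolted onto autoregressive or non-autoregressive decoders alike, which matches the generalization claim in the snippet.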
42 code implementations • 20 Aug 2019 • Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection.
Ranked #1 on Object Detection on COCO test-dev (Hardware Burden metric)
no code implementations • 12 Feb 2019 • Chaorui Deng, Qi Wu, Guanghui Xu, Zhuliang Yu, Yanwu Xu, Kui Jia, Mingkui Tan
Most state-of-the-art methods in VG operate in a two-stage manner: in the first stage, an object detector generates a set of object proposals from the input image; the second stage is then formulated as a cross-modal matching problem that finds the best match between the language query and all region proposals.
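The two-stage pipeline described above can be sketched end to end: stage one proposes regions, stage two scores each proposal against the query and keeps the best match. The detector, text encoder, and similarity function below are toy stand-ins introduced only to make the pipeline concrete.

```python
# Illustrative sketch of two-stage visual grounding:
#   stage 1: an object detector produces region proposals (with features);
#   stage 2: cross-modal matching picks the proposal best matching the query.

def detect_proposals(image):
    """Stage 1 stand-in: candidate boxes with dummy 2-d region features."""
    return [("box_a", [0.9, 0.1]), ("box_b", [0.2, 0.8])]

def query_features(query):
    """Toy text encoder: fake 2-d feature derived from the query string."""
    return [1.0, 0.0] if "left" in query else [0.0, 1.0]

def ground(image, query):
    """Stage 2: dot-product matching between query and region features."""
    q = query_features(query)

    def score(item):
        _, feat = item
        return sum(f * qq for f, qq in zip(feat, q))

    return max(detect_proposals(image), key=score)[0]
```

The sketch also makes the pipeline's known weakness visible: if the stage-1 detector misses the referred object, no amount of stage-2 matching can recover it, which is the kind of limitation one-stage VG methods aim to address.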
no code implementations • CVPR 2018 • Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, Mingkui Tan
There are three main challenges in VG: 1) what is the main focus in a query; 2) how to understand an image; 3) how to locate an object.