1 code implementation • Findings (EMNLP) 2021 • Junjie Wang, Yatai Ji, Jiaqi Sun, Yujiu Yang, Tetsuya Sakai
On the other hand, trilinear models such as the CTI model efficiently exploit inter-modality information among answers, questions, and images, while ignoring intra-modality information.
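To make the inter- vs. intra-modality distinction concrete, here is a minimal, hypothetical sketch of a rank-factorized trilinear interaction over question, image, and answer features; the dimensions, layer names, and factorization are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TrilinearFusion(nn.Module):
    """Rank-factorized trilinear interaction (CTI-style sketch)."""
    def __init__(self, d_q=512, d_v=512, d_a=512, rank=32, d_out=256):
        super().__init__()
        self.Wq = nn.Linear(d_q, rank, bias=False)  # question projection
        self.Wv = nn.Linear(d_v, rank, bias=False)  # image projection
        self.Wa = nn.Linear(d_a, rank, bias=False)  # answer projection
        self.out = nn.Linear(rank, d_out)

    def forward(self, q, v, a):
        # The elementwise product couples all three modalities at once
        # (inter-modality); note there is no q-q, v-v, or a-a
        # (intra-modality) interaction term anywhere in this model.
        return self.out(self.Wq(q) * self.Wv(v) * self.Wa(a))

fusion = TrilinearFusion()
q, v, a = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
print(fusion(q, v, a).shape)  # torch.Size([4, 256])
```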
1 code implementation • 28 Mar 2024 • Sidi Yang, Binxiao Huang, Mingdeng Cao, Yatai Ji, Hanzhong Guo, Ngai Wong, Yujiu Yang
Existing enhancement models often optimize for high performance while falling short of reducing hardware inference time and power consumption, especially on edge devices with constrained computing and storage resources.
1 code implementation • 28 Nov 2023 • Yicheng Xiao, Zhuoyan Luo, Yong Liu, Yue Ma, Hengwei Bian, Yatai Ji, Yujiu Yang, Xiu Li
Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis.
Ranked #1 on Highlight Detection on YouTube Highlights
1 code implementation • 12 Jun 2023 • Rong-Cheng Tu, Yatai Ji, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu
MGSC (masked global semantic completion) promotes learning more representative global features, which strongly influence downstream-task performance, while MLTC (masked local token completion) reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data.
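A minimal sketch of what these two masked-completion objectives could look like, assuming fused multimodal tokens and a shared encoder; the masking scheme, mean pooling, and loss choices are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def global_semantic_completion_loss(encoder, tokens, mask):
    # tokens: (B, L, D) fused multimodal tokens; mask: (B, L) bool.
    masked = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    g_masked = encoder(masked).mean(dim=1)    # global feature, masked view
    with torch.no_grad():
        g_full = encoder(tokens).mean(dim=1)  # target: unmasked global feature
    return 1 - F.cosine_similarity(g_masked, g_full, dim=-1).mean()

def local_token_completion_loss(encoder, tokens, mask):
    masked = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = encoder(masked)                       # (B, L, D) reconstruction
    return F.mse_loss(recon[mask], tokens[mask])  # only masked positions

enc = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2)
toks = torch.randn(2, 16, 64)
msk = (torch.arange(16) % 4 == 0).expand(2, 16)  # mask every 4th token
loss = global_semantic_completion_loss(enc, toks, msk) \
     + local_token_completion_loss(enc, toks, msk)
```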
1 code implementation • 23 May 2023 • Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin
Based on a pre-trained conditional text-to-image (T2I) diffusion model, our model aims to generate videos conditioned on a sequence of control signals, such as edge or depth maps.
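To illustrate the conditioning pathway, below is a toy DDIM-style sampler in which every denoising step sees the text embedding and one control map per output frame; the `unet` stand-in, the noise schedule, and all shapes are assumptions, not the paper's pipeline.

```python
import torch

@torch.no_grad()
def sample_controlled_video(unet, text_emb, control_maps, steps=50):
    # control_maps: (F, C, H, W), one edge/depth map per output frame.
    n_frames, _, H, W = control_maps.shape
    x = torch.randn(n_frames, 4, H // 8, W // 8)     # per-frame latent noise
    alpha_bar = torch.linspace(0.02, 0.9999, steps)  # toy noise schedule
    for i in range(steps):
        a, a_next = alpha_bar[i], alpha_bar[min(i + 1, steps - 1)]
        # The predictor is conditioned on text AND per-frame control maps,
        # so each frame is steered by its own edge/depth signal.
        eps = unet(x, i, text_emb, control_maps)
        x0 = (x - (1 - a).sqrt() * eps) / a.sqrt()   # predicted clean latent
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
    return x

unet = lambda x, t, txt, ctrl: torch.zeros_like(x)   # dummy noise predictor
frames = sample_controlled_video(unet, torch.randn(77, 768),
                                 torch.randn(8, 3, 64, 64), steps=10)
print(frames.shape)  # torch.Size([8, 4, 8, 8])
```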
no code implementations • 9 Dec 2022 • Xinzhe Ni, Yong Liu, Hao Wen, Yatai Ji, Jing Xiao, Yujiu Yang
Then, in the visual flow, visual prototypes are computed by, for example, a Temporal-Relational CrossTransformer (TRX) module.
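The sketch below shows the general shape of query-conditioned class prototypes in the spirit of TRX; the real module attends over ordered frame tuples, which this toy version omits, and all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def class_prototypes(query, support, support_labels, n_way):
    # query: (D,) query-video feature; support: (N, D) support features.
    protos = []
    for c in range(n_way):
        s = support[support_labels == c]          # (K, D) class-c supports
        # Attention of class-c supports against the query yields a
        # query-specific prototype for that class.
        attn = F.softmax(s @ query / s.shape[-1] ** 0.5, dim=0)
        protos.append(attn @ s)
    return torch.stack(protos)                    # (n_way, D)

q = torch.randn(64)
sup = torch.randn(10, 64)
lbl = torch.arange(10) % 5                        # 2 supports per class
print(class_prototypes(q, sup, lbl, 5).shape)     # torch.Size([5, 64])
```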
1 code implementation • CVPR 2023 • Yatai Ji, Rong-Cheng Tu, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities; a generic alignment objective is sketched after this entry.
Ranked #8 on Zero-Shot Video Retrieval on LSMDC
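Cross-modal alignment of this kind is commonly trained with a bidirectional InfoNCE objective; the sketch below is the standard CLIP-style formulation, not necessarily the loss used in this paper, and the temperature and embedding dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(img.shape[0])     # matched pairs on the diagonal
    # Symmetric cross-entropy: image-to-text and text-to-image retrieval.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```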
1 code implementation • CVPR 2023 • Yatai Ji, Junjie Wang, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, Yujiu Yang
Multimodal semantic understanding often has to deal with uncertainty: an observed message may refer to multiple plausible targets.
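One common way to model such ambiguity is to embed an input as a distribution rather than a point, so that a single message can cover several plausible targets; the layer names and reparameterized sampling below are a hedged sketch, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ProbabilisticEmbedding(nn.Module):
    """Map a feature to a Gaussian in embedding space and draw samples."""
    def __init__(self, d_in=512, d_emb=256):
        super().__init__()
        self.mu = nn.Linear(d_in, d_emb)      # distribution mean
        self.logvar = nn.Linear(d_in, d_emb)  # log-variance (uncertainty)

    def forward(self, x, n_samples=5):
        mu, logvar = self.mu(x), self.logvar(x)
        std = (0.5 * logvar).exp()
        # Each reparameterized sample is one candidate interpretation
        # of the ambiguous input.
        eps = torch.randn(n_samples, *mu.shape)
        return mu + std * eps                 # (n_samples, B, d_emb)

pe = ProbabilisticEmbedding()
print(pe(torch.randn(4, 512)).shape)          # torch.Size([5, 4, 256])
```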