1 code implementation • 14 Nov 2023 • Chen Li, Yixiao Ge, Dian Li, Ying Shan
Instruction tuning is a crucial supervised training phase in Large Language Models (LLMs), aiming to enhance the LLM's ability to generalize instruction execution and adapt to user preferences.
1 code implementation • 6 Apr 2023 • Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, Ying Shan
Given a new entity that needs tagging for distribution, TagGPT introduces two alternative options for zero-shot tagging, i.e., a generative method with late semantic matching with the tag set, and another selective method with early matching in prompts.
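The two options above can be sketched as follows. This is a minimal, self-contained illustration only: the `jaccard` token-overlap score stands in for real embedding similarity, and the candidate tags are assumed to come from an LLM, which is not modeled here; none of these helper names are from the TagGPT paper.

```python
def jaccard(a, b):
    # Toy similarity: token-set Jaccard overlap (stand-in for embedding cosine).
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def generative_late_matching(generated_tags, tag_set):
    # Generative option: the model first generates free-form candidate tags,
    # then each candidate is matched *late* to its closest entry in the tag set.
    return [max(tag_set, key=lambda t: jaccard(g, t)) for g in generated_tags]

def selective_early_matching(entity_text, tag_set, top_k=2):
    # Selective option: the tag set is exposed *early* (in the prompt) and the
    # model picks from it directly; simulated here by ranking tags against the text.
    ranked = sorted(tag_set, key=lambda t: jaccard(entity_text, t), reverse=True)
    return ranked[:top_k]
```

For example, with a tag set `["funny cat video", "cooking tutorial", "travel vlog"]`, a generated candidate like `"cute cat"` late-matches to `"funny cat video"`, while the selective path ranks the tag set directly against the entity's text.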
1 code implementation • CVPR 2023 • Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, XiaoHu Qie, Xinggang Wang
Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training.
1 code implementation • 19 May 2022 • Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, XiaoHu Qie
Throughout the development of self-supervised visual representation learning, from contrastive learning to masked image modeling (MIM), the essence has not changed significantly: the question remains how to design proper pretext tasks for vision dictionary look-up.
no code implementations • 30 Mar 2022 • Rui Qian, Weiyao Lin, John See, Dian Li
The major reason is that the positive pairs, i.e., different clips sampled from the same video, have limited temporal receptive field, and usually share similar background but differ in motions.
2 code implementations • CVPR 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo
As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
Ranked #8 on Zero-Shot Video Retrieval on MSVD
no code implementations • CVPR 2022 • Zhaoyang Zeng, Yongsheng Luo, Zhenhua Liu, Fengyun Rao, Dian Li, Weidong Guo, Zhen Wen
In this paper, we propose the Tencent-MVSE dataset, which is the first benchmark dataset for the multi-modal video similarity evaluation task.
1 code implementation • 9 Dec 2021 • Lu Qi, Jason Kuen, Zhe Lin, Jiuxiang Gu, Fengyun Rao, Dian Li, Weidong Guo, Zhen Wen, Ming-Hsuan Yang, Jiaya Jia
To improve instance-level detection/segmentation performance, existing self-supervised and semi-supervised methods extract either task-unrelated or task-specific training signals from unlabeled data.
no code implementations • 13 Oct 2021 • Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li
Notably, our model is trained only on the MSR-VTT dataset.
no code implementations • 11 Oct 2021 • Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Fengyun Rao, Dian Li
On the proposed CLIP4Caption++, we employ an advanced encoder-decoder model architecture, X-Transformer, as our main framework and make the following improvements: 1) we utilize three strong pre-trained CLIP models to extract the text-related appearance visual features.
1 code implementation • ICCV 2021 • Rui Qian, Yuxi Li, Huabin Liu, John See, Shuangrui Ding, Xian Liu, Dian Li, Weiyao Lin
The crux of self-supervised video representation learning is to build general features from unlabeled videos.