no code implementations • 21 Feb 2024 • Jiawei Liang, Siyuan Liang, Man Luo, Aishan Liu, Dongchen Han, Ee-Chien Chang, Xiaochun Cao
Nevertheless, the frozen visual encoder in autoregressive VLMs imposes constraints on the learning of conventional image triggers.
no code implementations • 15 Dec 2023 • Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang
Generalized Referring Expression Segmentation (GRES) extends classic RES so that a single expression can refer to multiple objects, or to an empty target that is absent from the image.
2 code implementations • 14 Dec 2023 • Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang
Specifically, the Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module.
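The quadruple $(Q, A, K, V)$ can be illustrated with a minimal sketch. It assumes, beyond what the sentence above states, that the agent tokens $A$ first act as queries that aggregate information from $K$ and $V$, and that the original queries $Q$ then attend to the agents to read that summary back. The helper names (`matmul`, `softmax_rows`, `attention`, `agent_attention`) are hypothetical, and any further components a full implementation may use are omitted.

```python
import math

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*x)]

def softmax_rows(x):
    """Row-wise softmax, subtracting the row max for numerical stability."""
    out = []
    for row in x:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Standard scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), v)

def agent_attention(q, a, k, v):
    """Two-stage attention through agent tokens (assumed structure)."""
    # Stage 1 (assumption): agents query the keys/values and summarize them.
    agent_values = attention(a, k, v)      # shape: (n_agents, d_v)
    # Stage 2 (assumption): the original queries read from the agent summary.
    return attention(q, a, agent_values)   # shape: (n_queries, d_v)
```

With n query/key tokens and m agent tokens, each stage builds an n-by-m or m-by-n score matrix rather than an n-by-n one, so a small m keeps the score matrices small.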
no code implementations • 7 Dec 2023 • Dongchen Han, Xiaojun Jia, Yang Bai, Jindong Gu, Yang Liu, Xiaochun Cao
Investigating the generation of high-transferability adversarial examples is crucial for uncovering VLP models' vulnerabilities in practical scenarios.
1 code implementation • ICCV 2023 • Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang
The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks.
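To make the quadratic growth concrete, the following sketch counts the pairwise attention scores a single self-attention head computes over image patches. The patch size of 16 and the image sizes are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def num_tokens(height, width, patch_size):
    """Number of non-overlapping patches (tokens) in an image."""
    return (height // patch_size) * (width // patch_size)

def attention_score_entries(n_tokens):
    """Self-attention scores every token against every token: n * n entries."""
    return n_tokens * n_tokens

if __name__ == "__main__":
    # Doubling the image side quadruples the tokens and multiplies
    # the number of attention scores by sixteen.
    for side in (224, 448, 896):
        n = num_tokens(side, side, 16)
        print(side, n, attention_score_entries(n))
```

Because the token count already grows with image area, the score matrix grows with the square of the area, which is why high-resolution vision inputs are costly for standard self-attention.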
1 code implementation • ICCV 2023 • Yizeng Han, Dongchen Han, Zeyu Liu, Yulin Wang, Xuran Pan, Yifan Pu, Chao Deng, Junlan Feng, Shiji Song, Gao Huang
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
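For readers unfamiliar with early exits, here is a generic sketch of confidence-thresholded early-exit inference. It is not the paper's implementation: the function name `early_exit_predict`, the threshold rule, and the representation of each exit as a callable returning class probabilities are all assumptions made for illustration.

```python
def early_exit_predict(exits, features, threshold=0.9):
    """Return (exit_index, class_index) from the first confident exit.

    `exits` is an ordered list of callables, each mapping `features` to a
    list of class probabilities (hypothetical interface). The last exit
    always answers, regardless of its confidence.
    """
    for index, classifier in enumerate(exits):
        probs = classifier(features)
        confidence = max(probs)
        is_last = index == len(exits) - 1
        if confidence >= threshold or is_last:
            return index, probs.index(confidence)
    raise ValueError("exits must be non-empty")
```

A usage example with three stand-in exits: the first is not confident enough, so inference stops at the second.

```python
exits = [
    lambda f: [0.6, 0.4],
    lambda f: [0.05, 0.95],
    lambda f: [0.5, 0.5],
]
result = early_exit_predict(exits, features=None, threshold=0.9)
```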
no code implementations • 17 Oct 2022 • Xuran Pan, Tianzhu Ye, Dongchen Han, Shiji Song, Gao Huang
Recent years have witnessed the rapid development of large-scale pre-training frameworks that extract multi-modal representations in a unified form and achieve promising performance when transferred to downstream tasks.
1 code implementation • CVPR 2022 • Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, Gao Huang
Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images, and then language queries for these objects are obtained in an unsupervised fashion with a pseudo-query generation module.