1 code implementation • 10 May 2024 • Ting Liu, Xuyang Liu, Siteng Huang, Honggang Chen, Quanjun Yin, Long Qin, Donglin Wang, Yue Hu
Specifically, we propose DARA, a novel PETL method comprising Domain-aware Adapters (DA Adapters) and Relation-aware Adapters (RA Adapters) for visual grounding (VG).
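The excerpt names adapters as the core PETL mechanism but gives no architectural details. As a hedged point of reference, here is a minimal PyTorch sketch of the generic bottleneck adapter that methods like DARA specialize; the class name, dimensions, and zero-initialization choice are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: down-project, nonlinearity, up-project,
    added residually to a frozen backbone's hidden states."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Only the adapter's parameters are trained; the backbone stays frozen.
adapter = BottleneckAdapter()
hidden = torch.randn(2, 196, 768)  # (batch, tokens, hidden_dim)
print(adapter(hidden).shape)       # torch.Size([2, 196, 768])
```

Zero-initializing the up-projection makes each adapter start as an identity function, so inserting it does not perturb the pre-trained model before tuning begins.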
1 code implementation • 21 Mar 2024 • Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang
In recent years, the application of multimodal large language models (MLLMs) in various fields has achieved remarkable success.
1 code implementation • 15 Dec 2023 • Shuanghao Bai, Min Zhang, Wanqi Zhou, Siteng Huang, Zhirong Luan, Donglin Wang, Badong Chen
Therefore, in this paper, we first experimentally demonstrate that unsupervised-trained VLMs can significantly reduce the distribution discrepancy between the source and target domains, thereby improving the performance of unsupervised domain adaptation (UDA).
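The excerpt does not say how the distribution discrepancy is quantified; a common proxy is the maximum mean discrepancy (MMD) between source and target feature sets. The sketch below, using an RBF kernel and random stand-in features, is an illustrative assumption rather than the paper's protocol.

```python
import torch

def gaussian_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate with an RBF kernel between two feature sets."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Stand-ins for features from a frozen VLM image encoder, e.g. CLIP.
source_feats = torch.randn(128, 512)
target_feats = torch.randn(128, 512) + 0.5  # shifted to mimic a domain gap
print(gaussian_mmd(source_feats, target_feats).item())
```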
no code implementations • 27 Nov 2023 • Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang
Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle to decouple actions from contextual features such as appearance.
no code implementations • 27 Nov 2023 • Biao Gong, Siteng Huang, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu
To align the generated image with layout instructions, we present SimM, a training-free layout calibration system that intervenes in the generative process on the fly during inference.
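How SimM intervenes is not specified in this excerpt. Training-free layout control is often realized by reweighting the denoiser's cross-attention maps so that an object's text token attends more strongly inside its target box; the function below is a hedged sketch of that general idea, with every name hypothetical rather than taken from SimM.

```python
import torch

def calibrate_attention(attn: torch.Tensor, box_mask: torch.Tensor,
                        token_idx: int, boost: float = 2.0) -> torch.Tensor:
    """attn: (heads, pixels, tokens) cross-attention weights, softmaxed over
    tokens. Boost attention to `token_idx` at pixels inside the layout box
    (boolean `box_mask` over pixels), then renormalize each row."""
    attn = attn.clone()
    attn[:, box_mask, token_idx] *= boost
    return attn / attn.sum(dim=-1, keepdim=True)

# Toy shapes: 8 heads, 64 pixel queries, 10 text tokens; box = first 16 pixels.
attn = torch.softmax(torch.randn(8, 64, 10), dim=-1)
box_mask = torch.zeros(64, dtype=torch.bool)
box_mask[:16] = True
print(calibrate_attention(attn, box_mask, token_idx=3).shape)  # (8, 64, 10)
```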
1 code implementation • 3 Sep 2023 • Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang
Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training.
1 code implementation • 27 Mar 2023 • Siteng Huang, Biao Gong, Yutong Feng, Min Zhang, Yiliang Lv, Donglin Wang
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs.
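To make the described prompt construction concrete, the sketch below builds one soft prompt per (state, object) pair by prepending shared learnable context vectors to the pair's word embeddings, in the style of CoOp-like prompt tuning; the class name and dimensions are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CompositionPrompt(nn.Module):
    """Soft prompt for a state-object composition: learnable context
    vectors followed by the two primitives' word embeddings."""
    def __init__(self, n_ctx: int = 4, embed_dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, state_emb: torch.Tensor, obj_emb: torch.Tensor) -> torch.Tensor:
        # Returns an (n_ctx + 2, embed_dim) token sequence for the text encoder.
        return torch.cat([self.ctx, state_emb[None], obj_emb[None]], dim=0)

prompt = CompositionPrompt()
print(prompt(torch.randn(512), torch.randn(512)).shape)  # torch.Size([6, 512])
```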
1 code implementation • CVPR 2023 • Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, Donglin Wang
Many recent studies leverage pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings a huge computational burden with many more parameters but also leads to knowledge forgetting from the upstream model.
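For orientation, the lightweight alternative to such heavy modules is to keep CLIP frozen, mean-pool per-frame image embeddings into a video embedding, and rank videos by cosine similarity to the text embedding. This minimal sketch assumes features already extracted by CLIP-style encoders and is not the paper's method.

```python
import torch
import torch.nn.functional as F

def video_text_similarity(frame_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """frame_feats: (n_frames, dim) from a frozen image encoder;
    text_feat: (dim,) from the matching text encoder."""
    video_feat = F.normalize(frame_feats.mean(dim=0), dim=-1)
    return video_feat @ F.normalize(text_feat, dim=-1)  # cosine similarity

print(video_text_similarity(torch.randn(12, 512), torch.randn(512)).item())
```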
1 code implementation • 22 Aug 2022 • Siteng Huang, Qiyao Wei, Donglin Wang
Compositional zero-shot learning (CZSL) refers to recognizing unseen compositions of known visual primitives, which is an essential ability for artificial intelligence systems to learn and understand the world.
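A toy instance of the setting, with purely illustrative primitives: the model trains on some state-object pairs and must recognize held-out recombinations of the same primitives.

```python
# Toy CZSL split: primitives are shared, but test compositions are unseen.
states = {"red", "sliced"}
objects = {"apple", "tomato"}

seen = {("red", "apple"), ("sliced", "tomato")}
unseen = {(s, o) for s in states for o in objects} - seen
print(unseen)  # e.g. {('red', 'tomato'), ('sliced', 'apple')}
```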
1 code implementation • 14 Jul 2022 • Min Zhang, Siteng Huang, Wenbin Li, Donglin Wang
To solve this problem, we present a plug-in Hierarchical Tree Structure-aware (HTS) method, which not only learns the relationship between FSL and pretext tasks but, more importantly, can adaptively select and aggregate feature representations generated by pretext tasks to maximize the performance of FSL tasks.
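The excerpt describes adaptive selection and aggregation of pretext-task features; a generic way to express that is a learned gate that softmax-weights each task's feature vector before summing. The sketch below shows that generic mechanism, not HTS's tree-structured module.

```python
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    """Softmax-gated weighted sum over feature vectors from K pretext tasks."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, task_feats: torch.Tensor) -> torch.Tensor:
        # task_feats: (K, dim) -> one scalar weight per task, then weighted sum
        weights = torch.softmax(self.gate(task_feats), dim=0)  # (K, 1)
        return (weights * task_feats).sum(dim=0)

agg = GatedAggregation(dim=256)
print(agg(torch.randn(3, 256)).shape)  # torch.Size([256])
```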
no code implementations • 29 Sep 2021 • Siteng Huang, Qiyao Wei, Donglin Wang
To narrow the considerable gap between artificial and human intelligence, we propose a new task, reference-limited compositional learning (RLCL), which reproduces three core challenges to mimic human perception: compositional learning, few-shot learning, and few referential compositions.
no code implementations • CVPR 2021 • Zhengyu Chen, Jixie Ge, Heshen Zhan, Siteng Huang, Donglin Wang
While few-shot learning (FSL) aims for rapid generalization to new concepts with little supervision, self-supervised learning (SSL) constructs supervisory signals directly computed from unlabeled data.
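As one canonical example of such a self-supervisory signal (chosen here for illustration, not because the paper uses it): rotation prediction rotates each unlabeled image by a multiple of 90 degrees and uses the rotation index as a free label.

```python
import torch

def make_rotation_batch(images: torch.Tensor):
    """images: (B, C, H, W). Returns four rotated copies of each image and
    the rotation index (0-3) as a supervisory label derived from the data."""
    rotated = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    x = torch.cat(rotated, dim=0)
    y = torch.arange(4).repeat_interleave(images.size(0))
    return x, y

x, y = make_rotation_batch(torch.randn(8, 3, 32, 32))
print(x.shape, y.shape)  # torch.Size([32, 3, 32, 32]) torch.Size([32])
```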
1 code implementation • 10 Sep 2020 • Siteng Huang, Min Zhang, Yachen Kang, Donglin Wang
However, these approaches only augment the representations of samples with available semantics while ignoring the query set, which forgoes potential improvement and may lead to a shift between the combined-modality representation and the pure-visual representation.
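The semantics-augmented representations the excerpt critiques are often built AM3-style: project the class's word embedding into the visual space and mix it with the visual prototype via a learned convex coefficient. The sketch below illustrates that prior pattern under assumed dimensions; it is not this paper's proposed module.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Convex combination of a visual prototype and a projected semantic
    embedding, with the mixing coefficient predicted from both."""
    def __init__(self, vis_dim: int = 640, sem_dim: int = 300):
        super().__init__()
        self.proj = nn.Linear(sem_dim, vis_dim)
        self.mix = nn.Sequential(nn.Linear(2 * vis_dim, 1), nn.Sigmoid())

    def forward(self, proto: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        sem = self.proj(sem)
        lam = self.mix(torch.cat([proto, sem], dim=-1))  # in (0, 1)
        return lam * proto + (1 - lam) * sem

fusion = AdaptiveFusion()
print(fusion(torch.randn(5, 640), torch.randn(5, 300)).shape)  # (5, 640)
```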