1 code implementation • 15 Apr 2024 • Yaohui Li, Qifeng Zhou, Haoxing Chen, Jianbing Zhang, Xinyu Dai, Hao Zhou
Few-shot learning aims to further enhance the transfer capability of CLIP by providing a few images per class, known as 'few shots'.
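As a rough illustration of the few-shot setting, the sketch below classifies a query by comparing it to per-class prototypes averaged from a handful of labeled examples. The toy 2-D vectors stand in for image embeddings (e.g. CLIP features); they and the class names are hypothetical, not from the paper.

```python
# Prototype-based few-shot classification over precomputed embeddings.
# Each class prototype is the mean of its few shots; a query is
# assigned to the nearest prototype by cosine similarity.
import math

def mean_vec(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(query, shots_by_class):
    prototypes = {c: mean_vec(v) for c, v in shots_by_class.items()}
    return max(prototypes, key=lambda c: cosine(query, prototypes[c]))

# Two classes, two "shots" each (toy embeddings).
shots = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "dog": [[0.1, 1.0], [0.2, 0.8]],
}
print(classify([0.95, 0.15], shots))  # -> cat
```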
no code implementations • 23 Mar 2024 • Lingxing Kong, Yougang Chu, Zheng Ma, Jianbing Zhang, Liang He, Jiajun Chen
Relation extraction is a critical task in the field of natural language processing with numerous real-world applications.
no code implementations • 18 Feb 2024 • Zheng Ma, Changxin Wang, Yawen Ouyang, Fei Zhao, Jianbing Zhang, ShuJian Huang, Jiajun Chen
If a certain metric has flaws, it will be exploited by the model and reflected in the generated sentences.
no code implementations • 15 Feb 2024 • Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, WeiHao Chen, Chunhui Li, Jianbing Zhang, Xinyu Dai
Multimodal large language models (MLLMs) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination.
1 code implementation • 17 Jan 2024 • Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu
In our preliminary study, we have discovered a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elements based on instructions.
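To make the GUI grounding task concrete, here is a minimal toy: given a natural-language instruction, pick the screen element whose label best matches it and return its bounding box. A visual GUI agent grounds directly on screenshot pixels; the element list, labels, and coordinates below are stand-ins for illustration only.

```python
# Toy GUI grounding: select the element whose label shares the most
# words with the instruction, and return that element's bounding box.
def ground(instruction, elements):
    words = set(instruction.lower().split())
    def overlap(el):
        return len(words & set(el["label"].lower().split()))
    return max(elements, key=overlap)["bbox"]

# Hypothetical accessibility-tree elements with (x1, y1, x2, y2) boxes.
elements = [
    {"label": "search button", "bbox": (10, 10, 40, 30)},
    {"label": "settings menu", "bbox": (50, 10, 80, 30)},
]
print(ground("click the search button", elements))  # -> (10, 10, 40, 30)
```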
1 code implementation • 23 Oct 2023 • Fei Zhao, Chunhui Li, Zhen Wu, Yawen Ouyang, Jianbing Zhang, Xinyu Dai
In this work, we focus on whether the negative impact of noisy images can be reduced without modifying the data.
1 code implementation • 15 Oct 2023 • Zheng Ma, Changxin Wang, Bo Huang, Zixuan Zhu, Jianbing Zhang
Several models have adopted a non-autoregressive decoding manner to speed up caption generation.
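The autoregressive/non-autoregressive contrast can be sketched with stub predictors: AR decoding emits one token per step conditioned on the growing prefix, while NAR decoding fills every position independently in a single parallel step. The stub "model" below is purely illustrative; only the dependency structure matters.

```python
# Autoregressive vs. non-autoregressive decoding (dependency structure only).
def ar_decode(predict_next, length):
    seq = []
    for _ in range(length):              # 'length' sequential steps
        seq.append(predict_next(tuple(seq)))
    return seq

def nar_decode(predict_pos, length):
    # All positions predicted independently -> one parallel step.
    return [predict_pos(i) for i in range(length)]

vocab = ["a", "cat", "sits"]
print(ar_decode(lambda prefix: vocab[len(prefix)], 3))  # -> ['a', 'cat', 'sits']
print(nar_decode(lambda i: vocab[i], 3))                # -> ['a', 'cat', 'sits']
```

The speedup comes from the second form: without the prefix dependency, all positions can be computed at once on parallel hardware, at the cost of modeling inter-token dependencies.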
1 code implementation • 9 Oct 2023 • Shangyu Xing, Fei Zhao, Zhen Wu, Chunhui Li, Jianbing Zhang, Xinyu Dai
Multimodal Entity Linking (MEL) is a task that aims to link ambiguous mentions within multimodal contexts to referential entities in a multimodal knowledge base.
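A schematic version of MEL: score each knowledge-base candidate by combining a text-similarity term and an image-similarity term, then link the mention to the top-scoring entity. The similarity functions, the weight `alpha`, and the toy "Jaguar" knowledge base are all illustrative assumptions, not the paper's model.

```python
# Schematic multimodal entity linking over a toy knowledge base.
def text_sim(a, b):
    # Crude token-overlap (Jaccard) similarity between two strings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def img_sim(a, b):
    # Stand-in for visual similarity: 1.0 iff the toy image ids match.
    return 1.0 if a == b else 0.0

def link(mention, kb, alpha=0.5):
    def score(entity):
        return (alpha * text_sim(mention["text"], entity["name"])
                + (1 - alpha) * img_sim(mention["img"], entity["img"]))
    return max(kb, key=score)["name"]

kb = [
    {"name": "Jaguar (animal)", "img": "cat_photo"},
    {"name": "Jaguar (car)", "img": "car_photo"},
]
# The text alone is ambiguous; the image disambiguates the mention.
mention = {"text": "a jaguar speeding down the road", "img": "car_photo"}
print(link(mention, kb))  # -> Jaguar (car)
```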
1 code implementation • 6 Aug 2023 • Zheng Ma, Mianzhi Pan, Wenhan Wu, Kanzhi Cheng, Jianbing Zhang, ShuJian Huang, Jiajun Chen
Experiments on our proposed datasets demonstrate that popular VLMs underperform in the food domain compared with their performance in the general domain.
1 code implementation • 2 Aug 2023 • Kanzhi Cheng, Zheng Ma, Shi Zong, Jianbing Zhang, Xinyu Dai, Jiajun Chen
Generating visually grounded image captions with specific linguistic styles using unpaired stylistic corpora is a challenging task, especially since we expect stylized captions with a wide variety of stylistic patterns.
1 code implementation • 2 Aug 2023 • Kanzhi Cheng, Wenpo Song, Zheng Ma, Wenhao Zhu, Zixuan Zhu, Jianbing Zhang
Considering that Vision-Language Pre-Training (VLP) models master massive such knowledge from large-scale web-harvested data, it is promising to utilize the generalizability of VLP models to incorporate knowledge into image descriptions.
no code implementations • 18 Oct 2022 • Zheng Ma, Shi Zong, Mianzhi Pan, Jianbing Zhang, ShuJian Huang, Xinyu Dai, Jiajun Chen
In recent years, vision and language pre-training (VLP) models have advanced the state-of-the-art results in a variety of cross-modal downstream tasks.
no code implementations • 2 Oct 2022 • Zhihuan Kuang, Shi Zong, Jianbing Zhang, Jiajun Chen, Hongfu Liu
In this paper, we consider a novel research problem: music-to-text synaesthesia.
1 code implementation • ACL 2019 • Peng Wu, Shu-Jian Huang, Rongxiang Weng, Zaixiang Zheng, Jianbing Zhang, Xiaohui Yan, Jia-Jun Chen
However, one critical problem is that current approaches only get high accuracy for questions whose relations have been seen in the training data.