no code implementations • 21 Feb 2024 • Yunxin Li, Xinyu Chen, Baotian Hu, Haoyuan Shi, Min Zhang
Evaluating and rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely used visual-language projection approaches (e.g., Q-Former or MLP) focus on aligning image-text descriptions yet ignore visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge.
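To make the contrast concrete, below is a minimal PyTorch sketch of the two projection schemes the abstract names: a per-patch MLP projector and a Q-Former-style projector with learnable queries. All shapes, dimensions, and class names here are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch (assumed shapes and hyperparameters) of the two common
# visual-language projection schemes: an MLP that maps vision-encoder patch
# features into the LLM embedding space, and a Q-Former-style module whose
# fixed set of learnable queries cross-attends over the patch features.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """LLaVA-style projector: per-patch MLP into the LLM hidden size."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):       # (B, N_patches, vis_dim)
        return self.net(patch_feats)      # (B, N_patches, llm_dim)

class QFormerProjector(nn.Module):
    """BLIP-2-style projector: learnable queries cross-attend to patches."""
    def __init__(self, vis_dim=1024, llm_dim=4096, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.out = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats):       # (B, N_patches, vis_dim)
        q = self.queries.expand(patch_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.out(attended)         # (B, n_queries, llm_dim)

# Both produce "visual tokens" an LLM can consume; note that neither
# objective explicitly ties those tokens to external knowledge, which is
# the gap the abstract points out.
feats = torch.randn(2, 256, 1024)         # e.g., ViT-L/14 patch features
print(MLPProjector()(feats).shape)        # torch.Size([2, 256, 4096])
print(QFormerProjector()(feats).shape)    # torch.Size([2, 32, 4096])
```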
no code implementations • 15 May 2023 • Xuanchen Li, Yan Niu, Bo Zhao, Haoyuan Shi, Zitong An
In both applications, our model substantially alleviates artifacts such as Moiré patterns and over-smoothing at a computational cost similar to or lower than that of currently top-performing models, as validated by diverse evaluations.