no code implementations • 31 Jan 2024 • Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen
Despite the impressive capabilities of Multimodal Large Language Models (MLLMs) in integrating text and image modalities, challenges remain in accurately interpreting detailed visual elements.
Ranked #41 on Visual Question Answering on MM-Vet