1 code implementation • 24 May 2024 • Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro
Recently, open-source large language and vision models (LLVMs) have curated high-quality visual instruction tuning datasets and utilized additional vision encoders or multiple computer vision models to narrow the performance gap with powerful closed-source LLVMs.
1 code implementation • 12 Mar 2024 • Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models.
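As a rough illustration of the idea above, the sketch below shows one way outputs from external segmentation, detection, scene-graph generation (SGG), and OCR models could be verbalized and merged into a single auxiliary context string for an LLVM. The helper name and input formats are hypothetical assumptions, not MoAI's actual interface.

```python
def build_auxiliary_context(seg_labels, det_boxes, sgg_triples, ocr_texts):
    """Verbalize external CV model outputs into one auxiliary context string.

    seg_labels:  list of segment class names, e.g. ["sky", "road"]
    det_boxes:   list of (name, box) tuples from a detector
    sgg_triples: list of (subject, predicate, object) relations from SGG
    ocr_texts:   list of strings read by an OCR model
    (All names/formats here are illustrative, not MoAI's API.)
    """
    parts = []
    if seg_labels:
        parts.append("segments: " + ", ".join(seg_labels))
    if det_boxes:
        parts.append("objects: " + ", ".join(f"{name}@{box}" for name, box in det_boxes))
    if sgg_triples:
        parts.append("relations: " + ", ".join(f"{s} {p} {o}" for s, p, o in sgg_triples))
    if ocr_texts:
        parts.append("text: " + ", ".join(ocr_texts))
    return " | ".join(parts)
```

In practice such verbalized auxiliary information would be tokenized and fed to the LLVM alongside the image features; the string form here is only to make the aggregation concrete.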
Ranked #27 on Visual Question Answering on MM-Vet
no code implementations • 7 Mar 2024 • Seunghee Han, Se Jin Park, Chae Won Kim, Yong Man Ro
We devise completeness loss and consistency loss based on semantic similarity scores.
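A minimal sketch of how similarity-based completeness and consistency losses could be formed from sentence embeddings; the exact formulation here (mean cosine distance) is an illustrative assumption, not the paper's definition.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def completeness_loss(summary_emb, source_embs):
    """Penalize source sentences whose semantics the output fails to cover
    (illustrative: mean cosine distance over source embeddings)."""
    return sum(1.0 - cosine_similarity(summary_emb, s) for s in source_embs) / len(source_embs)

def consistency_loss(output_emb, reference_emb):
    """Penalize an output that drifts semantically from its reference
    (illustrative: single cosine distance)."""
    return 1.0 - cosine_similarity(output_emb, reference_emb)
```

Both losses go to zero when the compared embeddings point in the same direction, so minimizing them pushes the model toward semantically complete and consistent outputs.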
1 code implementation • 17 Feb 2024 • Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks.
Ranked #35 on Visual Question Answering on MM-Vet
no code implementations • 27 Feb 2023 • Minsu Kim, Chae Won Kim, Yong Man Ro
The proposed DVFA can align the input transcription (i.e., sentence) with the talking face video without accessing the speech audio.
Automatic Speech Recognition (ASR) +3