no code implementations • 18 Jan 2024 • Taichi Nishimura, Shota Nakada, Masayoshi Kondo
This paper analyzes what it calls audio hallucinations in large audio-video language models.
no code implementations • 1 Dec 2023 • Taichi Nishimura, Shota Nakada, Masayoshi Kondo
The zero-shot QASIR yields two discoveries: (1) it enables VLMs to generalize to super images and (2) the grid size $N$, image resolution, and VLM size are key trade-off parameters between performance and computation costs.
no code implementations • 23 Oct 2023 • Shuhei Yokoo, Peifei Zhu, Yuchi Ishikawa, Mikihiro Tanaka, Masayoshi Kondo, Hirokatsu Kataoka
Our solution adopts the large multimodal models CLIP and BLIP-2 to filter and modify web-crawled data, and uses external datasets along with a bag of tricks to improve data quality.