no code implementations • 22 Apr 2024 • Dongze Hao, Qunbo Wang, Longteng Guo, Jie Jiang, Jing Liu
Knowledge-based Visual Question Answering (VQA) requires models to incorporate external knowledge to respond to questions about visual content.
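As a rough illustration of what "incorporating external knowledge" can mean in practice, here is a minimal PyTorch sketch of a knowledge-augmented answer head. It is not the paper's architecture; `KnowledgeVQA` and its feature inputs are hypothetical stand-ins for a real vision backbone, text encoder, and knowledge retriever.

```python
# Minimal sketch of a knowledge-based VQA pipeline (not the paper's method).
# Image, question, and retrieved-knowledge features are assumed to come from
# hypothetical upstream encoders and a retriever.
import torch
import torch.nn as nn

class KnowledgeVQA(nn.Module):
    def __init__(self, dim=768, vocab_size=3129):  # 3129 = common VQA answer vocab
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.answer_head = nn.Linear(dim, vocab_size)  # classify over answers

    def forward(self, image_feats, question_feats, knowledge_feats):
        # Concatenate question tokens with retrieved-knowledge tokens so the
        # image tokens can attend to external facts alongside the question.
        context = torch.cat([question_feats, knowledge_feats], dim=1)
        fused, _ = self.fuse(query=image_feats, key=context, value=context)
        return self.answer_head(fused.mean(dim=1))  # pooled logits over answers

# Toy usage with random features (batch=2, 768-d tokens).
model = KnowledgeVQA()
logits = model(torch.randn(2, 49, 768),   # image patch features
               torch.randn(2, 16, 768),   # question token features
               torch.randn(2, 64, 768))   # retrieved knowledge token features
print(logits.shape)  # torch.Size([2, 3129])
```

Cross-attention is used here only because it gives the visual tokens a direct path to the retrieved facts; a concrete system could fuse the modalities however its backbone dictates.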
no code implementations • 15 Mar 2024 • Dongze Hao, Jian Jia, Longteng Guo, Qunbo Wang, Te Yang, Yan Li, Yanhua Cheng, Bo Wang, Quan Chen, Han Li, Jing Liu
We condense the retrieved knowledge passages from two perspectives.
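The snippet does not name the two perspectives, so the following is only a generic sketch of passage condensation under an assumed two-stage recipe (passage-level pruning, then sentence-level pruning); `embed` is a hypothetical embedding function supplied by the caller, and `toy_embed` is a deliberately crude stand-in encoder.

```python
# Generic sketch of condensing retrieved passages before feeding them to a
# reader. Both assumed stages score text by cosine similarity to the question.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def condense(question_vec, passages, embed, top_passages=3, top_sentences=2):
    # Assumed perspective 1: keep only the most question-relevant passages.
    ranked = sorted(passages, key=lambda p: cosine(question_vec, embed(p)),
                    reverse=True)
    condensed = []
    for p in ranked[:top_passages]:
        # Assumed perspective 2: within each kept passage, keep top sentences.
        sents = [s.strip() for s in p.split(".") if s.strip()]
        sents.sort(key=lambda s: cosine(question_vec, embed(s)), reverse=True)
        condensed.append(". ".join(sents[:top_sentences]))
    return condensed

# Toy usage with a bag-of-characters embedding as a stand-in encoder.
def toy_embed(text):
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1
    return v

q = toy_embed("what animal is shown")
print(condense(q, ["Cats are small felines. They purr.",
                   "Steel is an alloy. It contains iron."], toy_embed, 1, 1))
```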
1 code implementation • NeurIPS 2023 • Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu
Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which perceives and processes vision, audio, and subtitle modalities from videos and better supports a variety of tasks, including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). A toy sketch of omni-modality retrieval follows the leaderboard entry below.
Ranked #1 on Image Captioning on COCO Captions (SPICE metric, using extra training data)
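The sketch below illustrates omni-modality video-text retrieval in the spirit of the description above; it is not VAST's actual model or API, and the simple averaging fusion is a hypothetical stand-in for whatever learned fusion module a real system would use.

```python
# Hedged toy sketch of omni-modality video-text retrieval (not VAST's code).
# Each modality embedding is assumed to come from its own pretrained encoder.
import torch
import torch.nn.functional as F

def fuse_omni(vision_emb, audio_emb, subtitle_emb):
    # Combine the three modality embeddings into one video representation;
    # plain averaging stands in for a learned fusion module.
    return F.normalize((vision_emb + audio_emb + subtitle_emb) / 3.0, dim=-1)

def retrieve(video_embs, text_emb):
    # Rank candidate videos by cosine similarity with the query caption.
    sims = video_embs @ F.normalize(text_emb, dim=-1)
    return torch.argsort(sims, descending=True)

# Toy usage: 5 candidate videos, 512-d embeddings per modality.
v, a, s = (F.normalize(torch.randn(5, 512), dim=-1) for _ in range(3))
videos = fuse_omni(v, a, s)
query = torch.randn(512)  # caption embedding from a hypothetical text encoder
print(retrieve(videos, query))  # video indices, best match first
```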