19 Mar 2024 • Sensen Gao, Xiaojun Jia, Xuhong Ren, Ivor Tsang, Qing Guo
Vision-language pre-training (VLP) models exhibit remarkable capabilities in comprehending both images and text, yet they remain susceptible to multimodal adversarial examples (AEs).
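To make the notion of a multimodal AE concrete, below is a minimal sketch of a standard PGD-style attack on the image modality of a CLIP-like VLP model: the adversarial perturbation pushes the image embedding away from its paired text embedding, breaking cross-modal alignment. This illustrates the generic attack setting only, not the authors' specific method; the names `image_encoder` and `text_encoder` and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_image_attack(image_encoder, text_encoder, image, text_tokens,
                     eps=8/255, alpha=2/255, steps=10):
    """Untargeted L_inf PGD on the image input of a CLIP-like VLP model.

    Hypothetical interfaces: `image_encoder(images)` and
    `text_encoder(tokens)` each return embedding tensors of shape (B, D).
    Pixel values are assumed to lie in [0, 1].
    """
    adv = image.clone().detach()
    with torch.no_grad():
        text_emb = F.normalize(text_encoder(text_tokens), dim=-1)
    for _ in range(steps):
        adv.requires_grad_(True)
        img_emb = F.normalize(image_encoder(adv), dim=-1)
        # Cosine similarity between matched image-text pairs; the attack
        # minimizes it so the model can no longer align the modalities.
        loss = (img_emb * text_emb).sum(dim=-1).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                # descend on similarity
            adv = image + (adv - image).clamp(-eps, eps)   # project into eps-ball
            adv = adv.clamp(0, 1).detach()                 # keep valid pixels
    return adv
```

In transfer-attack settings like the one this paper studies, an AE of this kind is crafted on a surrogate VLP model and then evaluated on unseen victim models, so the practical question becomes how well the perturbation transfers rather than whether it fools the surrogate.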