1 code implementation • 19 Feb 2024 • Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein
The CLIP model, or one of its variants, is used as a frozen vision encoder in many vision-language models (VLMs), e.g. LLaVA and OpenFlamingo.
1 code implementation • 21 Aug 2023 • Christian Schlarmann, Matthias Hein
In this paper we show that imperceptible attacks on images, crafted to change the caption output of a multi-modal foundation model, can be used by malicious content providers to harm honest users, e.g. by guiding them to malicious websites or broadcasting fake information.
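The attack family described above keeps the perturbation imperceptible by bounding it in the l_inf norm while ascending the model's loss. A minimal sketch of that idea, using a hypothetical linear scorer as a stand-in for the VLM's captioning loss (a real attack would backpropagate through the foundation model instead):

```python
import numpy as np

# Hedged sketch: projected gradient ascent producing an imperceptible
# (l_inf-bounded) perturbation of an "image". The linear scorer below is
# a hypothetical stand-in for the captioning loss, not the authors' setup.

rng = np.random.default_rng(0)
w = rng.standard_normal(32)          # stand-in model weights
x = rng.uniform(0.0, 1.0, size=32)   # clean "image" (flattened pixels in [0, 1])

def score(v):
    return float(w @ v)              # surrogate for the captioning loss

eps = 8.0 / 255.0                    # l_inf budget: visually imperceptible
alpha = 1.0 / 255.0                  # per-step size
delta = np.zeros_like(x)

for _ in range(50):
    grad = w                                     # d(score)/dx for the linear surrogate
    delta = delta + alpha * np.sign(grad)        # gradient-sign ascent step
    delta = np.clip(delta, -eps, eps)            # project back into the l_inf ball
    delta = np.clip(x + delta, 0.0, 1.0) - x     # keep pixel values valid

x_adv = x + delta
print(np.max(np.abs(x_adv - x)))     # stays within the eps budget
print(score(x_adv) - score(x))       # yet the surrogate's output has shifted
```

The projection step is what distinguishes this from plain gradient ascent: after every update the perturbation is clipped back into the eps-ball, so the adversarial image remains visually indistinguishable from the original.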