1 code implementation • 2 Apr 2024 • Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks.
1 code implementation • 28 Mar 2024 • Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong
To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) a new jailbreaking dataset containing 100 unique behaviors, which we call JBB-Behaviors; (2) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (3) a standardized evaluation framework that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard that tracks the performance of attacks and defenses for various LLMs.
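A minimal sketch of how the JBB-Behaviors dataset could be loaded, assuming it is distributed on the Hugging Face Hub under `JailbreakBench/JBB-Behaviors`; the configuration name, splits, and column names below are assumptions, not confirmed by this abstract:

```python
# Hedged sketch: load JBB-Behaviors via the Hugging Face datasets library.
# Repo id, config name, split and column names are all assumptions.
from datasets import load_dataset

behaviors = load_dataset("JailbreakBench/JBB-Behaviors", "behaviors")
for row in list(behaviors["harmful"])[:3]:
    print(row["Behavior"], "->", row["Category"])
```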
1 code implementation • 19 Feb 2024 • Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein
The CLIP model, or one of its variants, is used as a frozen vision encoder in many vision-language models (VLMs), e.g., LLaVA and OpenFlamingo.
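As an illustration of this design, here is a minimal sketch of the frozen-CLIP pattern: the vision tower is kept fixed while a small projection (hypothetical here) maps its features into the language model's token space; the model name and the 4096-dimensional target are assumptions.

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

# Frozen CLIP vision tower, as used (in spirit) by LLaVA-style VLMs.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
vision.requires_grad_(False)
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical trainable projection into the LLM's embedding space (dim assumed).
proj = torch.nn.Linear(vision.config.hidden_size, 4096)

def encode(image):
    pixels = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():                      # the encoder itself is never updated
        feats = vision(pixel_values=pixels).last_hidden_state
    return proj(feats)                         # visual tokens fed to the LLM
```

Because the encoder is shared and frozen, any adversarial weakness of CLIP itself propagates to every VLM built on top of it, which is what motivates making the encoder robust.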
1 code implementation • 7 Feb 2024 • Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what counts as high-quality?
no code implementations • 24 Nov 2023 • Francesco Croce, Matthias Hein
General purpose segmentation models are able to generate (semantic) segmentation masks from a variety of prompts, including visual prompts (points, boxes, etc.) and textual ones.
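For concreteness, a sketch of prompt-driven mask generation with Segment Anything (SAM), one instance of such general-purpose models; the checkpoint path and the in-scope `image` array are assumptions:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # path assumed
predictor = SamPredictor(sam)
predictor.set_image(image)  # H x W x 3 uint8 RGB array, assumed to be in scope

# Point prompt: one foreground click.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]), point_labels=np.array([1]))

# Box prompt: (x0, y0, x1, y1) in pixel coordinates.
masks, scores, _ = predictor.predict(box=np.array([100, 100, 400, 400]))
```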
1 code implementation • 22 Jun 2023 • Francesco Croce, Naman D Singh, Matthias Hein
While a large amount of work has focused on designing adversarial attacks against image classifiers, only a few methods exist to attack semantic segmentation models.
1 code implementation • NeurIPS 2023 • Naman D Singh, Francesco Croce, Matthias Hein
While adversarial training has been extensively studied for ResNet architectures and low resolution datasets like CIFAR, much less is known for ImageNet.
no code implementations • CVPR 2023 • Francesco Croce, Sylvestre-Alvise Rebuffi, Evan Shelhamer, Sven Gowal
Adversarial training is widely used to make classifiers robust to a specific threat or adversary, such as $\ell_p$-norm bounded perturbations of a given $p$-norm.
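A minimal sketch of the standard recipe for a single threat model, $\ell_\infty$ adversarial training with PGD (Madry et al. style); `model`, `loader`, and `optimizer` are assumed to exist:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # Random start inside the eps-ball, then projected gradient ascent on the loss.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        delta.data = (x + delta).clamp(0, 1) - x   # keep x + delta a valid image
        delta.requires_grad_(True)
    return (x + delta).detach()

for x, y in loader:  # train on adversarial examples only
    optimizer.zero_grad()
    F.cross_entropy(model(pgd_attack(model, x, y)), y).backward()
    optimizer.step()
```

Training against a single $p$-norm in this way typically yields little robustness to other norms, which is the gap this line of work addresses.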
1 code implementation • 14 Feb 2023 • Maksym Andriushchenko, Francesco Croce, Maximilian Müller, Matthias Hein, Nicolas Flammarion
Overall, we observe that sharpness does not correlate well with generalization, but rather with training parameters such as the learning rate, which can be positively or negatively correlated with generalization depending on the setup.
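To make the quantity concrete, a rough sketch of estimating worst-case sharpness, i.e. the increase of the loss under a small weight perturbation, via a few ascent steps; there is no exact projection onto the $\rho$-ball here, so this is a crude heuristic, and `model` plus a batch `(x, y)` are assumed:

```python
import copy
import torch
import torch.nn.functional as F

def worst_case_sharpness(model, x, y, rho=0.05, steps=5):
    base = F.cross_entropy(model(x), y).item()
    probe = copy.deepcopy(model)               # perturb a copy of the weights
    opt = torch.optim.SGD(probe.parameters(), lr=rho / steps)
    for _ in range(steps):
        loss = -F.cross_entropy(probe(x), y)   # gradient *ascent* on the loss
        opt.zero_grad(); loss.backward(); opt.step()
    return F.cross_entropy(probe(x), y).item() - base
```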
1 code implementation • 21 Oct 2022 • Maximilian Augustin, Valentyn Boreiko, Francesco Croce, Matthias Hein
Two modifications to the diffusion process are key for our DVCEs: an adaptive parameterization, whose hyperparameters generalize across images and models, combined with distance regularization and a late start of the diffusion process, allows us to generate images with minimal semantic changes to the originals while still flipping the classification.
no code implementations • 10 Oct 2022 • Sylvestre-Alvise Rebuffi, Francesco Croce, Sven Gowal
By co-training a neural network on clean and adversarial inputs, it is possible to improve classification accuracy on the clean, non-adversarial inputs.
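A sketch of the co-training objective, mixing a clean and an adversarial loss term with a weight $\lambda$ (value assumed), reusing the `pgd_attack` sketch above:

```python
lam = 0.5  # assumed trade-off between clean and adversarial terms
for x, y in loader:
    x_adv = pgd_attack(model, x, y)
    loss = lam * F.cross_entropy(model(x), y) \
         + (1 - lam) * F.cross_entropy(model(x_adv), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```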
no code implementations • 14 Sep 2022 • Francesco Croce, Matthias Hein
In recent years, novel architecture components for image classification have been developed, starting with the attention and patches used in transformers.
1 code implementation • 16 May 2022 • Valentyn Boreiko, Maximilian Augustin, Francesco Croce, Philipp Berens, Matthias Hein
Visual counterfactual explanations (VCEs) in image space are an important tool to understand decisions of image classifiers, as they show which changes to an image would flip the classifier's decision.
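One simple way to realize this idea (a sketch, not the authors' exact method): push an image toward a chosen target class with normalized gradient steps while projecting back onto a small $\ell_2$ ball around the original, so only minimal changes are allowed; `model` is assumed:

```python
import torch
import torch.nn.functional as F

def l2_counterfactual(model, x, target_class, eps=3.0, alpha=0.3, steps=50):
    y = torch.full((x.size(0),), target_class, dtype=torch.long)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)   # low loss = target class
        grad, = torch.autograd.grad(loss, delta)
        g = grad / grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        delta = (delta - alpha * g).detach()          # step toward the target
        n = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        delta = delta * (eps / n).clamp(max=1.0)      # project onto the l2 ball
        delta.data = (x + delta).clamp(0, 1) - x      # stay a valid image
        delta.requires_grad_(True)
    return (x + delta).detach()
```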
1 code implementation • 28 Feb 2022 • Francesco Croce, Sven Gowal, Thomas Brunner, Evan Shelhamer, Matthias Hein, Taylan Cemgil
Adaptive defenses, which optimize at test time, promise to improve adversarial robustness.
1 code implementation • 26 May 2021 • Francesco Croce, Matthias Hein
In this way, we obtain the first multiple-norm robust model for ImageNet and raise the state of the art for multiple-norm robustness on CIFAR-10 to more than $51\%$.
2 code implementations • 1 Mar 2021 • Francesco Croce, Matthias Hein
Finally, we combine $l_1$-APGD and an adaptation of the Square Attack to $l_1$ into $l_1$-AutoAttack, an ensemble of attacks which reliably assesses adversarial robustness for the threat model of $l_1$-ball intersected with $[0, 1]^d$.
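A sketch of how such an evaluation might be run with the released `autoattack` package, assuming its `L1` mode corresponds to this paper; `model`, the batch `x, y`, and the budget `eps=12` (a common CIFAR-10 choice) are assumptions:

```python
from autoattack import AutoAttack

# model: classifier returning logits; x, y: a batch of images in [0, 1]^d.
adversary = AutoAttack(model, norm='L1', eps=12.0, version='standard')
x_adv = adversary.run_standard_evaluation(x, y, bs=128)
```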
1 code implementation • 19 Oct 2020 • Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, Matthias Hein
As a research community, we still lack a systematic understanding of the progress on adversarial robustness, which often makes it hard to identify the most promising ideas in training robust models.
2 code implementations • 23 Jun 2020 • Francesco Croce, Maksym Andriushchenko, Naman D. Singh, Nicolas Flammarion, Matthias Hein
We propose a versatile framework based on random search, Sparse-RS, for score-based sparse targeted and untargeted attacks in the black-box setting.
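A toy sketch of the underlying random-search idea (greatly simplified: the real Sparse-RS also resamples pixel *locations* and uses margin-based scores): propose random changes to a sparse set of pixels and greedily keep those that lower the true-class score. `model` returns logits; `x` is one image in $[0,1]$:

```python
import torch

@torch.no_grad()
def sparse_rs_toy(model, x, y, k=50, iters=1000):
    c, h, w = x.shape
    idx = torch.randperm(h * w)[:k]                    # fixed sparse pixel set
    best = x.clone()
    best[:, idx // w, idx % w] = torch.rand(c, k)      # random init on the set
    best_score = model(best.unsqueeze(0))[0, y]
    for _ in range(iters):
        cand = best.clone()
        j = idx[torch.randint(k, (1,))]                # resample one pixel's colour
        cand[:, j // w, j % w] = torch.rand(c, 1)
        score = model(cand.unsqueeze(0))[0, y]
        if score < best_score:                         # greedy, score-based accept
            best, best_score = cand, score
    return best
```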
10 code implementations • ICML 2020 • Francesco Croce, Matthias Hein
The field of defense strategies against adversarial attacks has grown significantly in recent years, but progress is hampered because the evaluation of adversarial defenses is often insufficient and thus gives a wrong impression of robustness.
1 code implementation • ECCV 2020 • Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, Matthias Hein
We propose the Square Attack, a score-based black-box $l_2$- and $l_\infty$-adversarial attack that does not rely on local gradient information and thus is not affected by gradient masking.
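A toy sketch of the core loop (the actual attack uses a tuned schedule for the square size and a stripe initialization): sample random square-shaped $\pm\epsilon$ updates and keep one only if the loss increases, never touching a gradient. `model` and a single-image batch `x, y` are assumed:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def square_attack_toy(model, x, y, eps=8/255, iters=1000, s=8):
    c, h, w = x.shape[1:]
    x_adv = (x + eps * torch.sign(torch.randn_like(x))).clamp(0, 1)
    best = F.cross_entropy(model(x_adv), y)
    for _ in range(iters):
        cand = x_adv.clone()
        i = torch.randint(h - s + 1, (1,)).item()
        j = torch.randint(w - s + 1, (1,)).item()
        colour = eps * torch.sign(torch.randn(1, c, 1, 1))   # one colour per square
        cand[:, :, i:i+s, j:j+s] = (x[:, :, i:i+s, j:j+s] + colour).clamp(0, 1)
        loss = F.cross_entropy(model(cand), y)
        if loss > best:                  # higher loss = closer to misclassification
            x_adv, best = cand, loss
    return x_adv
```

Because acceptance depends only on the model's scores, gradient masking in a defense does not help against this attack.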
1 code implementation • ICCV 2019 • Francesco Croce, Matthias Hein
On the other hand, the pixelwise perturbations of sparse attacks are typically large and thus potentially detectable.
2 code implementations • ICML 2020 • Francesco Croce, Matthias Hein
The robustness of neural-network-based classifiers against adversarial manipulation is mainly evaluated with empirical attacks, since methods for exact computation, even when available, do not scale to large networks.
1 code implementation • ICLR 2020 • Francesco Croce, Matthias Hein
In recent years several adversarial attacks and defenses have been proposed.
1 code implementation • 27 Mar 2019 • Francesco Croce, Jonas Rauber, Matthias Hein
Modern neural networks are highly non-robust against adversarial manipulation.
no code implementations • 28 Nov 2018 • Francesco Croce, Matthias Hein
Relatively fast heuristics have been proposed to produce such adversarial inputs, but the problem of finding the optimal adversarial input, i.e., the one with the minimal change to the input, is NP-hard.
2 code implementations • 17 Oct 2018 • Francesco Croce, Maksym Andriushchenko, Matthias Hein
It has been shown that neural network classifiers are not robust.