no code implementations • 22 Apr 2024 • Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tramèr
Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities.
1 code implementation • 2 Apr 2024 • Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks.
1 code implementation • 28 Mar 2024 • Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong
To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work -- which align with OpenAI's usage policies; (3) a standardized evaluation framework that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard that tracks the performance of attacks and defenses for various LLMs.
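The artifacts in such a repository are just structured records, so they can be consumed with a few lines of code. The sketch below assumes a hypothetical local export artifacts.json with fields behavior, prompt, and jailbroken; these names are assumptions, not necessarily the schema of the actual JailbreakBench package.

```python
import json

# Hypothetical local export of jailbreak artifacts; the file name and field
# names are assumptions, not the actual JailbreakBench schema.
with open("artifacts.json") as f:
    artifacts = json.load(f)

# Attack success rate over the benchmark behaviors.
successes = sum(1 for a in artifacts if a["jailbroken"])
print(f"ASR: {successes / len(artifacts):.1%} over {len(artifacts)} behaviors")

# Group prompts by behavior so they can be replayed against a new defense.
by_behavior = {}
for a in artifacts:
    by_behavior.setdefault(a["behavior"], []).append(a["prompt"])
```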
1 code implementation • 7 Feb 2024 • Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what exactly counts as high-quality data?
no code implementations • 20 Dec 2023 • Edoardo Debenedetti, Zishen Wan, Maksym Andriushchenko, Vikash Sehwag, Kshitij Bhardwaj, Bhavya Kailkhura
Finally, we make our benchmarking framework (built on top of the timm library) publicly available to facilitate future analysis in efficient robust deep learning.
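Because the framework builds on timm, the models it benchmarks can be instantiated with timm's standard factory call. A minimal sketch follows; the model name and the parameter-count proxy for efficiency are illustrative, not the paper's benchmarking code.

```python
import timm
import torch

# Any architecture name from timm.list_models() can be used here.
model = timm.create_model("resnet50", pretrained=True)
model.eval()

# Parameter count as a crude proxy for the efficiency axis of the benchmark.
n_params = sum(p.numel() for p in model.parameters())
print(f"resnet50: {n_params / 1e6:.1f}M parameters")

# A single forward pass on a dummy batch to sanity-check the input resolution.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # (1, 1000) for an ImageNet-1k head
```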
1 code implementation • 29 Nov 2023 • Sungbin Shin, Dongyeop Lee, Maksym Andriushchenko, Namhoon Lee
Training an overparameterized neural network can yield minimizers with different generalization capabilities despite attaining the same training loss.
1 code implementation • 6 Oct 2023 • Maksym Andriushchenko, Francesco D'Angelo, Aditya Varre, Nicolas Flammarion
In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory.
1 code implementation • 13 Jul 2023 • Linara Adilova, Maksym Andriushchenko, Michael Kamp, Asja Fischer, Martin Jaggi
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
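A minimal PyTorch sketch of this kind of parameter averaging, assuming the two models share an architecture; the helper name and the choice to leave integer buffers untouched are mine, not the paper's code.

```python
import copy
import torch

def average_parameters(model_a, model_b, alpha=0.5):
    """Interpolate the parameters of two architecturally identical models."""
    avg = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    avg_state = {}
    for k in state_a:
        if state_a[k].is_floating_point():
            avg_state[k] = alpha * state_a[k] + (1.0 - alpha) * state_b[k]
        else:
            avg_state[k] = state_a[k]  # e.g. BatchNorm counters: keep as-is
    avg.load_state_dict(avg_state)
    return avg
```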
1 code implementation • NeurIPS 2023 • Klim Kireev, Maksym Andriushchenko, Carmela Troncoso, Nicolas Flammarion
We present a method that allows us to train adversarially robust deep networks for tabular data and to transfer this robustness to other classifiers via universal robust embeddings tailored to categorical data.
1 code implementation • 14 Feb 2023 • Maksym Andriushchenko, Francesco Croce, Maximilian Müller, Matthias Hein, Nicolas Flammarion
Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters like the learning rate that can be positively or negatively correlated with generalization depending on the setup.
1 code implementation • 11 Oct 2022 • Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion
We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) induce, through this stabilization, a hidden stochastic dynamics orthogonal to the bouncing directions that implicitly biases the iterates toward sparse predictors.
1 code implementation • 13 Jun 2022 • Maksym Andriushchenko, Nicolas Flammarion
We further study the properties of the implicit bias on non-linear networks empirically, where we show that fine-tuning a standard model with SAM can lead to significant generalization improvements.
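For context, a single sharpness-aware minimization update can be sketched in a few lines of PyTorch; rho and the overall structure below are illustrative assumptions, not the paper's training code.

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One SAM update: ascend to a nearby point in weight space, then apply
    the gradient computed there at the original weights."""
    # First pass: gradient at the current weights.
    loss_fn(model(x), y).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12

    # Ascent step of size rho in the normalized gradient direction.
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            e = rho * p.grad / grad_norm if p.grad is not None else None
            if e is not None:
                p.add_(e)
            eps.append(e)
    model.zero_grad()

    # Second pass: gradient at the perturbed weights.
    loss_fn(model(x), y).backward()

    # Restore the original weights and take the base optimizer step.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
```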
no code implementations • 25 Feb 2022 • Maksym Andriushchenko, Xiaoyang Rebecca Li, Geoffrey Oxholm, Thomas Gittings, Tu Bui, Nicolas Flammarion, John Collomosse
Finally, we show how to train an adversarially robust image comparator model for detecting editorial changes in matched images.
no code implementations • 29 Sep 2021 • Maksym Andriushchenko, Nicolas Flammarion
Next, we discuss why SAM can be helpful in the noisy-label setting, where we first show that it can improve generalization even for linear classifiers.
1 code implementation • 3 Mar 2021 • Klim Kireev, Maksym Andriushchenko, Nicolas Flammarion
First, we show that, when used with an appropriately selected perturbation radius, $\ell_p$ adversarial training can serve as a strong baseline against common corruptions, improving both accuracy and calibration.
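A minimal sketch of $\ell_\infty$ adversarial training with PGD is given below; eps, alpha, and the step count are illustrative defaults rather than the radii selected in the paper.

```python
import torch

def pgd_linf(model, x, y, loss_fn, eps=4 / 255, alpha=1 / 255, steps=10):
    """Projected gradient descent inside an l_inf ball of radius eps."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = ((x + delta).clamp(0, 1) - x).detach().requires_grad_(True)
    return (x + delta).detach()

# Inside the training loop, clean batches are simply replaced by adversarial ones:
#   x_adv = pgd_linf(model, x, y, loss_fn)
#   loss = loss_fn(model(x_adv), y); loss.backward(); optimizer.step()
```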
1 code implementation • 19 Oct 2020 • Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, Matthias Hein
As a research community, we are still lacking a systematic understanding of the progress on adversarial robustness which often makes it hard to identify the most promising ideas in training robust models.
1 code implementation • NeurIPS 2020 • Maksym Andriushchenko, Nicolas Flammarion
We show that adding a random step to FGSM, as proposed in Wong et al. (2020), does not prevent catastrophic overfitting, and that randomness is not important per se -- its main role being simply to reduce the magnitude of the perturbation.
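The claim about perturbation magnitude can be checked with a one-dimensional Monte-Carlo toy; eps and alpha below are commonly used fast-training values, but the experiment is only illustrative.

```python
import torch

eps, alpha = 8 / 255, 10 / 255

# Plain FGSM perturbs every coordinate by exactly eps. With the random start of
# Wong et al. (2020), a coordinate is first drawn uniformly from [-eps, eps],
# then moved by alpha in the gradient direction (taken as +1 here) and clipped.
delta0 = torch.empty(1_000_000).uniform_(-eps, eps)
delta = (delta0 + alpha).clamp(-eps, eps)

# The average magnitude drops below eps (roughly 0.86 * eps for these values),
# i.e. the random step mainly shrinks the effective perturbation.
print(float(delta.abs().mean()) / eps)
```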
2 code implementations • 23 Jun 2020 • Francesco Croce, Maksym Andriushchenko, Naman D. Singh, Nicolas Flammarion, Matthias Hein
We propose a versatile framework based on random search, Sparse-RS, for score-based sparse targeted and untargeted attacks in the black-box setting.
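A toy version of such a score-based random search under an $l_0$ budget of k pixels is sketched below; the proposal and acceptance scheme is simplified relative to the actual Sparse-RS schedule.

```python
import numpy as np

def sparse_rs_toy(score_fn, x, k=10, iters=500, seed=0):
    """score_fn(x) returns a margin to minimize (negative means misclassified);
    only black-box queries to score_fn are used."""
    rng = np.random.default_rng(seed)
    h, w, c = x.shape

    def apply(idx, vals):
        flat = x.reshape(h * w, c).copy()
        flat[idx] = vals
        return flat.reshape(h, w, c)

    # Current solution: k perturbed pixel locations with extreme values.
    idx = rng.choice(h * w, size=k, replace=False)
    vals = rng.integers(0, 2, size=(k, c)).astype(x.dtype)
    best = score_fn(apply(idx, vals))

    for _ in range(iters):
        new_idx, new_vals = idx.copy(), vals.copy()
        j = rng.integers(k)                 # resample one perturbed pixel
        new_idx[j] = rng.integers(h * w)
        new_vals[j] = rng.integers(0, 2, size=c)
        score = score_fn(apply(new_idx, new_vals))
        if score < best:                    # keep only improving proposals
            idx, vals, best = new_idx, new_vals, score
        if best < 0:
            break
    return apply(idx, vals)
```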
2 code implementations • ICLR 2021 • Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow
Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks.
1 code implementation • ECCV 2020 • Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, Matthias Hein
We propose the Square Attack, a score-based black-box $l_2$- and $l_\infty$-adversarial attack that does not rely on local gradient information and thus is not affected by gradient masking.
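One proposal step of such an attack is easy to sketch: perturb a random square window by $\pm\epsilon$ per channel and keep the candidate only if the black-box score improves. The sketch below covers a single $l_\infty$ proposal and omits the window-size schedule of the published attack.

```python
import numpy as np

def square_proposal(x_adv, x, eps, s, rng):
    """Return a candidate that differs from x_adv in one random s x s window,
    where the window is set to the original image x shifted by +/- eps per
    channel (and clipped to [0, 1]). Acceptance is decided by black-box scores."""
    h, w, c = x.shape
    r = rng.integers(0, h - s + 1)
    col = rng.integers(0, w - s + 1)
    signs = rng.choice([-1.0, 1.0], size=c)   # one sign per color channel
    cand = x_adv.copy()
    cand[r:r + s, col:col + s, :] = np.clip(
        x[r:r + s, col:col + s, :] + eps * signs, 0.0, 1.0
    )
    return cand
```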
1 code implementation • NeurIPS 2019 • Maksym Andriushchenko, Matthias Hein
The problem of adversarial robustness has been studied extensively for neural networks.
1 code implementation • CVPR 2019 • Matthias Hein, Maksym Andriushchenko, Julian Bitterwolf
We show that this technique is surprisingly effective in reducing the confidence of predictions far away from the training data, while maintaining high-confidence predictions and a test error on the original classification task comparable to standard training.
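A minimal sketch of that kind of objective: standard cross-entropy on in-distribution data plus a term that pushes predictions on out-of-distribution inputs (e.g. noise images) toward the uniform distribution. The weighting lam is illustrative, and the paper's adversarial variant additionally searches for worst-case out-of-distribution points.

```python
import torch.nn.functional as F

def low_confidence_loss(model, x_in, y_in, x_out, lam=1.0):
    """Cross-entropy on clean data plus the average cross-entropy between the
    predictions on out-of-distribution inputs and the uniform distribution."""
    loss_in = F.cross_entropy(model(x_in), y_in)
    log_p_out = F.log_softmax(model(x_out), dim=1)
    loss_out = -log_p_out.mean()  # cross-entropy to the uniform label distribution
    return loss_in + lam * loss_out
```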
1 code implementation • 29 Oct 2018 • Marius Mosbach, Maksym Andriushchenko, Thomas Trost, Matthias Hein, Dietrich Klakow
Recently, Kannan et al. [2018] proposed several logit regularization methods to improve the adversarial robustness of classifiers.
2 code implementations • 17 Oct 2018 • Francesco Croce, Maksym Andriushchenko, Matthias Hein
It has been shown that neural network classifiers are not robust to small adversarial perturbations of their inputs.
no code implementations • NeurIPS 2017 • Matthias Hein, Maksym Andriushchenko
In this paper we give, for the first time, formal guarantees on the robustness of a classifier by deriving instance-specific lower bounds on the norm of the input manipulation required to change the classifier's decision.
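The flavor of such an instance-specific guarantee can be stated compactly; the following is a paraphrase with my own notation, not the paper's exact theorem. If the classifier $f$ predicts class $c$ at $x$, the decision is unchanged for every perturbation $\delta$ with

```latex
\|\delta\|_p \;\le\; \min\Bigg\{\, \min_{j \neq c}
\frac{f_c(x) - f_j(x)}{\max_{y \in B_p(x,R)} \|\nabla f_c(y) - \nabla f_j(y)\|_q},\; R \Bigg\},
\qquad \frac{1}{p} + \frac{1}{q} = 1,
```

where $R$ bounds the region over which the gradient difference is controlled.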