Search Results for author: Maksym Andriushchenko

Found 25 papers, 20 papers with code

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

no code implementations 22 Apr 2024 Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tramèr

Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities.

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

1 code implementation 2 Apr 2024 Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks.

In-Context Learning

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

1 code implementation 28 Mar 2024 Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong

To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work -- which align with OpenAI's usage policies; (3) a standardized evaluation framework that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard that tracks the performance of attacks and defenses for various LLMs.

Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning

1 code implementation 7 Feb 2024 Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what does high-quality data actually look like?

Scaling Compute Is Not All You Need for Adversarial Robustness

no code implementations 20 Dec 2023 Edoardo Debenedetti, Zishen Wan, Maksym Andriushchenko, Vikash Sehwag, Kshitij Bhardwaj, Bhavya Kailkhura

Finally, we make our benchmarking framework, built on top of the timm library (Wightman, 2019), publicly available to facilitate future analysis in efficient robust deep learning.

Adversarial Robustness, Benchmarking
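
The framework mentioned above builds on timm; for orientation, models in that library are typically instantiated as follows (standard timm usage, not the paper's benchmarking framework itself):

```python
# Standard timm usage (illustrative; not the paper's benchmarking framework).
import timm

# Create a pretrained backbone by name; timm exposes hundreds of architectures.
model = timm.create_model("resnet50", pretrained=True)
model.eval()
```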

Analyzing Sharpness-aware Minimization under Overparameterization

1 code implementation 29 Nov 2023 Sungbin Shin, Dongyeop Lee, Maksym Andriushchenko, Namhoon Lee

Training an overparameterized neural network can yield minimizers of different generalization capabilities despite the same level of training loss.

Why Do We Need Weight Decay in Modern Deep Learning?

1 code implementation 6 Oct 2023 Maksym Andriushchenko, Francesco D'Angelo, Aditya Varre, Nicolas Flammarion

In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory.

Learning Theory, Stochastic Optimization
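
For reference, the weight decay studied in this line of work multiplicatively shrinks the weights at every step,
\[
\theta_{t+1} = (1 - \eta\lambda)\,\theta_t - \eta\,\nabla_{\theta}\mathcal{L}(\theta_t),
\]
with learning rate $\eta$ and decay coefficient $\lambda$; this coincides with $\ell_2$ regularization for plain SGD but not for adaptive optimizers such as Adam, which is why decoupled variants like AdamW exist.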

Layer-wise Linear Mode Connectivity

1 code implementation 13 Jul 2023 Linara Adilova, Maksym Andriushchenko, Michael Kamp, Asja Fischer, Martin Jaggi

Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.

Federated Learning, Linear Mode Connectivity
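
To make the averaging operation concrete, a minimal PyTorch sketch is given below (illustrative only, not the authors' code); averaging a single layer at a time, as in the layer-wise analysis, amounts to restricting the combination to one parameter tensor.

```python
# Minimal sketch: per-tensor convex combination of two models' parameters
# (illustrative only, not the authors' code).
import copy
import torch

def average_models(model_a, model_b, alpha=0.5):
    """Return a model whose weights are alpha*model_a + (1-alpha)*model_b."""
    averaged = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    new_state = {}
    for name, tensor_a in state_a.items():
        if torch.is_floating_point(tensor_a):
            new_state[name] = alpha * tensor_a + (1 - alpha) * state_b[name]
        else:
            # integer buffers (e.g. batch-norm step counters) are copied as-is
            new_state[name] = tensor_a.clone()
    averaged.load_state_dict(new_state)
    return averaged
```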

Transferable Adversarial Robustness for Categorical Data via Universal Robust Embeddings

1 code implementation NeurIPS 2023 Klim Kireev, Maksym Andriushchenko, Carmela Troncoso, Nicolas Flammarion

We present a method that allows us to train adversarially robust deep networks for tabular data and to transfer this robustness to other classifiers via universal robust embeddings tailored to categorical data.

Adversarial Robustness Fraud Detection +2

A Modern Look at the Relationship between Sharpness and Generalization

1 code implementation 14 Feb 2023 Maksym Andriushchenko, Francesco Croce, Maximilian Müller, Matthias Hein, Nicolas Flammarion

Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters like the learning rate that can be positively or negatively correlated with generalization depending on the setup.

SGD with Large Step Sizes Learns Sparse Features

1 code implementation 11 Oct 2022 Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion

We present empirical observations that (i) commonly used large step sizes lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) this stabilization induces hidden stochastic dynamics, orthogonal to the bouncing directions, that implicitly bias the iterates toward sparse predictors.

Towards Understanding Sharpness-Aware Minimization

1 code implementation 13 Jun 2022 Maksym Andriushchenko, Nicolas Flammarion

We further study the properties of the implicit bias on non-linear networks empirically, where we show that fine-tuning a standard model with SAM can lead to significant generalization improvements.
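
For readers unfamiliar with SAM, one update can be sketched as follows (a simplified version of the generic SAM procedure, not the code accompanying this paper): the weights are first perturbed along the normalized gradient within an $\ell_2$ ball of radius $\rho$, and the base optimizer then steps using the gradient computed at the perturbed point.

```python
# Simplified SAM update step (generic procedure, not this paper's code).
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    # 1) gradient at the current weights w
    loss_fn(model(x), y).backward()

    # 2) ascend: w <- w + rho * g / ||g||_2, remembering the perturbation
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(2) for p in params]), 2)
    perturbations = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            perturbations.append(e)
    base_optimizer.zero_grad()

    # 3) gradient at the perturbed weights
    loss_fn(model(x), y).backward()

    # 4) undo the perturbation, then step with the "sharpness-aware" gradient
    with torch.no_grad():
        for p, e in zip(params, perturbations):
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
```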

Understanding Sharpness-Aware Minimization

no code implementations 29 Sep 2021 Maksym Andriushchenko, Nicolas Flammarion

Next, we discuss why SAM can be helpful in the noisy label setting where we first show that it can help to improve generalization even for linear classifiers.

Learning with noisy labels

On the effectiveness of adversarial training against common corruptions

1 code implementation 3 Mar 2021 Klim Kireev, Maksym Andriushchenko, Nicolas Flammarion

First, we show that, when used with an appropriately selected perturbation radius, $\ell_p$ adversarial training can serve as a strong baseline against common corruptions, improving both accuracy and calibration.

Data Augmentation
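
As a reminder of what the $\ell_p$ adversarial training baseline looks like in code, here is a minimal single-step ($\ell_\infty$, FGSM-style) training step; this is a generic sketch, not the paper's implementation, and the perturbation radius `epsilon` is the quantity whose selection the excerpt refers to.

```python
# Generic single-step l_inf adversarial training step (FGSM-style inner
# maximization); a sketch, not the paper's implementation.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=8 / 255):
    # inner maximization: one signed-gradient step of size epsilon
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = (x_adv + epsilon * grad.sign()).clamp(0, 1).detach()

    # outer minimization: ordinary training step on the adversarial batch
    optimizer.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    optimizer.step()
```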

RobustBench: a standardized adversarial robustness benchmark

1 code implementation 19 Oct 2020 Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, Matthias Hein

As a research community, we are still lacking a systematic understanding of the progress on adversarial robustness, which often makes it hard to identify the most promising ideas in training robust models.

Adversarial Robustness, Benchmarking, +3
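
Models from the accompanying model zoo are typically loaded as below; the argument names follow the robustbench documentation at the time of writing and may differ across versions.

```python
# Loading a model from the RobustBench model zoo (check the current
# robustbench documentation; names and arguments may have changed).
from robustbench.utils import load_model

model = load_model(model_name="Standard", dataset="cifar10", threat_model="Linf")
model.eval()
```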

Understanding and Improving Fast Adversarial Training

1 code implementation NeurIPS 2020 Maksym Andriushchenko, Nicolas Flammarion

We show that adding a random step to FGSM, as proposed in Wong et al. (2020), does not prevent catastrophic overfitting, and that randomness is not important per se -- its main role being simply to reduce the magnitude of the perturbation.
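
The FGSM-plus-random-step construction of Wong et al. (2020) discussed in the excerpt can be sketched as follows (illustrative only; the paper above analyzes this construction rather than proposing it):

```python
# Sketch of FGSM with a random initial step ("RS-FGSM", Wong et al., 2020);
# illustrative only -- the paper above analyzes this construction.
import torch
import torch.nn.functional as F

def rs_fgsm(model, x, y, epsilon=8 / 255, alpha=10 / 255):
    # random step: start from a uniform point inside the l_inf ball
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon).requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    # one signed-gradient step, projected back onto the l_inf ball
    delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
    return (x + delta).clamp(0, 1).detach()
```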

Sparse-RS: a versatile framework for query-efficient sparse black-box adversarial attacks

2 code implementations 23 Jun 2020 Francesco Croce, Maksym Andriushchenko, Naman D. Singh, Nicolas Flammarion, Matthias Hein

We propose a versatile framework based on random search, Sparse-RS, for score-based sparse targeted and untargeted attacks in the black-box setting.

Malware Detection
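
The random-search principle behind such sparse attacks can be illustrated with a deliberately simplified loop (a sketch only; the actual Sparse-RS framework uses carefully designed sampling distributions and schedules). The attacker only needs a score `loss_fn` computed from model outputs, never gradients.

```python
# Deliberately simplified score-based random search for a sparse (l_0)
# perturbation; the real Sparse-RS uses tuned sampling schedules.
import torch

def sparse_random_search(loss_fn, x, k=10, n_queries=1000):
    """loss_fn(img) -> scalar to maximize from model scores only;
    x has shape (C, H, W); at most k pixel locations are changed."""
    c, h, w = x.shape
    loc = torch.randperm(h * w)[:k]   # locations of the k perturbed pixels
    val = torch.rand(c, k)            # their replacement colours

    def apply(loc, val):
        img = x.clone()
        img.view(c, -1)[:, loc] = val
        return img

    best_loss = loss_fn(apply(loc, val))
    for _ in range(n_queries):
        new_loc, new_val = loc.clone(), val.clone()
        j = torch.randint(0, k, (1,)).item()               # pick one pixel
        new_loc[j] = torch.randint(0, h * w, (1,)).item()  # resample its location
        new_val[:, j] = torch.rand(c)                      # resample its colour
        cand_loss = loss_fn(apply(new_loc, new_val))
        if cand_loss > best_loss:                          # greedy acceptance
            loc, val, best_loss = new_loc, new_val, cand_loss
    return apply(loc, val)
```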

On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines

2 code implementations ICLR 2021 Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow

Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks.

Misconceptions

Square Attack: a query-efficient black-box adversarial attack via random search

1 code implementation ECCV 2020 Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, Matthias Hein

We propose the Square Attack, a score-based black-box $l_2$- and $l_\infty$-adversarial attack that does not rely on local gradient information and thus is not affected by gradient masking.

Adversarial Attack
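
A stripped-down version of such a score-based random-search attack is sketched below, in the spirit of the description above; the real implementation uses a tuned square-size schedule and further refinements.

```python
# Stripped-down score-based random search for an l_inf perturbation;
# illustrative only -- the actual attack uses a tuned square-size schedule.
import torch

def random_search_linf(loss_fn, x, epsilon=8 / 255, n_queries=1000, patch=8):
    """loss_fn(img) -> scalar to maximize using model scores only (no gradients);
    x has shape (C, H, W)."""
    c, h, w = x.shape
    # initialize with vertical stripes of magnitude +/- epsilon
    delta = (epsilon * torch.sign(torch.randn(c, 1, w))).expand(c, h, w).clone()
    best_loss = loss_fn((x + delta).clamp(0, 1))
    for _ in range(n_queries):
        # propose a localized update: one square patch set to +/- epsilon per channel
        i = torch.randint(0, h - patch + 1, (1,)).item()
        j = torch.randint(0, w - patch + 1, (1,)).item()
        candidate = delta.clone()
        candidate[:, i:i + patch, j:j + patch] = epsilon * torch.sign(torch.randn(c, 1, 1))
        cand_loss = loss_fn((x + candidate).clamp(0, 1))
        if cand_loss > best_loss:  # accept only if the score improves
            delta, best_loss = candidate, cand_loss
    return (x + delta).clamp(0, 1)
```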

Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem

1 code implementation CVPR 2019 Matthias Hein, Maksym Andriushchenko, Julian Bitterwolf

We show that the proposed technique is surprisingly effective at reducing the confidence of predictions far away from the training data, while maintaining high-confidence predictions and test error on the original classification task compared to standard training.

General Classification

Logit Pairing Methods Can Fool Gradient-Based Attacks

1 code implementation 29 Oct 2018 Marius Mosbach, Maksym Andriushchenko, Thomas Trost, Matthias Hein, Dietrich Klakow

Recently, Kannan et al. [2018] proposed several logit regularization methods to improve the adversarial robustness of classifiers.

Adversarial Robustness

Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation

no code implementations NeurIPS 2017 Matthias Hein, Maksym Andriushchenko

In this paper we establish, for the first time, formal guarantees on the robustness of a classifier by giving instance-specific lower bounds on the norm of the input manipulation required to change the classifier's decision.

General Classification
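
A guarantee of this type can be written, in simplified and generic form (for orientation only; the paper's actual bound is a local, instance-specific Cross-Lipschitz bound), as
\[
\|\delta\|_p \;<\; \min_{j \neq c} \frac{f_c(x) - f_j(x)}{L_{c,j}}
\;\;\Longrightarrow\;\;
\arg\max_k f_k(x + \delta) = c,
\]
where $c$ is the class predicted at $x$ and $L_{c,j}$ bounds the Lipschitz constant of the score difference $f_c - f_j$ with respect to the dual norm $\|\cdot\|_q$, $1/p + 1/q = 1$.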
