Search Results for author: Florian Tramer

Found 17 papers, 9 papers with code

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

no code implementations • 15 Apr 2024 • Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger

This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs).

Paper
Add Code

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

1 code implementation • 28 Mar 2024 • Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong

To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work -- which align with OpenAI's usage policies; (3) a standardized evaluation framework that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard that tracks the performance of attacks and defenses for various LLMs.

Paper
Code

Increasing Confidence in Adversarial Robustness Evaluations

no code implementations • 28 Jun 2022 • Roland S. Zimmermann, Wieland Brendel, Florian Tramer, Nicholas Carlini

Hundreds of defenses have been proposed to make deep neural networks robust against minimal (adversarial) input perturbations.

Adversarial Robustness

Paper
Add Code

(Certified!!) Adversarial Robustness for Free!

1 code implementation • 21 Jun 2022 • Nicholas Carlini, Florian Tramer, Krishnamurthy Dj Dvijotham, Leslie Rice, MingJie Sun, J. Zico Kolter

In this paper we show how to achieve state-of-the-art certified adversarial robustness to 2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models.

Adversarial Robustness Denoising

Paper
Code

The Privacy Onion Effect: Memorization is Relative

no code implementations • 21 Jun 2022 • Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, Florian Tramer

Machine learning models trained on private datasets have been shown to leak their private data.

Machine Unlearning Memorization

Paper
Add Code

Debugging Differential Privacy: A Case Study for Privacy Auditing

no code implementations • 24 Feb 2022 • Florian Tramer, Andreas Terzis, Thomas Steinke, Shuang Song, Matthew Jagielski, Nicholas Carlini

Differential Privacy can provide provable privacy guarantees for training data in machine learning.

Paper
Add Code

Quantifying Memorization Across Neural Language Models

2 code implementations • 15 Feb 2022 • Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim.

Fairness Memorization

244

Paper
Code

Membership Inference Attacks From First Principles

2 code implementations • 7 Dec 2021 • Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, Florian Tramer

A membership inference attack allows an adversary to query a trained machine learning model to predict whether or not a particular example was contained in the model's training dataset.

Inference Attack Membership Inference Attack

539

Paper
Code

Data Poisoning Won’t Save You From Facial Recognition

no code implementations • ICML Workshop AML 2021 • Evani Radiya-Dixit, Florian Tramer

We demonstrate that this strategy provides a false sense of security, as it ignores an inherent asymmetry between the parties: users' pictures are perturbed once and for all before being published and scraped, and must thereafter fool all future models---including models trained adaptively against the users' past attacks, or models that use technologies discovered after the attack.

Data Poisoning

Paper
Add Code

Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them

no code implementations • ICML Workshop AML 2021 • Florian Tramer

We prove a general \emph{hardness reduction} between detection and classification of adversarial examples: given a robust detector for attacks at distance $\epsilon$ (in some metric), we can build a similarly robust (but inefficient) \emph{classifier} for attacks at distance $\epsilon/2$.

Paper
Add Code

Extracting Training Data from Large Language Models

3 code implementations • 14 Dec 2020 • Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel

We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.

Language Modelling

154

Paper
Code

Is Private Learning Possible with Instance Encoding?

2 code implementations • 10 Nov 2020 • Nicholas Carlini, Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Shuang Song, Abhradeep Thakurta, Florian Tramer

A private machine learning algorithm hides as much as possible about its training data while still preserving accuracy.

BIG-bench Machine Learning

Paper
Code

Label-Only Membership Inference Attacks

1 code implementation • 28 Jul 2020 • Christopher A. Choquette-Choo, Florian Tramer, Nicholas Carlini, Nicolas Papernot

We empirically show that our label-only membership inference attacks perform on par with prior attacks that required access to model confidences.

L2 Regularization

Paper
Code

On Adaptive Attacks to Adversarial Example Defenses

4 code implementations • NeurIPS 2020 • Florian Tramer, Nicholas Carlini, Wieland Brendel, Aleksander Madry

Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples.

Paper
Code

SquirRL: Automating Attack Discovery on Blockchain Incentive Mechanisms with Deep Reinforcement Learning

1 code implementation • 4 Dec 2019 • Charlie Hou, Mingxun Zhou, Yan Ji, Phil Daian, Florian Tramer, Giulia Fanti, Ari Juels

Incentive mechanisms are central to the functionality of permissionless blockchains: they incentivize participants to run and secure the underlying consensus protocol.

Cryptography and Security

Paper
Code

Physical Adversarial Examples for Object Detectors

no code implementations • 20 Jul 2018 • Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Florian Tramer, Atul Prakash, Tadayoshi Kohno, Dawn Song

In this work, we extend physical attacks to more challenging object detection models, a broader class of deep learning algorithms widely used to detect and label multiple objects within a scene.

Object object-detection +1

Paper
Add Code

Note on Attacking Object Detectors with Adversarial Stickers

no code implementations • 21 Dec 2017 • Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Dawn Song, Tadayoshi Kohno, Amir Rahmati, Atul Prakash, Florian Tramer

Given the fact that state-of-the-art objection detection algorithms are harder to be fooled by the same set of adversarial examples, here we show that these detectors can also be attacked by physical adversarial examples.

Object

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.