no code implementations • 1 Feb 2024 • ZiHao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch
A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model.
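The two-stage pipeline described above can be sketched in miniature. Everything here is an illustrative assumption, not the paper's setup: responses are represented as small feature vectors, the reward model is linear and trained with a Bradley-Terry preference loss, and "updating the language model" is reduced to its simplest form, best-of-n selection under the learned reward.

```python
import math
import random

random.seed(0)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(prefs, dim, lr=0.1, epochs=200):
    """prefs: list of (x_chosen, x_rejected) feature-vector pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for xc, xr in prefs:
            # Bradley-Terry model: P(chosen > rejected) = sigmoid(r(xc) - r(xr))
            margin = dot(w, xc) - dot(w, xr)
            p = 1.0 / (1.0 + math.exp(-margin))
            grad_scale = 1.0 - p  # gradient of log-likelihood w.r.t. margin
            for i in range(dim):
                w[i] += lr * grad_scale * (xc[i] - xr[i])
    return w

# Toy preferences in which a higher first feature is always preferred.
prefs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.2, 0.7])]
w = train_reward_model(prefs, dim=2)

# Simplest possible "policy update": sample candidates and keep the one
# the learned reward model scores highest (best-of-n selection).
candidates = [[0.9, 0.3], [0.1, 0.8], [0.5, 0.5]]
best = max(candidates, key=lambda x: dot(w, x))
```

In practice the second stage is usually a reinforcement-learning update (e.g. PPO) against the reward model rather than best-of-n, but the sketch shows the same dependency: the policy only ever sees preferences through the learned reward.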
no code implementations • 14 Dec 2023 • Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant
However, even pretrained reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling, because all reward models in the ensemble exhibit similar error patterns.
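A minimal sketch of the aggregation idea, with hypothetical numbers (not the paper's implementation): a conservative aggregate such as the minimum only pays out a high reward when every ensemble member agrees, which helps when members' errors differ, but not when they share the same blind spot.

```python
def ensemble_reward(scores, mode="min"):
    """scores: one reward per ensemble member for a single candidate response."""
    if mode == "min":
        return min(scores)                # conservative: worst-case member
    if mode == "mean":
        return sum(scores) / len(scores)  # optimistic: average member
    raise ValueError(mode)

# Three members score a candidate; suppose one member has been "hacked" upward.
scores = [0.9, 0.8, 0.1]
conservative = ensemble_reward(scores, "min")   # hack does not pay off
optimistic = ensemble_reward(scores, "mean")    # hack partly pays off
```

If all three members were fooled the same way (e.g. all rewarded length), both aggregates would pay out, which is exactly the failure mode the abstract describes.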
2 code implementations • ICLR 2020 • David Madras, James Atwood, Alex D'Amour
We present local ensembles, a method for detecting underspecification -- when many possible predictors are consistent with the training data and model class -- at test time in a pre-trained model.
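A simplified illustration of the underspecification idea (not the paper's local-ensembles algorithm, which perturbs a single pretrained model rather than training many): fit several predictors that are all consistent with the training data, then flag test points where they disagree. The toy model class and data below are assumptions for the sketch.

```python
import random

random.seed(1)

def fit(points, a, b, lr=0.1, steps=200):
    """Toy model class y = a*x + b*x**2, fit by gradient descent on MSE.
    With a single training point, many (a, b) pairs fit equally well."""
    for _ in range(steps):
        for x, y in points:
            r = a * x + b * x * x - y
            a -= lr * 2 * r * x
            b -= lr * 2 * r * x * x
    return a, b

train = [(1.0, 2.0)]  # only constrains a + b = 2; a and b are underspecified

# An ensemble of equally good fits from different random initializations.
models = [fit(train, random.uniform(-3, 3), random.uniform(-3, 3))
          for _ in range(5)]

def disagreement(x):
    """Spread of ensemble predictions; a large spread at a test point
    signals that the training data did not pin down the answer there."""
    preds = [a * x + b * x * x for a, b in models]
    return max(preds) - min(preds)
```

At x = 1.0, where the training data lives, every member predicts 2.0 and the spread is essentially zero; at x = 2.0 the members diverge, exposing the underspecification.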
no code implementations • 17 Dec 2018 • Alexey A. Gritsenko, Alex D'Amour, James Atwood, Yoni Halpern, D. Sculley
We introduce the BriarPatch, a pixel-space intervention that obscures sensitive attributes from representations encoded in pre-trained classifiers.
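The mechanics of a pixel-space intervention can be sketched as follows. This is illustrative only: the BriarPatch itself is learned, and the optimization that makes the patch obscure sensitive attributes is omitted here; the sketch just shows how an additive patch is applied to a (grayscale, [0, 1]-valued) image before it reaches a pretrained classifier.

```python
def apply_patch(image, patch, top, left):
    """Additively apply a pixel patch to a region of a grayscale image.

    image, patch: nested lists of floats in [0, 1]; the patch is placed
    with its top-left corner at (top, left). Returns a new image, with
    values clipped so the result is still a valid pixel grid.
    """
    out = [row[:] for row in image]  # copy; the input image is untouched
    for i, patch_row in enumerate(patch):
        for j, p in enumerate(patch_row):
            v = out[top + i][left + j] + p
            out[top + i][left + j] = min(1.0, max(0.0, v))
    return out

# Apply a uniform 2x2 patch to a 4x4 black image at offset (1, 1).
image = [[0.0] * 4 for _ in range(4)]
patch = [[0.8, 0.8], [0.8, 0.8]]
patched = apply_patch(image, patch, 1, 1)
```

Because the intervention lives entirely in pixel space, it can be applied to inputs of a frozen, pretrained classifier without retraining the classifier itself.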