no code implementations • 1 Feb 2024 • ZiHao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch
A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model.
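The two-stage pipeline described above can be sketched in miniature. Everything here is an illustrative assumption, not the paper's setup: responses are represented as small feature vectors, the reward model is linear and trained with a Bradley-Terry preference loss, and "updating the language model" is reduced to its simplest form, best-of-n selection under the learned reward.

```python
import math
import random

random.seed(0)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(prefs, dim, lr=0.1, epochs=200):
    """prefs: list of (x_chosen, x_rejected) feature-vector pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for xc, xr in prefs:
            # Bradley-Terry model: P(chosen > rejected) = sigmoid(r(xc) - r(xr))
            margin = dot(w, xc) - dot(w, xr)
            p = 1.0 / (1.0 + math.exp(-margin))
            grad_scale = 1.0 - p  # gradient of log-likelihood w.r.t. margin
            for i in range(dim):
                w[i] += lr * grad_scale * (xc[i] - xr[i])
    return w

# Toy preferences in which a higher first feature is always preferred.
prefs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.2, 0.7])]
w = train_reward_model(prefs, dim=2)

# Simplest possible "policy update": sample candidates and keep the one
# the learned reward model scores highest (best-of-n selection).
candidates = [[0.9, 0.3], [0.1, 0.8], [0.5, 0.5]]
best = max(candidates, key=lambda x: dot(w, x))
```

In practice the second stage is usually a reinforcement-learning update (e.g. PPO) against the reward model rather than best-of-n, but the sketch shows the same dependency: the policy only ever sees preferences through the learned reward.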
no code implementations • 14 Dec 2023 • Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant
However, even pretrained reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling, because all reward models in the ensemble exhibit similar error patterns.
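A minimal sketch of the aggregation idea, with hypothetical numbers (not the paper's implementation): a conservative aggregate such as the minimum only pays out a high reward when every ensemble member agrees, which helps when members' errors differ, but not when they share the same blind spot.

```python
def ensemble_reward(scores, mode="min"):
    """scores: one reward per ensemble member for a single candidate response."""
    if mode == "min":
        return min(scores)                # conservative: worst-case member
    if mode == "mean":
        return sum(scores) / len(scores)  # optimistic: average member
    raise ValueError(mode)

# Three members score a candidate; suppose one member has been "hacked" upward.
scores = [0.9, 0.8, 0.1]
conservative = ensemble_reward(scores, "min")   # hack does not pay off
optimistic = ensemble_reward(scores, "mean")    # hack partly pays off
```

If all three members were fooled the same way (e.g. all rewarded length), both aggregates would pay out, which is exactly the failure mode the abstract describes.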
2 code implementations • ICLR 2020 • David Madras, James Atwood, Alex D'Amour
We present local ensembles, a method for detecting underspecification -- when many possible predictors are consistent with the training data and model class -- at test time in a pre-trained model.
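A simplified illustration of the underspecification idea (not the paper's local-ensembles algorithm, which perturbs a single pretrained model rather than training many): fit several predictors that are all consistent with the training data, then flag test points where they disagree. The toy model class and data below are assumptions for the sketch.

```python
import random

random.seed(1)

def fit(points, a, b, lr=0.1, steps=200):
    """Toy model class y = a*x + b*x**2, fit by gradient descent on MSE.
    With a single training point, many (a, b) pairs fit equally well."""
    for _ in range(steps):
        for x, y in points:
            r = a * x + b * x * x - y
            a -= lr * 2 * r * x
            b -= lr * 2 * r * x * x
    return a, b

train = [(1.0, 2.0)]  # only constrains a + b = 2; a and b are underspecified

# An ensemble of equally good fits from different random initializations.
models = [fit(train, random.uniform(-3, 3), random.uniform(-3, 3))
          for _ in range(5)]

def disagreement(x):
    """Spread of ensemble predictions; a large spread at a test point
    signals that the training data did not pin down the answer there."""
    preds = [a * x + b * x * x for a, b in models]
    return max(preds) - min(preds)
```

At x = 1.0, where the training data lives, every member predicts 2.0 and the spread is essentially zero; at x = 2.0 the members diverge, exposing the underspecification.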
no code implementations • 17 Dec 2018 • Alexey A. Gritsenko, Alex D'Amour, James Atwood, Yoni Halpern, D. Sculley
We introduce the BriarPatch, a pixel-space intervention that obscures sensitive attributes from representations encoded in pre-trained classifiers.
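The mechanics of a pixel-space intervention can be sketched as follows. This is illustrative only: the BriarPatch itself is learned, and the optimization that makes the patch obscure sensitive attributes is omitted here; the sketch just shows how an additive patch is applied to a (grayscale, [0, 1]-valued) image before it reaches a pretrained classifier.

```python
def apply_patch(image, patch, top, left):
    """Additively apply a pixel patch to a region of a grayscale image.

    image, patch: nested lists of floats in [0, 1]; the patch is placed
    with its top-left corner at (top, left). Returns a new image, with
    values clipped so the result is still a valid pixel grid.
    """
    out = [row[:] for row in image]  # copy; the input image is untouched
    for i, patch_row in enumerate(patch):
        for j, p in enumerate(patch_row):
            v = out[top + i][left + j] + p
            out[top + i][left + j] = min(1.0, max(0.0, v))
    return out

# Apply a uniform 2x2 patch to a 4x4 black image at offset (1, 1).
image = [[0.0] * 4 for _ in range(4)]
patch = [[0.8, 0.8], [0.8, 0.8]]
patched = apply_patch(image, patch, 1, 1)
```

Because the intervention lives entirely in pixel space, it can be applied to inputs of a frozen, pretrained classifier without retraining the classifier itself.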