no code implementations • 14 Dec 2023 • Tony T. Wang, Miles Wang, Kaivalya Hariharan, Nir Shavit
LLMs often face competing pressures (for example, helpfulness vs. harmlessness).
no code implementations • 14 Feb 2023 • Tony T. Wang, Igor Zablotchi, Nir Shavit, Jonathan S. Rosenfeld
We conduct an in-depth investigation of foundation-model cliff-learning and study toy models of the phenomenon.
2 code implementations • 1 Nov 2022 • Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell
The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack.