Search Results for author: Tony T. Wang

Forbidden Facts: An Investigation of Competing Objectives in Llama-2

LLMs often face competing pressures (for example, helpfulness vs. harmlessness).

We conduct an in-depth investigation of foundation-model cliff-learning and study toy models of the phenomenon.

The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack.

