Search Results for author: Itamar Pres

Found 1 paper, 1 paper with code

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

1 code implementation • 3 Jan 2024 • Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea

While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms in which models become "aligned", thus making it difficult to explain phenomena like jailbreaks.

Language Modelling
