Search Results for author: Manas Joglekar

Found 1 paper, 0 papers with code

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

no code implementations • 14 Dec 2023 • Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior, for example to evaluate whether a model faithfully followed instructions or generated safe outputs.
