1 code implementation • 18 Apr 2024 • Tomasz Korbak
In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching.
1 code implementation • 15 Apr 2024 • Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs).
no code implementations • 1 Apr 2024 • Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo
The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs?
1 code implementation • 20 Oct 2023 • Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
1 code implementation • 17 Oct 2023 • Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman
As language models (LMs) become more capable, it is increasingly important to align them with human preferences.
2 code implementations • 21 Sep 2023 • Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A".
1 code implementation • 1 Sep 2023 • Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
At test time, we assess whether the model can pass the test.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
no code implementations • 15 Jun 2023 • Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zhengping Zhou, Najoung Kim, Samuel R. Bowman, Ethan Perez
Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data.
1 code implementation • 28 Mar 2023 • Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, Ethan Perez
The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development.
1 code implementation • 28 Mar 2023 • Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, Ethan Perez
Third, we finetune the language model to maximize the likelihood of the chosen refinement given the input.
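The third step above — supervising only the refinement while conditioning on the input — can be sketched as a data-preparation routine. This is an assumed illustration, not the authors' code; it uses the label value -100, the ignore-index convention of common Transformer training libraries, to mask input positions out of the loss.

```python
# Sketch (assumption, not the paper's implementation): build a training
# example so that the loss is computed only on refinement tokens, with the
# input serving purely as conditioning context.

IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_example(input_ids, refinement_ids):
    """Concatenate input and refinement; supervise only the refinement."""
    ids = list(input_ids) + list(refinement_ids)
    labels = [IGNORE_INDEX] * len(input_ids) + list(refinement_ids)
    return ids, labels

# Toy token ids standing in for a tokenized (input, refinement) pair.
ids, labels = build_example([10, 11, 12], [20, 21])
print(ids)     # [10, 11, 12, 20, 21]
print(labels)  # [-100, -100, -100, 20, 21]
```

Maximizing the likelihood of these labels is then ordinary cross-entropy training on the concatenated sequence.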
no code implementations • 8 Mar 2023 • Julian Zubek, Tomasz Korbak, Joanna Rączaszek-Leonardi
Computational simulations are a popular method for testing hypotheses about the emergence of communication.
1 code implementation • 16 Feb 2023 • Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, Marc Dymetman
We show that Jensen-Shannon divergence strikes a good balance between these objectives, and frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work.
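The two divergences compared in the abstract can be made concrete for discrete distributions. A minimal sketch (standard definitions, not code from the paper): forward KL is asymmetric, while Jensen-Shannon is symmetric and bounded by log 2.

```python
import math

def kl(p, q):
    """Forward KL divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture midpoint
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(kl(p, q))  # asymmetric: kl(p, q) != kl(q, p) in general
print(js(p, q))  # symmetric: js(p, q) == js(q, p)
```

The boundedness of JS is one intuition for why it can behave better than forward KL, which blows up wherever q puts near-zero mass on events p supports.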
1 code implementation • 16 Feb 2023 • Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez
Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more.
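One family of objectives studied in this setting is conditional training: tagging each pretraining segment with a control token derived from a preference score, so the model can later be conditioned on the desirable tag. A minimal sketch under assumptions — the token names and threshold here are illustrative, not taken from the paper:

```python
# Sketch of conditional training data annotation (assumed details: the
# control-token names and the reward threshold are hypothetical).

GOOD, BAD = "<|good|>", "<|bad|>"

def annotate(segment, reward, threshold=0.0):
    """Prefix a text segment with a control token based on its reward score."""
    tag = GOOD if reward >= threshold else BAD
    return tag + segment

print(annotate("def add(a, b): return a + b", reward=0.9))
print(annotate("def add(a, b): return a - b", reward=-0.5))
```

At inference time, generation is conditioned on the good-behavior token, steering the model toward content the scorer preferred.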
2 code implementations • 1 Jun 2022 • Tomasz Korbak, Hady Elsahar, Germán Kruszewski, Marc Dymetman
Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM.
no code implementations • 23 May 2022 • Tomasz Korbak, Ethan Perez, Christopher L Buckley
We show that KL-regularised RL is equivalent to variational inference: approximating a Bayesian posterior which specifies how to update a prior LM to conform with evidence provided by the reward function.
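The equivalence claimed in the abstract can be written out; this is the standard result, with notation assumed here rather than copied from the paper:

```latex
% KL-regularised RL objective for a policy \pi with prior LM \pi_0,
% reward r, and regularisation strength \beta:
J(\pi) = \mathbb{E}_{x \sim \pi}\left[r(x)\right]
         - \beta \, \mathrm{KL}\!\left(\pi \,\|\, \pi_0\right)
% It is maximised by the Gibbs / Bayesian-posterior distribution
\pi^*(x) = \frac{1}{Z}\, \pi_0(x) \exp\!\left(r(x)/\beta\right),
\qquad
Z = \sum_x \pi_0(x) \exp\!\left(r(x)/\beta\right)
% i.e. the prior \pi_0 updated by evidence proportional to \exp(r(x)/\beta),
% which is the variational-inference reading of the abstract.
```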
no code implementations • 18 Jan 2022 • Anil Seth, Tomasz Korbak, Alexander Tschantz
Bruineberg and colleagues helpfully distinguish between instrumental and ontological interpretations of Markov blankets, exposing the dangers of using the former to make claims about the latter.
1 code implementation • 1 Dec 2021 • Tomasz Korbak, Hady Elsahar, German Kruszewski, Marc Dymetman
Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks.
no code implementations • NeurIPS 2021 • Łukasz Kuciński, Tomasz Korbak, Paweł Kołodziej, Piotr Miłoś
Communication is compositional if complex signals can be represented as a combination of simpler subparts.
no code implementations • 29 Sep 2021 • Tomasz Korbak, Hady Elsahar, Germán Kruszewski, Marc Dymetman
The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a "training from scratch" to a "fine-tuning" paradigm.
1 code implementation • 9 Jun 2021 • Tomasz Korbak, Hady Elsahar, Marc Dymetman, Germán Kruszewski
Neural language models can be successfully trained on source code, leading to applications such as code completion.
1 code implementation • 28 Oct 2020 • Tomasz Korbak, Julian Zubek, Joanna Rączaszek-Leonardi
Compositionality is an important explanatory target in emergent communication and language evolution.
1 code implementation • 4 Oct 2019 • Tomasz Korbak, Julian Zubek, Łukasz Kuciński, Piotr Miłoś, Joanna Rączaszek-Leonardi
This paper explores a novel approach to achieving emergent compositional communication in multi-agent systems.
no code implementations • 17 Jun 2019 • Renard Korzeniowski, Rafał Rolczyński, Przemysław Sadownik, Tomasz Korbak, Marcin Możejko
This paper presents our contribution to PolEval 2019 Task 6: Hate speech and bullying detection.
1 code implementation • 3 Nov 2017 • Tomasz Korbak, Paulina Żak
We describe a variant of the Child-Sum Tree-LSTM deep neural network (Tai et al., 2015), fine-tuned for working with dependency trees and morphologically rich languages, using Polish as an example.