1 code implementation • 18 Apr 2024 • Tomasz Korbak
In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching.
1 code implementation • 15 Apr 2024 • Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs).
no code implementations • 1 Apr 2024 • Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo
The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs?
1 code implementation • 20 Oct 2023 • Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
1 code implementation • 17 Oct 2023 • Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman
As language models (LMs) become more capable, it is increasingly important to align them with human preferences.
2 code implementations • 21 Sep 2023 • Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A".
1 code implementation • 1 Sep 2023 • Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
At test time, we assess whether the model can pass the test.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
no code implementations • 15 Jun 2023 • Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zhengping Zhou, Najoung Kim, Samuel R. Bowman, Ethan Perez
Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data.
1 code implementation • 28 Mar 2023 • Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, Ethan Perez
The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development.
1 code implementation • 28 Mar 2023 • Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, Ethan Perez
Third, we finetune the language model to maximize the likelihood of the chosen refinement given the input.
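The third step above — supervising only the refinement while conditioning on the input — can be sketched as a data-preparation routine. This is an assumed illustration, not the authors' code; it uses the label value -100, the ignore-index convention of common Transformer training libraries, to mask input positions out of the loss.

```python
# Sketch (assumption, not the paper's implementation): build a training
# example so that the loss is computed only on refinement tokens, with the
# input serving purely as conditioning context.

IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_example(input_ids, refinement_ids):
    """Concatenate input and refinement; supervise only the refinement."""
    ids = list(input_ids) + list(refinement_ids)
    labels = [IGNORE_INDEX] * len(input_ids) + list(refinement_ids)
    return ids, labels

# Toy token ids standing in for a tokenized (input, refinement) pair.
ids, labels = build_example([10, 11, 12], [20, 21])
print(ids)     # [10, 11, 12, 20, 21]
print(labels)  # [-100, -100, -100, 20, 21]
```

Maximizing the likelihood of these labels is then ordinary cross-entropy training on the concatenated sequence.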
no code implementations • 8 Mar 2023 • Julian Zubek, Tomasz Korbak, Joanna Rączaszek-Leonardi
Computational simulations are a popular method for testing hypotheses about the emergence of communication.
1 code implementation • 16 Feb 2023 • Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, Marc Dymetman
We show that Jensen-Shannon divergence strikes a good balance between these objectives, and frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work.
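The two divergences compared in the abstract can be made concrete for discrete distributions. A minimal sketch (standard definitions, not code from the paper): forward KL is asymmetric, while Jensen-Shannon is symmetric and bounded by log 2.

```python
import math

def kl(p, q):
    """Forward KL divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture midpoint
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(kl(p, q))  # asymmetric: kl(p, q) != kl(q, p) in general
print(js(p, q))  # symmetric: js(p, q) == js(q, p)
```

The boundedness of JS is one intuition for why it can behave better than forward KL, which blows up wherever q puts near-zero mass on events p supports.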
1 code implementation • 16 Feb 2023 • Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez
Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more.
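One family of objectives studied in this setting is conditional training: tagging each pretraining segment with a control token derived from a preference score, so the model can later be conditioned on the desirable tag. A minimal sketch under assumptions — the token names and threshold here are illustrative, not taken from the paper:

```python
# Sketch of conditional training data annotation (assumed details: the
# control-token names and the reward threshold are hypothetical).

GOOD, BAD = "<|good|>", "<|bad|>"

def annotate(segment, reward, threshold=0.0):
    """Prefix a text segment with a control token based on its reward score."""
    tag = GOOD if reward >= threshold else BAD
    return tag + segment

print(annotate("def add(a, b): return a + b", reward=0.9))
print(annotate("def add(a, b): return a - b", reward=-0.5))
```

At inference time, generation is conditioned on the good-behavior token, steering the model toward content the scorer preferred.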
2 code implementations • 1 Jun 2022 • Tomasz Korbak, Hady Elsahar, Germán Kruszewski, Marc Dymetman
Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM.
no code implementations • 23 May 2022 • Tomasz Korbak, Ethan Perez, Christopher L Buckley
We show that KL-regularised RL is equivalent to variational inference: approximating a Bayesian posterior which specifies how to update a prior LM to conform with evidence provided by the reward function.
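The equivalence claimed in the abstract can be written out; this is the standard result, with notation assumed here rather than copied from the paper:

```latex
% KL-regularised RL objective for a policy \pi with prior LM \pi_0,
% reward r, and regularisation strength \beta:
J(\pi) = \mathbb{E}_{x \sim \pi}\left[r(x)\right]
         - \beta \, \mathrm{KL}\!\left(\pi \,\|\, \pi_0\right)
% It is maximised by the Gibbs / Bayesian-posterior distribution
\pi^*(x) = \frac{1}{Z}\, \pi_0(x) \exp\!\left(r(x)/\beta\right),
\qquad
Z = \sum_x \pi_0(x) \exp\!\left(r(x)/\beta\right)
% i.e. the prior \pi_0 updated by evidence proportional to \exp(r(x)/\beta),
% which is the variational-inference reading of the abstract.
```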
no code implementations • 18 Jan 2022 • Anil Seth, Tomasz Korbak, Alexander Tschantz
Bruineberg and colleagues helpfully distinguish between instrumental and ontological interpretations of Markov blankets, exposing the dangers of using the former to make claims about the latter.
1 code implementation • 1 Dec 2021 • Tomasz Korbak, Hady Elsahar, German Kruszewski, Marc Dymetman
Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks.
no code implementations • NeurIPS 2021 • Łukasz Kuciński, Tomasz Korbak, Paweł Kołodziej, Piotr Miłoś
Communication is compositional if complex signals can be represented as a combination of simpler subparts.
no code implementations • 29 Sep 2021 • Tomasz Korbak, Hady Elsahar, Germán Kruszewski, Marc Dymetman
The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a "training from scratch" to a "fine-tuning" paradigm.
1 code implementation • 9 Jun 2021 • Tomasz Korbak, Hady Elsahar, Marc Dymetman, Germán Kruszewski
Neural language models can be successfully trained on source code, leading to applications such as code completion.
1 code implementation • 28 Oct 2020 • Tomasz Korbak, Julian Zubek, Joanna Rączaszek-Leonardi
Compositionality is an important explanatory target in emergent communication and language evolution.
1 code implementation • 4 Oct 2019 • Tomasz Korbak, Julian Zubek, Łukasz Kuciński, Piotr Miłoś, Joanna Rączaszek-Leonardi
This paper explores a novel approach to achieving emergent compositional communication in multi-agent systems.
no code implementations • 17 Jun 2019 • Renard Korzeniowski, Rafał Rolczyński, Przemysław Sadownik, Tomasz Korbak, Marcin Możejko
This paper presents our contribution to PolEval 2019 Task 6: Hate speech and bullying detection.
1 code implementation • 3 Nov 2017 • Tomasz Korbak, Paulina Żak
We describe a variant of the Child-Sum Tree-LSTM deep neural network (Tai et al., 2015), fine-tuned for working with dependency trees and morphologically rich languages, using Polish as an example.