Search Results for author: Rafael Rafailov

Found 20 papers, 11 papers with code

Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

1 code implementation • 22 Apr 2024 • Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar

Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood on certain responses (i.e., employ a "negative gradient") outperform offline and maximum likelihood objectives.

Contrastive Learning, Reinforcement Learning (RL)

Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

1 code implementation • 22 Apr 2024 • Jan-Philipp Fränken, Eric Zelikman, Rafael Rafailov, Kanishk Gandhi, Tobias Gerstenberg, Noah D. Goodman

On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%.

Language Modelling

From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function

no code implementations • 18 Apr 2024 • Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn

Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm.

Language Modelling, Q-Learning, +1 more

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

no code implementations • 1 Apr 2024 • Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs?

Image Generation

Disentangling Length from Quality in Direct Preference Optimization

no code implementations • 28 Mar 2024 • Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn

A number of approaches have been developed to control those biases in the classical RLHF literature, but the problem remains relatively under-explored for Direct Alignment Algorithms such as Direct Preference Optimization (DPO).

reinforcement-learning

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

1 code implementation • 18 Feb 2024 • Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

This procedure is not perfect and can cause the model to hallucinate, that is, provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations.

Hallucination, Instruction Following, +1 more

MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot Learning

no code implementations • 6 Jan 2024 • Rafael Rafailov, Kyle Hatch, Victor Kolev, John D. Martin, Mariano Phielipp, Chelsea Finn

We study the problem of offline pre-training and online fine-tuning for reinforcement learning from high-dimensional observations in the context of realistic robot tasks.

Offline RL, Robot Manipulation

Diffusion Model Alignment Using Direct Preference Optimization

no code implementations • 21 Nov 2023 • Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik

Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences.

Contrastive Preference Learning: Learning from Human Feedback without RL

1 code implementation • 20 Oct 2023 • Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh

Thus, learning a reward function from feedback is not only based on a flawed assumption of human preference, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase.

reinforcement-learning, Reinforcement Learning (RL)

An Emulator for Fine-Tuning Large Language Models using Small Language Models

1 code implementation • 19 Oct 2023 • Eric Mitchell, Rafael Rafailov, Archit Sharma, Chelsea Finn, Christopher D. Manning

To aid in doing so, we introduce a novel technique for decoupling the knowledge and skills gained in these two stages, enabling a direct answer to the question, "What would happen if we combined the knowledge learned by a large model during pre-training with the knowledge learned by a small model during fine-tuning (or vice versa)?"

Instruction Following

Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias

1 code implementation • 12 Oct 2023 • Max Sobol Mark, Archit Sharma, Fahim Tajwar, Rafael Rafailov, Sergey Levine, Chelsea Finn

Can we leverage offline RL to recover better policies from online interaction?

D4RL, Offline RL, +2 more

Contrastive Example-Based Control

1 code implementation • 24 Jul 2023 • Kyle Hatch, Benjamin Eysenbach, Rafael Rafailov, Tianhe Yu, Ruslan Salakhutdinov, Sergey Levine, Chelsea Finn

In this paper, we propose a method for offline, example-based control that learns an implicit model of multi-step transitions, rather than a reward function.

Offline RL

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

14 code implementations • NeurIPS 2023 • Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF).

Language Modelling, reinforcement-learning, +1 more

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

no code implementations • 24 May 2023 • Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D. Manning

A trustworthy real-world prediction system should produce well-calibrated confidence scores; that is, its confidence in an answer should be indicative of the likelihood that the answer is correct, enabling deferral to an expert in cases of low-confidence predictions.

TriviaQA, Unsupervised Pre-training

Vision-Based Manipulators Need to Also See from Their Hands

no code implementations • ICLR 2022 • Kyle Hsu, Moo Jin Kim, Rafael Rafailov, Jiajun Wu, Chelsea Finn

We study how the choice of visual perspective affects learning and generalization in the context of physical manipulation from raw sensor observations.

Out-of-Distribution Generalization

Visual Adversarial Imitation Learning using Variational Models

no code implementations • NeurIPS 2021 • Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, Chelsea Finn

We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions.

Imitation Learning, Representation Learning

Variational Model-Based Imitation Learning in High-Dimensional Observation Spaces

no code implementations • ICLR Workshop SSL-RL 2021 • Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, Chelsea Finn

We consider the problem setting of imitation learning where the agent is provided a fixed dataset of demonstrations.

Imitation Learning, Vocal Bursts Intensity Prediction

COMBO: Conservative Offline Model-Based Policy Optimization

4 code implementations • NeurIPS 2021 • Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, Chelsea Finn

We overcome this limitation by developing a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action tuples generated via rollouts under the learned model.

Offline RL, Uncertainty Quantification

Offline Reinforcement Learning from Images with Latent Space Models

1 code implementation • 21 Dec 2020 • Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, Chelsea Finn

In this work, we build on recent advances in model-based algorithms for offline RL, and extend them to high-dimensional visual observation spaces.

Offline RL, reinforcement-learning, +1 more

Offline Meta-Reinforcement Learning with Advantage Weighting

2 code implementations • 13 Aug 2020 • Eric Mitchell, Rafael Rafailov, Xue Bin Peng, Sergey Levine, Chelsea Finn

That is, in offline meta-RL, we meta-train on fixed, pre-collected data from several tasks in order to adapt to a new task with a very small amount (less than 5 trajectories) of data from the new task.

Machine Translation, Meta-Learning, +5 more
