Search Results for author: Xing Huang

Found 2 papers, 1 paper with code

Preference as Reward, Maximum Preference Optimization with Importance Sampling

no code implementations · 27 Dec 2023 · Zaifan Jiang, Xing Huang, Chao Wei

Reinforcement Learning from Human Feedback (RLHF) is a model-based approach to preference learning: it first fits a reward model to preference scores and then optimizes the generating policy with an on-policy PPO algorithm to maximize that reward.
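The abstract describes the standard two-stage RLHF pipeline. Below is a minimal sketch of that pipeline, not the paper's implementation: stage 1 fits a reward model on preference pairs with a Bradley-Terry loss, and stage 2 updates a policy to maximize the learned reward. The toy data, model sizes, and the simplified policy-gradient update (standing in for full PPO with clipping and a KL penalty) are all assumptions made for brevity.

```python
# Illustrative two-stage RLHF sketch (assumed toy setup, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS = 8, 4

# --- Stage 1: fit a reward model to preference scores ------------------------
reward_model = nn.Sequential(nn.Linear(STATE_DIM + N_ACTIONS, 32), nn.ReLU(), nn.Linear(32, 1))
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def bradley_terry_loss(r_chosen, r_rejected):
    # Preference likelihood sigma(r_chosen - r_rejected); minimize its negative log.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy preference data: (prompt, chosen response, rejected response) as random features.
prompts = torch.randn(64, STATE_DIM)
chosen = torch.randn(64, N_ACTIONS)
rejected = torch.randn(64, N_ACTIONS)

for _ in range(200):
    r_c = reward_model(torch.cat([prompts, chosen], dim=-1))
    r_r = reward_model(torch.cat([prompts, rejected], dim=-1))
    loss = bradley_terry_loss(r_c, r_r)
    rm_opt.zero_grad(); loss.backward(); rm_opt.step()

# --- Stage 2: optimize the generating policy against the learned reward ------
policy = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(200):
    dist = torch.distributions.Categorical(logits=policy(prompts))
    actions = dist.sample()
    action_onehot = F.one_hot(actions, N_ACTIONS).float()
    with torch.no_grad():
        rewards = reward_model(torch.cat([prompts, action_onehot], dim=-1)).squeeze(-1)
    # Simple policy-gradient surrogate; PPO would clip an importance-weighted ratio instead.
    pg_loss = -(dist.log_prob(actions) * rewards).mean()
    pi_opt.zero_grad(); pg_loss.backward(); pi_opt.step()

print("mean learned reward after policy update:", rewards.mean().item())
```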
