no code implementations • 11 Jan 2024 • Yuanzhao Zhai, Yiying Li, Zijian Gao, Xudong Gong, Kele Xu, Dawei Feng, Ding Bo, Huaimin Wang
ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization.
Offline RL • Reinforcement Learning (RL)