no code implementations • 27 Dec 2023 • Zaifan Jiang, Xing Huang, Chao Wei
Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm for preference learning: it first fits a reward model to preference scores, and then optimizes the generation policy with an on-policy PPO algorithm to maximize that reward.
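To make the two-stage pipeline concrete, here is a minimal sketch in PyTorch. It is an illustration of the generic RLHF recipe described in the abstract, not the authors' implementation: the model classes, tensor shapes, and hyperparameters are all assumptions, and the reward model is trained with the standard Bradley-Terry pairwise objective commonly used for preference scores.

```python
# Stage 1: fit a reward model on preference pairs; Stage 2: optimize the
# policy against it with a clipped PPO-style surrogate. Everything below
# is a toy illustration, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy scorer over fixed-size response features."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def reward_loss(rm: RewardModel, chosen: torch.Tensor,
                rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry preference loss: maximize log sigma(r(chosen) - r(rejected)),
    # i.e. chosen responses should score higher than rejected ones.
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

def ppo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
             advantage: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    # Clipped on-policy surrogate: limit how far the new policy moves
    # from the policy that generated the samples.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()

if __name__ == "__main__":
    # Stage 1 on random stand-in features for chosen/rejected responses.
    rm = RewardModel(dim=16)
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
    opt.zero_grad()
    loss = reward_loss(rm, chosen, rejected)
    loss.backward()
    opt.step()
    print(f"reward-model loss: {loss.item():.4f}")

    # Stage 2 on fake log-probabilities and reward-derived advantages.
    logp_old = torch.randn(8)
    logp_new = logp_old + 0.01 * torch.randn(8)
    adv = torch.randn(8)
    print(f"ppo surrogate loss: {ppo_loss(logp_new, logp_old, adv).item():.4f}")
```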
1 code implementation • Geoscientific Model Development 2019 • Xiaomeng Huang, Xing Huang, Dong Wang, Qi Wu, Yi Li, Shixun Zhang, YuWen Chen, Mingqing Wang, Yuan Gao, Qiang Tang, Yue Chen, Zheng Fang, Zhenya Song, Guangwen Yang
In this work, we design a simple computing library that bridges the gap between ocean modeling and parallel computing, decoupling the work of model development from the details of parallel execution.
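The design idea is that model developers write array expressions while the library decides how to execute them in parallel. The sketch below illustrates that separation with a toy operator abstraction; the class and method names (`GridArray`, `dx`) are hypothetical and do not reflect the library's actual API.

```python
# A toy operator abstraction: model code reads like math, while the
# library owns the execution strategy. Names here are illustrative only.
import numpy as np

class GridArray:
    """Wraps a NumPy array; a real parallel backend would hide domain
    decomposition and halo exchange behind these same operators."""
    def __init__(self, data: np.ndarray):
        self.data = np.asarray(data, dtype=float)

    def __add__(self, other: "GridArray") -> "GridArray":
        return GridArray(self.data + other.data)

    def __mul__(self, scalar: float) -> "GridArray":
        return GridArray(self.data * scalar)

    def dx(self, spacing: float = 1.0) -> "GridArray":
        # Central difference along the x axis; in a distributed backend,
        # this is where neighbor (halo) communication would occur.
        left = np.roll(self.data, 1, axis=0)
        right = np.roll(self.data, -1, axis=0)
        return GridArray((right - left) / (2.0 * spacing))

if __name__ == "__main__":
    u = GridArray(np.linspace(0.0, 1.0, 8).reshape(8, 1) * np.ones((8, 8)))
    # The model expression is independent of how it is parallelized.
    tendency = u.dx(spacing=0.1) * 0.5 + u
    print(tendency.data.shape)
```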