no code implementations • 27 Dec 2023 • Zaifan Jiang, Xing Huang, Chao Wei
Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm for preference learning: it first fits a reward model to preference scores, then optimizes the generation policy with the on-policy PPO algorithm to maximize the modeled reward.
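The two-stage pipeline described in this entry can be sketched in miniature: stage 1 fits a Bradley-Terry reward model to preference pairs, stage 2 improves the policy against the learned reward. In this toy one-dimensional sketch, plain gradient ascent stands in for the on-policy PPO step; all names and the scalar setup are illustrative assumptions, not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_reward_model(pairs, lr=0.1, steps=200):
    """Fit w so that reward(y) = w * y ranks the preferred item higher.

    pairs: list of (y_preferred, y_rejected) scalars.
    Loss per pair (Bradley-Terry): -log sigmoid(w * (y_pref - y_rej)).
    """
    w = 0.0
    for _ in range(steps):
        for y_pref, y_rej in pairs:
            d = y_pref - y_rej
            w += lr * (1.0 - sigmoid(w * d)) * d  # gradient ascent on log-likelihood
    return w

def improve_policy(w, theta=0.0, lr=0.05, steps=100):
    """Push the policy's (deterministic) output theta toward higher modeled reward.

    Plain gradient ascent here is a stand-in for PPO: with reward(theta) = w * theta,
    the gradient with respect to theta is simply w.
    """
    for _ in range(steps):
        theta += lr * w
    return theta

# Preferences consistently favor larger y, so the learned w is positive
# and the policy output drifts upward.
pairs = [(1.0, -1.0), (0.5, -0.2), (2.0, 0.1)]
w = fit_reward_model(pairs)
theta = improve_policy(w)
```

The separation of the two stages mirrors the model-based structure noted above: the reward model is fit once from preferences, and only its scores (not the raw preferences) drive the policy update.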
no code implementations • 3 Mar 2023 • Shuai Xiao, Zaifan Jiang, Shuang Yang
Finding optimal configurations in a geometric space is a key challenge in many technological disciplines.
no code implementations • 2 Mar 2023 • Shuai Xiao, Le Guo, Zaifan Jiang, Lei Lv, Yuanbo Chen, Jun Zhu, Shuang Yang
Furthermore, we show that the dual problem can be solved by policy learning, with the optimal dual variable found efficiently via bisection search (i.e., by exploiting its monotonicity).
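The bisection step can be sketched generically: assuming the constraint gap is monotonically decreasing in the dual variable (the monotonicity the entry refers to), repeatedly halving a bracketing interval locates the optimal dual variable. The function names and the toy gap below are illustrative assumptions, not taken from the paper.

```python
def bisect_dual(constraint_gap, lo=0.0, hi=100.0, tol=1e-6):
    """Locate the dual variable where a monotone constraint gap crosses zero.

    Assumes constraint_gap is decreasing in the dual variable, with
    constraint_gap(lo) > 0 > constraint_gap(hi), so a root is bracketed.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if constraint_gap(mid) > 0.0:
            lo = mid  # constraint still violated: raise the penalty
        else:
            hi = mid  # constraint satisfied: lower the penalty
    return 0.5 * (lo + hi)

# Toy monotone gap with its root at 3.0, standing in for the true dual objective.
lam = bisect_dual(lambda l: 3.0 - l)
```

Each evaluation of the gap would, in the paper's setting, involve solving the inner policy-learning problem at that fixed dual variable; monotonicity is what lets a one-dimensional bisection replace a full search.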
no code implementations • 3 Aug 2022 • Jiarui Jin, Xianyu Chen, Weinan Zhang, Yuanbo Chen, Zaifan Jiang, Zekun Zhu, Zhewen Su, Yong Yu
Modelling users' multiple behaviors is an essential part of modern e-commerce; a widely adopted application is jointly optimizing click-through rate (CTR) and conversion rate (CVR) predictions.
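Joint CTR/CVR prediction is commonly handled by factoring the click-and-conversion probability over a shared representation as pCTCVR = pCTR × pCVR, since a conversion can only follow a click. The sketch below illustrates that decomposition with toy linear heads; the weights and names are illustrative assumptions, not the paper's architecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(features, w_shared, w_ctr, w_cvr):
    """Score one impression with a shared bottom and two task heads.

    Returns (pCTR, pCTCVR), where pCTCVR = pCTR * pCVR encodes that a
    conversion can only happen after a click.
    """
    hidden = [f * w for f, w in zip(features, w_shared)]  # shared representation
    p_ctr = sigmoid(sum(h * w for h, w in zip(hidden, w_ctr)))
    p_cvr = sigmoid(sum(h * w for h, w in zip(hidden, w_cvr)))
    return p_ctr, p_ctr * p_cvr

# Toy impression: the joint click-and-convert estimate never exceeds
# the click estimate alone.
p_ctr, p_ctcvr = predict([1.0, 0.5], [0.8, -0.3], [0.4, 0.9], [0.2, 0.1])
```

Sharing the bottom representation between the two heads is what makes the optimization joint: gradients from both the CTR and CVR objectives flow into the same features.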
no code implementations • 9 Feb 2022 • Jiarui Jin, Xianyu Chen, Yuanbo Chen, Weinan Zhang, Renting Rui, Zaifan Jiang, Zhewen Su, Yong Yu
With the prevalence of the live broadcast business, a new type of recommendation service, called live broadcast recommendation, is widely used in many mobile e-commerce apps.