no code implementations • 16 Apr 2024 • Qiwei Di, Jiafan He, Quanquan Gu
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs).
no code implementations • 14 Feb 2024 • Qiwei Di, Jiafan He, Dongruo Zhou, Quanquan Gu
Our algorithm achieves an $\tilde{\mathcal O}(dB_*\sqrt{K})$ regret bound, where $d$ is the dimension of the feature mapping in the linear transition kernel, $B_*$ is the upper bound of the total cumulative cost for the optimal policy, and $K$ is the number of episodes.
no code implementations • 2 Oct 2023 • Qiwei Di, Heyang Zhao, Jiafan He, Quanquan Gu
However, few existing works on offline RL with non-linear function approximation provide instance-dependent regret guarantees.
no code implementations • 2 Oct 2023 • Qiwei Di, Tao Jin, Yue Wu, Heyang Zhao, Farzad Farnoud, Quanquan Gu
Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems.
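To make the preferential-feedback protocol concrete, here is a minimal sketch of a dueling bandit interaction loop. It is an illustrative baseline, not the algorithm from the paper: the Bradley-Terry preference model, the uniform pair sampling, and the Borda-style winner selection are all simplifying assumptions for exposition.

```python
import random

def duel(i, j, true_means):
    """Simulate preferential feedback: return True if arm i beats arm j.
    P(i beats j) follows a Bradley-Terry-style model (an illustrative assumption)."""
    p = true_means[i] / (true_means[i] + true_means[j])
    return random.random() < p

def run_dueling_bandit(true_means, horizon=2000, seed=0):
    random.seed(seed)
    k = len(true_means)
    wins = [[0] * k for _ in range(k)]  # wins[i][j]: times arm i beat arm j
    for _ in range(horizon):
        # Uniform exploration: sample a pair of distinct arms to duel.
        # (Real dueling-bandit algorithms choose pairs adaptively.)
        i, j = random.sample(range(k), 2)
        if duel(i, j, true_means):
            wins[i][j] += 1
        else:
            wins[j][i] += 1
    # Borda-style score: total number of duels each arm has won.
    scores = [sum(row) for row in wins]
    return scores.index(max(scores))

best = run_dueling_bandit([0.9, 0.5, 0.3, 0.2])
```

The key point the sketch illustrates is that the learner never observes numeric rewards, only binary pairwise comparisons, which is exactly the feedback type available in ranking and recommendation settings with human judges.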