no code implementations • 20 Feb 2024 • Runlong Zhou, Simon S. Du, Beibin Li
We propose Reflect-RL, a two-player system to fine-tune an LM using online RL, where a frozen reflection model assists the policy model.
1 code implementation • 30 Oct 2023 • Zhaoyi Zhou, Chuning Zhu, Runlong Zhou, Qiwen Cui, Abhishek Gupta, Simon Shaolei Du
Off-policy dynamic programming (DP) techniques such as $Q$-learning have proven to be important in sequential decision-making problems.
no code implementations • 31 Jan 2023 • Runlong Zhou, Zihan Zhang, Simon S. Du
We further initiate the study of model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule.
no code implementations • 20 Oct 2022 • Runlong Zhou, Ruosong Wang, Simon S. Du
We complement our positive result with a novel $\Omega(\sqrt{\mathsf{Var}^\star M S A K})$ regret lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when $\Gamma$ is a constant for the class of variance-bounded LMDPs.
1 code implementation • 11 Feb 2022 • Runlong Zhou, Zelin He, Yuandong Tian, Yi Wu, Simon S. Du
Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem.
no code implementations • NeurIPS 2021 • Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
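As a sketch of the objective just described, using notation that is assumed here and does not come from the listing (cost function $c$, goal state $g$, policy $\pi$, initial state $s_1 \neq g$, and hitting time $\tau_g$), the SSP problem can be written as:

$$
\min_{\pi} \; \mathbb{E}^{\pi}\!\left[ \sum_{t=1}^{\tau_g} c(s_t, a_t) \,\middle|\, s_1 \right], \qquad \tau_g = \min\{ t \ge 1 : s_{t+1} = g \}.
$$

Here $\tau_g$ is the random number of steps taken before the goal is reached, so the sum is the total cost accumulated along the trajectory.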