no code implementations • NeurIPS 2018 • Hexiang Hu, Liyu Chen, Boqing Gong, Fei Sha
The ability to transfer in reinforcement learning is key to building an agent with general artificial intelligence.
no code implementations • 11 Mar 2024 • Yufeng Zhang, Liyu Chen, Boyi Liu, Yingxiang Yang, Qiwen Cui, Yunzhe Tao, Hongxia Yang
Recent advances in reinforcement learning (RL) algorithms aim to enhance the performance of language models at scale.
no code implementations • 4 Oct 2023 • Zishun Yu, Yunzhe Tao, Liyu Chen, Tao Sun, Hongxia Yang
Despite policy-based RL methods dominating the literature on RL for program synthesis, the nature of program synthesis tasks hints at a natural alignment with value-based methods.
no code implementations • 7 Feb 2023 • Liyu Chen, Andrea Tirinzoni, Alessandro Lazaric, Matteo Pirotta
We leverage these results to design Layered Autonomous Exploration (LAE), a novel algorithm for AX that attains a sample complexity of $\tilde{\mathcal{O}}(LS^{\rightarrow}_{L(1+\epsilon)}\Gamma_{L(1+\epsilon)} A \ln^{12}(S^{\rightarrow}_{L(1+\epsilon)})/\epsilon^2)$, where $S^{\rightarrow}_{L(1+\epsilon)}$ is the number of states that are incrementally $L(1+\epsilon)$-controllable, $A$ is the number of actions, and $\Gamma_{L(1+\epsilon)}$ is the branching factor of the transitions over such states.
no code implementations • 10 Oct 2022 • Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
We also initiate the study of learning $\epsilon$-optimal policies without access to a generative model (i.e., the so-called best-policy identification problem), and show that sample-efficient learning is impossible in general.
no code implementations • 26 May 2022 • Yan Dai, Haipeng Luo, Liyu Chen
More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022).
no code implementations • 25 May 2022 • Liyu Chen, Haipeng Luo
We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions.
no code implementations • 16 Feb 2022 • Sebastien M. R. Arnold, Pierre L'Ecuyer, Liyu Chen, Yi-fan Chen, Fei Sha
Reinforcement learning constantly deals with hard integrals, for example when computing expectations in policy evaluation and policy iteration.
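The expectations mentioned above are typically approximated by sampling. As a minimal illustration (not the paper's method), a plain Monte Carlo estimator of an expected return might look like the following; the return distribution here is a hypothetical example chosen only so the true expectation is known.

```python
import random

def mc_expectation(sample_return, num_samples=10000, seed=0):
    """Estimate E[G] by averaging i.i.d. sampled returns (plain Monte Carlo)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += sample_return(rng)
    return total / num_samples

# Hypothetical return distribution: reward 1 with probability 0.3, else 0,
# so the true expectation is 0.3.
est = mc_expectation(lambda rng: 1.0 if rng.random() < 0.3 else 0.0)
```

With $n$ i.i.d. samples the estimator's error shrinks at rate $O(1/\sqrt{n})$, which is exactly why variance-reduction techniques (such as the quasi-Monte Carlo ideas studied in this line of work) matter in policy evaluation.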
no code implementations • 7 Feb 2022 • Liyu Chen, Haipeng Luo, Aviv Rosenberg
Policy optimization is among the most popular and successful reinforcement learning algorithms, and there is increasing interest in understanding its theoretical guarantees.
no code implementations • 31 Jan 2022 • Liyu Chen, Rahul Jain, Haipeng Luo
We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints.
no code implementations • NeurIPS 2021 • Liyu Chen, Mehdi Jafarnia-Jahromi, Rahul Jain, Haipeng Luo
We introduce a generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model, which achieves minimax optimal regret as long as certain properties are ensured.
no code implementations • 9 Jun 2021 • Mehdi Jafarnia-Jahromi, Liyu Chen, Rahul Jain, Haipeng Luo
We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state.
no code implementations • 10 Feb 2021 • Liyu Chen, Haipeng Luo
Our work strictly improves on (Rosenberg and Mansour, 2020) in the full-information setting, extends (Chen et al., 2020) from known transition to unknown transition, and is also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition.
no code implementations • 1 Feb 2021 • Liyu Chen, Haipeng Luo, Chen-Yu Wei
We resolve the long-standing "impossible tuning" issue for the classic expert problem and show that it is in fact possible to achieve regret $O\left(\sqrt{(\ln d)\sum_t \ell_{t, i}^2}\right)$ simultaneously for every expert $i$ in a $T$-round $d$-expert problem, where $\ell_{t, i}$ is the loss of expert $i$ in round $t$.
no code implementations • 7 Dec 2020 • Liyu Chen, Haipeng Luo, Chen-Yu Wei
We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$ for the full-information setting and the bandit feedback setting respectively, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes.