no code implementations • 19 Apr 2024 • Jianliang He, Han Zhong, Zhuoran Yang
Moreover, for AMDPs, we propose a novel complexity measure -- average-reward generalized eluder coefficient (AGEC) -- which captures the challenge of exploration in AMDPs with general function approximation.
no code implementations • 4 Apr 2024 • Miao Lu, Han Zhong, Tong Zhang, Jose Blanchet
Unlike previous work, which relies on a generative model or a pre-collected offline dataset enjoying good coverage of the deployment environment, we tackle robust RL via interactive data collection, where the learner interacts with the training environment only and refines the policy through trial and error.
1 code implementation • 15 Feb 2024 • Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, Jianshu Chen
We consider the problem of multi-objective alignment of foundation models with human preferences, which is a critical step towards helpful and harmless AI systems.
no code implementations • 28 Dec 2023 • Guhao Feng, Han Zhong
We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension.
1 code implementation • 18 Dec 2023 • Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang
This includes an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios.
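As background for the DPO component, here is a minimal sketch of the standard single-pair DPO objective (not the paper's iterative variant; the function name and arguments are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l      : policy log-probs of the chosen / rejected response
    ref_logp_w / ref_logp_l : reference-model log-probs of the same responses
    beta                 : inverse-temperature on the implicit reward margin
    """
    # implicit reward margin between chosen and rejected responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # negative log-sigmoid of the margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference model the margin is zero and the loss equals log 2; raising the chosen response's log-probability lowers the loss.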
no code implementations • 7 Dec 2023 • Jiayi Huang, Han Zhong, LiWei Wang, Lin F. Yang
To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed UCRL-WVTR, that is both \emph{horizon-free} and \emph{instance-dependent}: it eliminates the polynomial dependency on the planning horizon.
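UCRL-WVTR builds on weighted value-targeted regression; as a rough illustration of the weighted-regression primitive only (a one-dimensional closed-form sketch with made-up names, not the algorithm's actual multi-level weighting scheme):

```python
def weighted_ridge_1d(xs, ys, weights, lam=1.0):
    """Closed-form weighted ridge regression in one dimension:
    theta = (sum_i w_i x_i y_i) / (sum_i w_i x_i^2 + lam).

    Down-weighting high-variance targets (small w_i) is the basic
    mechanism behind variance-aware regression estimators.
    """
    num = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    den = sum(w * x * x for w, x in zip(weights, xs)) + lam
    return num / den
```

With uniform weights and no regularization this reduces to ordinary least squares; a positive `lam` shrinks the estimate toward zero.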
2 code implementations • 19 Oct 2023 • Rui Yang, Han Zhong, Jiawei Xu, Amy Zhang, Chongjie Zhang, Lei Han, Tong Zhang
Offline reinforcement learning (RL) presents a promising approach for learning effective policies from offline datasets without the need for costly or unsafe interactions with the environment.
no code implementations • NeurIPS 2023 • Jiayi Huang, Han Zhong, LiWei Wang, Lin F. Yang
Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K})$.
1 code implementation • NeurIPS 2023 • Zhihan Liu, Miao Lu, Wei Xiong, Han Zhong, Hao Hu, Shenao Zhang, Sirui Zheng, Zhuoran Yang, Zhaoran Wang
To achieve this, existing sample-efficient online RL algorithms typically consist of three components: estimation, planning, and exploration.
no code implementations • 21 Feb 2023 • Han Zhong, Jiachen Hu, Yecheng Xue, Tongyang Li, LiWei Wang
While quantum reinforcement learning (RL) has attracted a surge of attention recently, its theoretical understanding is limited.
no code implementations • NeurIPS 2023 • Yunchang Yang, Han Zhong, Tianhao Wu, Bin Liu, LiWei Wang, Simon S. Du
We study stochastic delayed feedback in general multi-agent sequential decision making, which includes bandits, single-agent Markov decision processes (MDPs), and Markov games (MGs).
no code implementations • 3 Nov 2022 • Han Zhong, Wei Xiong, Sirui Zheng, LiWei Wang, Zhaoran Wang, Zhuoran Yang, Tong Zhang
The proposed algorithm modifies the standard posterior sampling algorithm in two aspects: (i) we use an optimistic prior distribution that biases towards hypotheses with higher values, and (ii) the log-likelihood function is set to be the empirical loss evaluated on the historical data, where the choice of loss function supports both model-free and model-based learning.
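A toy instantiation of these two modifications over a finite hypothesis class might look as follows (the function names and the scalar temperature `eta` are illustrative; the actual framework operates over general function classes):

```python
import math
import random

def optimistic_posterior_sample(hypotheses, value, nll, eta=1.0):
    """Sample a hypothesis f with probability proportional to
    exp(eta * value(f)) * exp(-nll(f)):
    an optimistic prior biased toward high-value hypotheses,
    multiplied by the exponentiated negative empirical loss.
    """
    # log-weights, shifted by their max for numerical stability
    logw = [eta * value(f) - nll(f) for f in hypotheses]
    m = max(logw)
    w = [math.exp(x - m) for x in logw]
    # inverse-CDF sampling from the normalized weights
    r = random.random() * sum(w)
    acc = 0.0
    for f, wi in zip(hypotheses, w):
        acc += wi
        if r <= acc:
            return f
    return hypotheses[-1]
```

With a large value bonus the sampler concentrates on the optimistic (high-value) hypothesis, which is the intended exploration effect.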
no code implementations • 27 Oct 2022 • Jiachen Hu, Han Zhong, Chi Jin, LiWei Wang
Sim-to-real transfer trains RL agents in simulated environments and then deploys them in the real world.
no code implementations • 4 Oct 2022 • Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, Tong Zhang
Existing studies on provably efficient algorithms for Markov games (MGs) almost exclusively build on the "optimism in the face of uncertainty" (OFU) principle.
no code implementations • 31 May 2022 • Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, LiWei Wang, Tong Zhang
We also extend our techniques to the two-player zero-sum Markov games (MGs), and establish a new performance lower bound for MGs, which tightens the existing result, and verifies the nearly minimax optimality of the proposed algorithm.
no code implementations • 27 May 2022 • Binghui Li, Jikai Jin, Han Zhong, John E. Hopcroft, LiWei Wang
Moreover, we establish an improved upper bound of $\exp({\mathcal{O}}(k))$ for the network size to achieve low robust generalization error when the data lies on a manifold with intrinsic dimension $k$ ($k \ll d$).
no code implementations • 23 May 2022 • Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, LiWei Wang
To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
no code implementations • 15 Feb 2022 • Han Zhong, Wei Xiong, Jiyuan Tan, LiWei Wang, Tong Zhang, Zhaoran Wang, Zhuoran Yang
When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving.
no code implementations • 27 Dec 2021 • Han Zhong, Zhuoran Yang, Zhaoran Wang, Michael I. Jordan
We develop sample-efficient reinforcement learning (RL) algorithms for solving for an SNE in both online and offline settings.
no code implementations • 21 Dec 2021 • Tianhao Wu, Yunchang Yang, Han Zhong, LiWei Wang, Simon S. Du, Jiantao Jiao
Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms.
no code implementations • NeurIPS 2021 • Han Zhong, Jiayi Huang, Lin F. Yang, LiWei Wang
Despite a large amount of effort in dealing with heavy-tailed error in machine learning, little is known about the case where moments of the error may not exist: the random noise $\eta$ satisfies $\Pr\left[|\eta| > |y|\right] \le 1/|y|^{\alpha}$ for some $\alpha > 0$.
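To see why such noise is problematic, one can sample from a Pareto-type tail that meets this condition with equality; for $\alpha \le 1$ even the first moment is infinite (a small illustrative experiment; all names are mine):

```python
import random

def pareto_tail_sample(alpha, rng):
    """Inverse-CDF draw from a Pareto tail: Pr[X > y] = y**(-alpha) for y >= 1.
    For alpha <= 1 the mean of X is infinite, so even the first moment
    of the noise fails to exist."""
    u = rng.random()                      # uniform on [0, 1)
    return (1.0 - u) ** (-1.0 / alpha)

rng = random.Random(0)
alpha = 0.5                               # mean does not exist for this alpha
xs = [pareto_tail_sample(alpha, rng) for _ in range(100_000)]
# empirical tail frequency Pr[X > 10] should be close to 10**(-0.5) ~ 0.316
tail = sum(x > 10 for x in xs) / len(xs)
```

The empirical tail frequency matches $y^{-\alpha}$, confirming the samples satisfy the stated tail condition.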
no code implementations • 18 Oct 2021 • Han Zhong, Zhuoran Yang, Zhaoran Wang, Csaba Szepesvári
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs).
no code implementations • 29 Sep 2021 • Han Zhong, Zhuoran Yang, Zhaoran Wang, Michael Jordan
To the best of our knowledge, we establish the first provably efficient RL algorithms for solving SNE in general-sum Markov games with leader-controlled state transitions.
no code implementations • ICLR 2022 • Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, LiWei Wang, Simon S. Du
We also obtain a new upper bound for conservative low-rank MDPs.
no code implementations • 28 Dec 2020 • Han Zhong, Xun Deng, Ethan X. Fang, Zhuoran Yang, Zhaoran Wang, Runze Li
In particular, we focus on a variance-constrained policy optimization problem where the goal is to find a policy that maximizes the expected value of the long-run average reward, subject to a constraint that the long-run variance of the average reward is upper bounded by a threshold.
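In symbols, the variance-constrained problem described above can be written as follows (the notation $J$, $\Lambda$, $\lambda$, $\mu$ is ours, chosen only to make the constraint explicit):

```latex
\max_{\pi} \; J(\pi)
\quad \text{s.t.} \quad \Lambda(\pi) \le \lambda,
\qquad \text{where} \quad
J(\pi) = \lim_{T \to \infty} \frac{1}{T}\,
         \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} r_t\Big]
```

with $\Lambda(\pi)$ the long-run variance of the average reward and $\lambda$ the threshold. A standard route to such problems is the Lagrangian relaxation $\mathcal{L}(\pi, \mu) = J(\pi) - \mu \big(\Lambda(\pi) - \lambda\big)$ with multiplier $\mu \ge 0$, though the paper's exact construction may differ.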