Search Results for author: Han Zhong

Found 25 papers, 4 papers with code

Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation

no code implementations • 19 Apr 2024 • Jianliang He, Han Zhong, Zhuoran Yang

Moreover, for AMDPs, we propose a novel complexity measure -- average-reward generalized eluder coefficient (AGEC) -- which captures the challenge of exploration in AMDPs with general function approximation.

Distributionally Robust Reinforcement Learning with Interactive Data Collection: Fundamental Hardness and Near-Optimal Algorithm

no code implementations • 4 Apr 2024 • Miao Lu, Han Zhong, Tong Zhang, Jose Blanchet

Unlike previous work, which relies on a generative model or a pre-collected offline dataset enjoying good coverage of the deployment environment, we tackle robust RL via interactive data collection, where the learner interacts with the training environment only and refines the policy through trial and error.

Reinforcement Learning (RL)

Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment

1 code implementation • 15 Feb 2024 • Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, Jianshu Chen

We consider the problem of multi-objective alignment of foundation models with human preferences, which is a critical step towards helpful and harmless AI systems.

Reinforcement Learning (RL)

Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity

no code implementations • 28 Dec 2023 • Guhao Feng, Han Zhong

We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension.

Reinforcement Learning (RL)
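To make the representation claim in the entry above concrete, the following is a minimal PyTorch sketch of the function class it refers to: an MLP with a constant number of layers and a hidden width that is polynomial in the input dimension. The depth, polynomial degree, and output size are illustrative assumptions, not the paper's construction.

    import torch
    import torch.nn as nn

    class ConstantDepthMLP(nn.Module):
        """Constant-depth MLP whose hidden width grows only polynomially in
        the input dimension; all sizes here are illustrative."""

        def __init__(self, input_dim: int, depth: int = 3, poly_degree: int = 2):
            super().__init__()
            hidden = input_dim ** poly_degree  # polynomial hidden dimension
            layers = [nn.Linear(input_dim, hidden), nn.ReLU()]
            for _ in range(depth - 2):  # constant number of hidden layers
                layers += [nn.Linear(hidden, hidden), nn.ReLU()]
            layers.append(nn.Linear(hidden, input_dim))  # e.g. next-state features
            self.net = nn.Sequential(*layers)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)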

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

1 code implementation • 18 Dec 2023 • Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang

This includes an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios.

Language Modelling, Large Language Model
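Since the entry above centers on an iterative version of DPO, a minimal sketch of the DPO objective that each round would re-optimize on freshly collected preference pairs may help. The argument tensors and the value of beta are placeholders, not the paper's exact setup.

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """Standard DPO loss on a batch of preference pairs; each argument
        is a tensor of per-sequence log-probabilities. An iterative scheme
        recomputes this loss each round on responses sampled from the
        current policy."""
        chosen = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected = beta * (policy_rejected_logps - ref_rejected_logps)
        # Maximize the implicit reward margin between the preferred and
        # dispreferred responses.
        return -F.logsigmoid(chosen - rejected).mean()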

Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation

no code implementations • 7 Dec 2023 • Jiayi Huang, Han Zhong, Liwei Wang, Lin F. Yang

To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed UCRL-WVTR, that achieves both \emph{horizon-free} and \emph{instance-dependent} regret bounds, as it eliminates the polynomial dependency on the planning horizon.

Regression

Towards Robust Offline Reinforcement Learning under Diverse Data Corruption

2 code implementations • 19 Oct 2023 • Rui Yang, Han Zhong, Jiawei Xu, Amy Zhang, Chongjie Zhang, Lei Han, Tong Zhang

Offline reinforcement learning (RL) presents a promising approach for learning effective policies from offline datasets without the need for costly or unsafe interactions with the environment.

Offline RL, Q-Learning +2

Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

no code implementations • NeurIPS 2023 • Jiayi Huang, Han Zhong, Liwei Wang, Lin F. Yang

Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K})$.

Reinforcement Learning (RL)

Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration

1 code implementation • NeurIPS 2023 • Zhihan Liu, Miao Lu, Wei Xiong, Han Zhong, Hao Hu, Shenao Zhang, Sirui Zheng, Zhuoran Yang, Zhaoran Wang

To achieve this, existing sample-efficient online RL algorithms typically consist of three components: estimation, planning, and exploration.

A Reduction-based Framework for Sequential Decision Making with Delayed Feedback

no code implementations • NeurIPS 2023 • Yunchang Yang, Han Zhong, Tianhao Wu, Bin Liu, Liwei Wang, Simon S. Du

We study stochastic delayed feedback in general multi-agent sequential decision making, which includes bandits, single-agent Markov decision processes (MDPs), and Markov games (MGs).

Decision Making

GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond

no code implementations • 3 Nov 2022 • Han Zhong, Wei Xiong, Sirui Zheng, Liwei Wang, Zhaoran Wang, Zhuoran Yang, Tong Zhang

The proposed algorithm modifies the standard posterior sampling algorithm in two aspects: (i) we use an optimistic prior distribution that biases towards hypotheses with higher values, and (ii) the log-likelihood function is set to be the empirical loss evaluated on the historical data, where the choice of loss function supports both model-free and model-based learning.

Decision Making, Reinforcement Learning (RL)
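As a toy illustration of the two modifications described in the entry above, restricted to a finite hypothesis class (the paper works in far greater generality), one can sample hypotheses as follows; the exponential weighting and the parameter gamma are our expository assumptions.

    import numpy as np

    def optimistic_posterior_sample(hypotheses, values, losses, gamma=1.0, rng=None):
        """Sample a hypothesis with probability proportional to
        exp(gamma * value - loss): exp(gamma * value) plays the role of the
        optimistic prior, and -loss stands in for the log-likelihood
        evaluated on the historical data."""
        if rng is None:
            rng = np.random.default_rng()
        log_w = gamma * np.asarray(values) - np.asarray(losses)
        log_w -= log_w.max()  # shift by the max for numerical stability
        probs = np.exp(log_w)
        probs /= probs.sum()
        return hypotheses[rng.choice(len(hypotheses), p=probs)]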

Provable Sim-to-real Transfer in Continuous Domain with Partial Observations

no code implementations • 27 Oct 2022 • Jiachen Hu, Han Zhong, Chi Jin, Liwei Wang

Sim-to-real transfer trains RL agents in simulated environments and then deploys them in the real world.

A Self-Play Posterior Sampling Algorithm for Zero-Sum Markov Games

no code implementations • 4 Oct 2022 • Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, Tong Zhang

Existing studies on provably efficient algorithms for Markov games (MGs) almost exclusively build on the "optimism in the face of uncertainty" (OFU) principle.

Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game

no code implementations • 31 May 2022 • Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, Liwei Wang, Tong Zhang

We also extend our techniques to two-player zero-sum Markov games (MGs) and establish a new performance lower bound for MGs, which tightens the existing result and verifies the near-minimax optimality of the proposed algorithm.

Offline RL, Reinforcement Learning (RL)

Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power

no code implementations • 27 May 2022 • Binghui Li, Jikai Jin, Han Zhong, John E. Hopcroft, Liwei Wang

Moreover, we establish an improved upper bound of $\exp({\mathcal{O}}(k))$ for the network size to achieve low robust generalization error when the data lies on a manifold with intrinsic dimension $k$ ($k \ll d$).

Binary Classification

Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets

no code implementations • 15 Feb 2022 • Han Zhong, Wei Xiong, Jiyuan Tan, Liwei Wang, Tong Zhang, Zhaoran Wang, Zhuoran Yang

When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving.

Can Reinforcement Learning Find Stackelberg-Nash Equilibria in General-Sum Markov Games with Myopic Followers?

no code implementations • 27 Dec 2021 • Han Zhong, Zhuoran Yang, Zhaoran Wang, Michael I. Jordan

We develop sample-efficient reinforcement learning (RL) algorithms for finding a Stackelberg-Nash equilibrium (SNE) in both online and offline settings.

Reinforcement Learning (RL)

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

no code implementations • 21 Dec 2021 • Tianhao Wu, Yunchang Yang, Han Zhong, Liwei Wang, Simon S. Du, Jiantao Jiao

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms.

Reinforcement Learning (RL)

Breaking the Moments Condition Barrier: No-Regret Algorithm for Bandits with Super Heavy-Tailed Payoffs

no code implementations • NeurIPS 2021 • Han Zhong, Jiayi Huang, Lin F. Yang, Liwei Wang

Despite a large amount of effort in dealing with heavy-tailed error in machine learning, little is known about the case where moments of the error can be non-existent: the random noise $\eta$ satisfies $\Pr\left[|\eta| > |y|\right] \le 1/|y|^{\alpha}$ for some $\alpha > 0$.
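To see how hostile this tail condition is, the toy experiment below draws symmetric Pareto-type noise with tail index $\alpha < 1$, for which even the mean fails to exist; the clipped-mean estimator and its threshold are a classical illustrative device, not the paper's algorithm.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, n, mu = 0.8, 10_000, 1.0  # tail index < 1: no finite moments

    # Symmetric noise whose tail satisfies Pr[|eta| > y] ~ y**(-alpha).
    eta = rng.pareto(alpha, size=n) * rng.choice([-1.0, 1.0], size=n)
    rewards = mu + eta

    # The empirical mean is wildly unstable under such noise; a clipped mean
    # that exploits the symmetry of the noise recovers mu far more reliably.
    tau = n ** (1 / (1 + alpha))  # illustrative truncation level
    print("naive mean:    ", rewards.mean())
    print("truncated mean:", np.clip(rewards, -tau, tau).mean())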

Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs

no code implementations • 18 Oct 2021 • Han Zhong, Zhuoran Yang, Zhaoran Wang, Csaba Szepesvári

We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs).

Reinforcement Learning (RL)

Can Reinforcement Learning Efficiently Find Stackelberg-Nash Equilibria in General-Sum Markov Games?

no code implementations • 29 Sep 2021 • Han Zhong, Zhuoran Yang, Zhaoran Wang, Michael I. Jordan

To the best of our knowledge, these are the first provably efficient RL algorithms for finding SNEs in general-sum Markov games with leader-controlled state transitions.

Reinforcement Learning (RL)

Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy

no code implementations • 28 Dec 2020 • Han Zhong, Xun Deng, Ethan X. Fang, Zhuoran Yang, Zhaoran Wang, Runze Li

In particular, we focus on a variance-constrained policy optimization problem where the goal is to find a policy that maximizes the expected value of the long-run average reward, subject to a constraint that the long-run variance of the average reward is upper bounded by a threshold.

Reinforcement Learning (RL)
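In symbols, and in our notation rather than necessarily the paper's, the constrained problem in the entry above reads
$$\max_{\pi} \; J(\pi) := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big] \quad \text{s.t.} \quad \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} \big(r(s_t, a_t) - J(\pi)\big)^2\Big] \le \lambda,$$
where $\lambda > 0$ is the prescribed variance threshold.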
