no code implementations • 15 Jan 2024 • Shuze Liu, Shuhang Chen, Shangtong Zhang
Stochastic approximation is a class of algorithms that update a vector iteratively, incrementally, and stochastically, including, e.g., stochastic gradient descent and temporal difference learning.
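A minimal stochastic-approximation sketch may help fix ideas. The Robbins–Monro-style update below estimates the mean of a distribution from noisy samples; the function name and the Gaussian sampler are illustrative assumptions, not from the paper.

```python
import random

def robbins_monro_mean(sampler, steps=10000):
    """Estimate E[X] with the update x_{k+1} = x_k + alpha_k * (sample - x_k)."""
    x = 0.0
    for k in range(1, steps + 1):
        alpha = 1.0 / k          # diminishing step sizes, sum alpha_k = inf
        x += alpha * (sampler() - x)
    return x

random.seed(0)
est = robbins_monro_mean(lambda: random.gauss(3.0, 1.0))
```

With step sizes 1/k this update reproduces the running sample mean, the simplest member of the class; SGD and temporal difference learning replace the increment with a stochastic gradient or a TD error.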
1 code implementation • 7 Aug 2023 • Michaël Mathieu, Sherjil Ozair, Srivatsan Srinivasan, Caglar Gulcehre, Shangtong Zhang, Ray Jiang, Tom Le Paine, Richard Powell, Konrad Żołna, Julian Schrittwieser, David Choi, Petko Georgiev, Daniel Toyama, Aja Huang, Roman Ring, Igor Babuschkin, Timo Ewalds, Mahyar Bordbar, Sarah Henderson, Sergio Gómez Colmenarejo, Aäron van den Oord, Wojciech Marian Czarnecki, Nando de Freitas, Oriol Vinyals
StarCraft II is one of the most challenging simulated reinforcement learning environments: it is partially observable, stochastic, and multi-agent, and mastering it requires strategic planning over long time horizons alongside real-time low-level execution.
no code implementations • 2 Aug 2023 • Xiaochi Qian, Shangtong Zhang
Gradient Temporal Difference (GTD) learning is a powerful tool for addressing the deadly triad.
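As a sketch of the GTD family, here is one TDC-style two-timescale update with linear features; the feature vectors, constants, and function name are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, r, gamma, alpha, beta):
    """One TDC update: theta holds value weights, w an auxiliary estimate."""
    delta = r + gamma * (phi_next @ theta) - phi @ theta       # TD error
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * (delta - w @ phi) * phi                     # track E[delta | phi]
    return theta, w

theta, w = np.zeros(2), np.zeros(2)
phi, phi_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta, w = tdc_update(theta, w, phi, phi_next, r=1.0, gamma=0.9, alpha=0.1, beta=0.1)
```

Unlike semi-gradient TD, the correction term involving `w` makes the update follow the gradient of a projected Bellman error objective, which is what restores stability under the deadly triad.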
no code implementations • 31 Jan 2023 • Shuze Liu, Shangtong Zhang
Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators, whether for hyperparameter tuning or for testing different algorithmic design choices: the policy is repeatedly executed in the environment and the outcomes are averaged.
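The online Monte Carlo estimator described here amounts to averaging returns over repeated rollouts. The sketch below assumes a stand-in `run_episode` callable (here a noisy stub rather than a real environment):

```python
import random

def monte_carlo_evaluate(run_episode, n_episodes=1000):
    """Average the return of n_episodes independent policy rollouts."""
    returns = [run_episode() for _ in range(n_episodes)]
    return sum(returns) / len(returns)

random.seed(0)
# Stub episode: true value 1.0 plus Gaussian noise standing in for
# environment stochasticity.
value = monte_carlo_evaluate(lambda: random.gauss(1.0, 0.5))
```

The estimator is unbiased, but its variance shrinks only as 1/n, which is exactly the data-inefficiency the paper targets.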
no code implementations • 14 Feb 2022 • Shangtong Zhang, Remi Tachet, Romain Laroche
SARSA, a classical on-policy control algorithm for reinforcement learning, is known to chatter when combined with linear function approximation: SARSA does not diverge but oscillates in a bounded region.
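For reference, a single SARSA update with linear function approximation, q(s, a) ≈ w·φ(s, a); the fixed feature vectors and constants below are placeholders for illustration only.

```python
import numpy as np

def sarsa_update(w, phi_sa, r, phi_next_sa, gamma=0.9, alpha=0.5):
    """One on-policy SARSA step: bootstrap from the action actually taken."""
    td_error = r + gamma * (w @ phi_next_sa) - w @ phi_sa
    return w + alpha * td_error * phi_sa

w = np.zeros(2)
phi_sa, phi_next_sa = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = sarsa_update(w, phi_sa, r=1.0, phi_next_sa=phi_next_sa)
```

Because the greedy policy induced by `w` changes as `w` moves, the iterates can cycle between policies rather than settle, which is the "chattering" behavior the paper analyzes.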
1 code implementation • NeurIPS 2023 • Shangtong Zhang, Remi Tachet, Romain Laroche
In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy.
1 code implementation • 11 Aug 2021 • Shangtong Zhang, Shimon Whiteson
Despite the theoretical success of emphatic TD methods in addressing the notorious deadly triad of off-policy RL, there are still two open problems.
no code implementations • 12 Jul 2021 • Ray Jiang, Shangtong Zhang, Veronica Chelu, Adam White, Hado van Hasselt
We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed $n$-step TD learning algorithm to learn the required emphatic weighting.
1 code implementation • 21 Jan 2021 • Shangtong Zhang, Hengshuai Yao, Shimon Whiteson
The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously.
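A toy illustration of the triad (an assumed example in the spirit of the classic "w, 2w" counterexample, not from the paper): combining off-policy updating, linear function approximation, and bootstrapping makes the single weight below grow without bound.

```python
def semi_gradient_td_step(w, gamma=0.99, alpha=0.1):
    """States s1, s2 with features 1 and 2, reward 0, s1 -> s2 deterministically.
    Off-policy data updates only s1 while bootstrapping from v(s2) = 2w."""
    td_error = 0.0 + gamma * (2 * w) - (1 * w)
    return w + alpha * td_error * 1.0   # semi-gradient: feature of s1 is 1

w = 1.0
for _ in range(100):
    w = semi_gradient_td_step(w)
# w has blown up: each step multiplies it by 1 + alpha * (2*gamma - 1) > 1.
```

Removing any one leg of the triad (on-policy sampling, tabular values, or Monte Carlo targets) restores stability in this example.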
1 code implementation • 8 Jan 2021 • Shangtong Zhang, Yi Wan, Richard S. Sutton, Shimon Whiteson
We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function.
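In the average-reward setting, the tabular on-policy analogue of this problem is differential TD learning, where the reward-rate estimate and the differential values share a TD error. The minimal form below is an illustrative assumption, not the paper's off-policy FA algorithm.

```python
def differential_td_step(v, r_bar, s, r, s_next, alpha=0.1, eta=0.1):
    """One tabular differential TD step for average-reward evaluation."""
    delta = r - r_bar + v[s_next] - v[s]   # differential TD error
    v[s] += alpha * delta                  # update differential value
    r_bar += eta * alpha * delta           # update reward-rate estimate
    return v, r_bar

v, r_bar = [0.0, 0.0], 0.0
v, r_bar = differential_td_step(v, r_bar, s=0, r=1.0, s_next=1)
```

The paper's contribution is extending this kind of joint estimation to the off-policy, function-approximation regime.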
1 code implementation • 2 Oct 2020 • Shangtong Zhang, Romain Laroche, Harm van Seijen, Shimon Whiteson, Remi Tachet des Combes
In the second scenario, we consider optimizing a discounted objective ($\gamma < 1$) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective and provide supporting empirical results.
1 code implementation • NeurIPS 2020 • Shangtong Zhang, Vivek Veeriah, Shimon Whiteson
We present a Reverse Reinforcement Learning (Reverse RL) approach for representing retrospective knowledge.
1 code implementation • 22 Apr 2020 • Shangtong Zhang, Bo Liu, Shimon Whiteson
We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite horizon MDP optimizing the variance of a per-step reward random variable.
1 code implementation • ICML 2020 • Shangtong Zhang, Bo Liu, Shimon Whiteson
Namely, the optimization problem in GenDICE is not a convex-concave saddle-point problem once nonlinearity in optimization variable parameterization is introduced to ensure positivity, so any primal-dual algorithm is not guaranteed to converge or find the desired solution.
1 code implementation • ICML 2020 • Shangtong Zhang, Bo Liu, Hengshuai Yao, Shimon Whiteson
With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC, where the critics are linear and the actor can be nonlinear.
no code implementations • 13 May 2019 • Borislav Mavrin, Shangtong Zhang, Hengshuai Yao, Linglong Kong, Kaiwen Wu, Yao-Liang Yu
In distributional reinforcement learning (RL), the estimated distribution of the value function models both the parametric and intrinsic uncertainties.
1 code implementation • 12 May 2019 • Yuhang Song, Jianyi Wang, Thomas Lukasiewicz, Zhenghua Xu, Shangtong Zhang, Andrzej Wojcicki, Mai Xu
Intrinsic rewards were introduced to simulate how human intelligence works; they are usually evaluated by intrinsically motivated play, i.e., playing games without extrinsic rewards while still being evaluated with extrinsic rewards.
1 code implementation • 3 May 2019 • Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
We revisit residual algorithms in both model-free and model-based reinforcement learning settings.
1 code implementation • NeurIPS 2019 • Shangtong Zhang, Shimon Whiteson
We reformulate the option framework as two parallel augmented MDPs.
1 code implementation • NeurIPS 2019 • Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting.
1 code implementation • 6 Nov 2018 • Shangtong Zhang, Hao Chen, Hengshuai Yao
In this paper, we propose an actor ensemble algorithm, named ACE, for continuous control with a deterministic policy in reinforcement learning.
3 code implementations • 5 Nov 2018 • Shangtong Zhang, Borislav Mavrin, Linglong Kong, Bo Liu, Hengshuai Yao
In this paper, we propose the Quantile Option Architecture (QUOTA) for exploration based on recent advances in distributional reinforcement learning (RL).
1 code implementation • Journal of Open Source Software 2018 • Ryan R. Curtin, Marcus Edel, Mikhail Lozhnikov, Yannis Mentekidis, Sumedh Ghaisas, Shangtong Zhang
In the past several years, the field of machine learning has seen an explosion of interest and excitement, with hundreds or thousands of algorithms developed for different tasks every year.
4 code implementations • 4 Dec 2017 • Shangtong Zhang, Richard S. Sutton
Experience replay has recently been widely used in various deep reinforcement learning (RL) algorithms; in this paper, we rethink its utility.
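The mechanism under discussion can be sketched in a few lines (an illustrative buffer, not the paper's code): transitions are stored and later sampled uniformly in mini-batches, which decorrelates consecutive updates.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform experience-replay buffer."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within the batch.
        return random.sample(self.buffer, batch_size)

random.seed(0)
buf = ReplayBuffer()
for t in range(100):
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
```

A key design parameter is the buffer capacity: it trades off data reuse against the staleness of old transitions, which is central to any analysis of replay's utility.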
no code implementations • 30 Nov 2017 • Shangtong Zhang, Osmar R. Zaiane
Reinforcement learning and evolutionary strategies are two major approaches to addressing complicated control problems.
no code implementations • 9 Dec 2016 • Vivek Veeriah, Shangtong Zhang, Richard S. Sutton
In this paper, we introduce a new incremental learning algorithm, called crossprop, which learns the incoming weights of hidden units via meta-gradient descent, an approach previously introduced by Sutton (1992) and Schraudolph (1999) for learning step-sizes.