no code implementations • 15 Jan 2024 • Shuze Liu, Shuhang Chen, Shangtong Zhang
Stochastic approximation is a class of algorithms that update a vector iteratively, incrementally, and stochastically, including, e.g., stochastic gradient descent and temporal difference learning.
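A minimal stochastic-approximation sketch may help fix ideas. The Robbins–Monro-style update below estimates the mean of a distribution from noisy samples; the function name and the Gaussian sampler are illustrative assumptions, not from the paper.

```python
import random

def robbins_monro_mean(sampler, steps=10000):
    """Estimate E[X] with the update x_{k+1} = x_k + alpha_k * (sample - x_k)."""
    x = 0.0
    for k in range(1, steps + 1):
        alpha = 1.0 / k          # diminishing step sizes, sum alpha_k = inf
        x += alpha * (sampler() - x)
    return x

random.seed(0)
est = robbins_monro_mean(lambda: random.gauss(3.0, 1.0))
```

With step sizes 1/k this update reproduces the running sample mean, the simplest member of the class; SGD and temporal difference learning replace the increment with a stochastic gradient or a TD error.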
1 code implementation • 7 Aug 2023 • Michaël Mathieu, Sherjil Ozair, Srivatsan Srinivasan, Caglar Gulcehre, Shangtong Zhang, Ray Jiang, Tom Le Paine, Richard Powell, Konrad Żołna, Julian Schrittwieser, David Choi, Petko Georgiev, Daniel Toyama, Aja Huang, Roman Ring, Igor Babuschkin, Timo Ewalds, Mahyar Bordbar, Sarah Henderson, Sergio Gómez Colmenarejo, Aäron van den Oord, Wojciech Marian Czarnecki, Nando de Freitas, Oriol Vinyals
StarCraft II is one of the most challenging simulated reinforcement learning environments: it is partially observable, stochastic, and multi-agent, and mastering it requires strategic planning over long time horizons alongside real-time low-level execution.
no code implementations • 2 Aug 2023 • Xiaochi Qian, Shangtong Zhang
Gradient Temporal Difference (GTD) learning is a powerful tool for addressing the deadly triad.
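As a sketch of the GTD family, here is one TDC-style two-timescale update with linear features; the feature vectors, constants, and function name are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, r, gamma, alpha, beta):
    """One TDC update: theta holds value weights, w an auxiliary estimate."""
    delta = r + gamma * (phi_next @ theta) - phi @ theta       # TD error
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * (delta - w @ phi) * phi                     # track E[delta | phi]
    return theta, w

theta, w = np.zeros(2), np.zeros(2)
phi, phi_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta, w = tdc_update(theta, w, phi, phi_next, r=1.0, gamma=0.9, alpha=0.1, beta=0.1)
```

Unlike semi-gradient TD, the correction term involving `w` makes the update follow the gradient of a projected Bellman error objective, which is what restores stability under the deadly triad.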
no code implementations • 31 Jan 2023 • Shuze Liu, Shangtong Zhang
Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators, whether for hyperparameter tuning or for testing different algorithmic design choices: the policy is repeatedly executed in the environment and the outcomes are averaged.
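The online Monte Carlo estimator described here amounts to averaging returns over repeated rollouts. The sketch below assumes a stand-in `run_episode` callable (here a noisy stub rather than a real environment):

```python
import random

def monte_carlo_evaluate(run_episode, n_episodes=1000):
    """Average the return of n_episodes independent policy rollouts."""
    returns = [run_episode() for _ in range(n_episodes)]
    return sum(returns) / len(returns)

random.seed(0)
# Stub episode: true value 1.0 plus Gaussian noise standing in for
# environment stochasticity.
value = monte_carlo_evaluate(lambda: random.gauss(1.0, 0.5))
```

The estimator is unbiased, but its variance shrinks only as 1/n, which is exactly the data-inefficiency the paper targets.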
no code implementations • 14 Feb 2022 • Shangtong Zhang, Remi Tachet, Romain Laroche
SARSA, a classical on-policy control algorithm for reinforcement learning, is known to chatter when combined with linear function approximation: SARSA does not diverge but oscillates in a bounded region.
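For reference, a single SARSA update with linear function approximation, q(s, a) ≈ w·φ(s, a); the fixed feature vectors and constants below are placeholders for illustration only.

```python
import numpy as np

def sarsa_update(w, phi_sa, r, phi_next_sa, gamma=0.9, alpha=0.5):
    """One on-policy SARSA step: bootstrap from the action actually taken."""
    td_error = r + gamma * (w @ phi_next_sa) - w @ phi_sa
    return w + alpha * td_error * phi_sa

w = np.zeros(2)
phi_sa, phi_next_sa = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = sarsa_update(w, phi_sa, r=1.0, phi_next_sa=phi_next_sa)
```

Because the greedy policy induced by `w` changes as `w` moves, the iterates can cycle between policies rather than settle, which is the "chattering" behavior the paper analyzes.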
1 code implementation • NeurIPS 2023 • Shangtong Zhang, Remi Tachet, Romain Laroche
In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy.
1 code implementation • 11 Aug 2021 • Shangtong Zhang, Shimon Whiteson
Despite the theoretical success of emphatic TD methods in addressing the notorious deadly triad of off-policy RL, there are still two open problems.
no code implementations • 12 Jul 2021 • Ray Jiang, Shangtong Zhang, Veronica Chelu, Adam White, Hado van Hasselt
We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed $n$-step TD learning algorithm to learn the required emphatic weighting.
1 code implementation • 21 Jan 2021 • Shangtong Zhang, Hengshuai Yao, Shimon Whiteson
The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously.
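A toy illustration of the triad (an assumed example in the spirit of the classic "w, 2w" counterexample, not from the paper): combining off-policy updating, linear function approximation, and bootstrapping makes the single weight below grow without bound.

```python
def semi_gradient_td_step(w, gamma=0.99, alpha=0.1):
    """States s1, s2 with features 1 and 2, reward 0, s1 -> s2 deterministically.
    Off-policy data updates only s1 while bootstrapping from v(s2) = 2w."""
    td_error = 0.0 + gamma * (2 * w) - (1 * w)
    return w + alpha * td_error * 1.0   # semi-gradient: feature of s1 is 1

w = 1.0
for _ in range(100):
    w = semi_gradient_td_step(w)
# w has blown up: each step multiplies it by 1 + alpha * (2*gamma - 1) > 1.
```

Removing any one leg of the triad (on-policy sampling, tabular values, or Monte Carlo targets) restores stability in this example.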
1 code implementation • 8 Jan 2021 • Shangtong Zhang, Yi Wan, Richard S. Sutton, Shimon Whiteson
We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function.
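In the average-reward setting, the tabular on-policy analogue of this problem is differential TD learning, where the reward-rate estimate and the differential values share a TD error. The minimal form below is an illustrative assumption, not the paper's off-policy FA algorithm.

```python
def differential_td_step(v, r_bar, s, r, s_next, alpha=0.1, eta=0.1):
    """One tabular differential TD step for average-reward evaluation."""
    delta = r - r_bar + v[s_next] - v[s]   # differential TD error
    v[s] += alpha * delta                  # update differential value
    r_bar += eta * alpha * delta           # update reward-rate estimate
    return v, r_bar

v, r_bar = [0.0, 0.0], 0.0
v, r_bar = differential_td_step(v, r_bar, s=0, r=1.0, s_next=1)
```

The paper's contribution is extending this kind of joint estimation to the off-policy, function-approximation regime.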
1 code implementation • 2 Oct 2020 • Shangtong Zhang, Romain Laroche, Harm van Seijen, Shimon Whiteson, Remi Tachet des Combes
In the second scenario, we consider optimizing a discounted objective ($\gamma < 1$) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective and provide supporting empirical results.
1 code implementation • NeurIPS 2020 • Shangtong Zhang, Vivek Veeriah, Shimon Whiteson
We present a Reverse Reinforcement Learning (Reverse RL) approach for representing retrospective knowledge.
1 code implementation • 22 Apr 2020 • Shangtong Zhang, Bo Liu, Shimon Whiteson
We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite horizon MDP optimizing the variance of a per-step reward random variable.
1 code implementation • ICML 2020 • Shangtong Zhang, Bo Liu, Shimon Whiteson
Namely, the optimization problem in GenDICE is not a convex-concave saddle-point problem once nonlinearity in optimization variable parameterization is introduced to ensure positivity, so any primal-dual algorithm is not guaranteed to converge or find the desired solution.
1 code implementation • ICML 2020 • Shangtong Zhang, Bo Liu, Hengshuai Yao, Shimon Whiteson
With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC, where the critics are linear and the actor can be nonlinear.
no code implementations • 13 May 2019 • Borislav Mavrin, Shangtong Zhang, Hengshuai Yao, Linglong Kong, Kaiwen Wu, Yao-Liang Yu
In distributional reinforcement learning (RL), the estimated distribution of the value function models both the parametric and intrinsic uncertainties.
1 code implementation • 12 May 2019 • Yuhang Song, Jianyi Wang, Thomas Lukasiewicz, Zhenghua Xu, Shangtong Zhang, Andrzej Wojcicki, Mai Xu
Intrinsic rewards were introduced to simulate how human intelligence works; they are usually evaluated by intrinsically motivated play, i.e., playing games without extrinsic rewards while still being evaluated with extrinsic rewards.
1 code implementation • 3 May 2019 • Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
We revisit residual algorithms in both model-free and model-based reinforcement learning settings.
1 code implementation • NeurIPS 2019 • Shangtong Zhang, Shimon Whiteson
We reformulate the option framework as two parallel augmented MDPs.
1 code implementation • NeurIPS 2019 • Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting.
1 code implementation • 6 Nov 2018 • Shangtong Zhang, Hao Chen, Hengshuai Yao
In this paper, we propose an actor ensemble algorithm, named ACE, for continuous control with a deterministic policy in reinforcement learning.
3 code implementations • 5 Nov 2018 • Shangtong Zhang, Borislav Mavrin, Linglong Kong, Bo Liu, Hengshuai Yao
In this paper, we propose the Quantile Option Architecture (QUOTA) for exploration based on recent advances in distributional reinforcement learning (RL).
1 code implementation • Journal of Open Source Software 2018 • Ryan R. Curtin, Marcus Edel, Mikhail Lozhnikov, Yannis Mentekidis, Sumedh Ghaisas, Shangtong Zhang
In the past several years, the field of machine learning has seen an explosion of interest and excitement, with hundreds or thousands of algorithms developed for different tasks every year.
4 code implementations • 4 Dec 2017 • Shangtong Zhang, Richard S. Sutton
Experience replay has recently been widely used in various deep reinforcement learning (RL) algorithms; in this paper, we rethink its utility.
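The mechanism under discussion can be sketched in a few lines (an illustrative buffer, not the paper's code): transitions are stored and later sampled uniformly in mini-batches, which decorrelates consecutive updates.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform experience-replay buffer."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within the batch.
        return random.sample(self.buffer, batch_size)

random.seed(0)
buf = ReplayBuffer()
for t in range(100):
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
```

A key design parameter is the buffer capacity: it trades off data reuse against the staleness of old transitions, which is central to any analysis of replay's utility.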
no code implementations • 30 Nov 2017 • Shangtong Zhang, Osmar R. Zaiane
Reinforcement learning and evolutionary strategies are two major approaches to addressing complicated control problems.
no code implementations • 9 Dec 2016 • Vivek Veeriah, Shangtong Zhang, Richard S. Sutton
In this paper, we introduce a new incremental learning algorithm, called crossprop, which learns the incoming weights of hidden units via meta-gradient descent, an approach previously introduced by Sutton (1992) and Schraudolph (1999) for learning step-sizes.