Retrace

Introduced by Munos et al. in Safe and Efficient Off-Policy Reinforcement Learning

Retrace is an off-policy Q-value estimation algorithm with guaranteed convergence for any pair of target and behaviour policies $\left(\pi, \beta\right)$. When TD learning uses rollouts collected off-policy, the update must be corrected with importance sampling:

$$ \Delta{Q}^{\text{imp}}\left(S_{t}, A_{t}\right) = \gamma^{t}\prod_{1\leq{\tau}\leq{t}}\frac{\pi\left(A_{\tau}\mid{S_{\tau}}\right)}{\beta\left(A_{\tau}\mid{S_{\tau}}\right)}\delta_{t} $$
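To make the update above concrete, here is a minimal sketch in plain Python. The function name, the list-based trajectory layout, and the default `gamma` are assumptions made for illustration and do not come from the paper. Index `t = 0` uses the empty product, so its weight is 1.

```python
def importance_weighted_updates(pi_probs, beta_probs, td_errors, gamma=0.99):
    """Compute gamma^t * prod_{1 <= tau <= t} pi/beta * delta_t for each step t.

    pi_probs[t] and beta_probs[t] are the probabilities that the target and
    behaviour policies assign to the action actually taken at step t
    (hypothetical input layout, chosen for this sketch).
    """
    updates = []
    weight = 1.0  # empty product at t = 0
    for t, delta in enumerate(td_errors):
        if t >= 1:
            # Extend the running product with the ratio for step t.
            weight *= pi_probs[t] / beta_probs[t]
        updates.append((gamma ** t) * weight * delta)
    return updates


# The target policy likes the taken actions 4x more than the behaviour policy,
# so the weight grows geometrically along the trajectory.
example = importance_weighted_updates(
    pi_probs=[0.5, 0.8, 0.8],
    beta_probs=[0.5, 0.2, 0.2],
    td_errors=[1.0, 1.0, 1.0],
    gamma=1.0,
)
```

With identical TD errors at every step, the magnitudes in `example` grow with each additional ratio in the product, which is the variance problem in miniature.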

Here $\delta_{t}$ is the TD error at step $t$. The product of importance ratios can have high variance, so Retrace modifies $\Delta{Q}$ by truncating each importance weight at a constant $c$:

$$ \Delta{Q}^{\text{ret}}\left(S_{t}, A_{t}\right) = \gamma^{t}\prod_{1\leq{\tau}\leq{t}}\min\left(c, \frac{\pi\left(A_{\tau}\mid{S_{\tau}}\right)}{\beta\left(A_{\tau}\mid{S_{\tau}}\right)}\right)\delta_{t} $$
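The truncated update can be sketched the same way in plain Python. The function name, the list-based inputs, and the defaults `gamma=0.99` and `c=1.0` are assumptions for illustration, not the paper's reference implementation. Each per-step ratio is clipped at `c` before it enters the running product, so with `c = 1` no factor in the product exceeds 1.

```python
def truncated_updates(pi_probs, beta_probs, td_errors, gamma=0.99, c=1.0):
    """Compute gamma^t * prod_{1 <= tau <= t} min(c, pi/beta) * delta_t per step.

    pi_probs[t] and beta_probs[t] are the target and behaviour probabilities
    of the action taken at step t (hypothetical input layout for this sketch).
    """
    updates = []
    weight = 1.0  # empty product at t = 0
    for t, delta in enumerate(td_errors):
        if t >= 1:
            ratio = pi_probs[t] / beta_probs[t]
            # Truncate the ratio at c before multiplying it in.
            weight *= min(c, ratio)
        updates.append((gamma ** t) * weight * delta)
    return updates


# Step 1 has a large ratio (clipped to c = 1); step 2 has ratio 0.5 (kept).
example = truncated_updates(
    pi_probs=[0.5, 0.9, 0.2],
    beta_probs=[0.5, 0.1, 0.4],
    td_errors=[1.0, 1.0, 1.0],
    gamma=1.0,
    c=1.0,
)
# example == [1.0, 1.0, 0.5]
```

Ratios above `c` are capped, which bounds the product, while ratios below `c` pass through unchanged, so actions that are unlikely under the target policy still shrink the weight.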

Source: Safe and Efficient Off-Policy Reinforcement Learning

Latest Papers

| Paper | Authors | Date |
| --- | --- | --- |
| Exploiting the potential of deep reinforcement learning for classification tasks in high-dimensional and unstructured data | Johan S. Obando-Ceron, Victor Romero Cano, Walter Mayor Toro | 2019-12-20 |
| Learning Reward Machines for Partially Observable Reinforcement Learning | Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, Sheila McIlraith | 2019-12-01 |
| Gap-Increasing Policy Evaluation for Efficient and Noise-Tolerant Reinforcement Learning | Tadashi Kozuno, Dongqi Han, Kenji Doya | 2019-06-18 |
| Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target | J. Fernando Hernandez-Garcia, Richard S. Sutton | 2019-01-22 |
| Sample Efficient Deep Reinforcement Learning for Dialogue Systems with Large Action Spaces | Gellért Weisz, Paweł Budzianowski, Pei-Hao Su, Milica Gašić | 2018-02-11 |
| Pretraining Deep Actor-Critic Reinforcement Learning Algorithms With Expert Demonstrations | Xiaoqin Zhang, Huimin Ma | 2018-01-31 |
| The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning | Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, Remi Munos | 2017-04-15 |
| Sample Efficient Actor-Critic with Experience Replay | Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, Nando de Freitas | 2016-11-03 |
| Safe and Efficient Off-Policy Reinforcement Learning | Rémi Munos, Tom Stepleton, Anna Harutyunyan, Marc G. Bellemare | 2016-06-08 |
