Value Function Estimation

## V-trace

Introduced by Espeholt et al. in *IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures*.

V-trace is an off-policy actor-critic reinforcement learning algorithm that corrects for the lag between when the actors generate actions and when the learner estimates the gradient. Consider a trajectory $\left(x_{t}, a_{t}, r_{t}\right)_{t=s}^{t=s+n}$ generated by an actor following some behaviour policy $\mu$. We define the $n$-step V-trace target for $V\left(x_{s}\right)$, our value approximation at state $x_{s}$, as:

$$v_{s} = V\left(x_{s}\right) + \sum^{s+n-1}_{t=s}\gamma^{t-s}\left(\prod^{t-1}_{i=s}c_{i}\right)\delta_{t}V$$

where $\delta_{t}V = \rho_{t}\left(r_{t} + \gamma{V}\left(x_{t+1}\right) - V\left(x_{t}\right)\right)$ is a temporal difference for $V$, and $\rho_{t} = \min\left(\bar{\rho}, \frac{\pi\left(a_{t}\mid{x_{t}}\right)}{\mu\left(a_{t}\mid{x_{t}}\right)}\right)$ and $c_{i} = \min\left(\bar{c}, \frac{\pi\left(a_{i}\mid{x_{i}}\right)}{\mu\left(a_{i}\mid{x_{i}}\right)}\right)$ are truncated importance sampling weights, with $\pi$ the target policy. We assume that the truncation levels are such that $\bar{\rho} \geq \bar{c}$.
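The target above can be computed efficiently with the backward recursion $v_{s} = V\left(x_{s}\right) + \delta_{s}V + \gamma c_{s}\left(v_{s+1} - V\left(x_{s+1}\right)\right)$, which is equivalent to the summed form. Below is a minimal NumPy sketch of this computation; the function name, argument layout, and default hyperparameters are illustrative choices, not part of the original specification.

```python
import numpy as np

def vtrace_targets(values, rewards, rhos, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute n-step V-trace targets v_s for every state in a trajectory.

    values:  V(x_s), ..., V(x_{s+n})      -- length n+1 value estimates
    rewards: r_s, ..., r_{s+n-1}          -- length n rewards
    rhos:    pi(a_t|x_t) / mu(a_t|x_t)    -- length n importance ratios
    """
    n = len(rewards)
    clipped_rhos = np.minimum(rho_bar, rhos)  # rho_t, truncated at rho_bar
    clipped_cs = np.minimum(c_bar, rhos)      # c_t, truncated at c_bar
    # delta_t V = rho_t * (r_t + gamma V(x_{t+1}) - V(x_t))
    deltas = clipped_rhos * (rewards + gamma * values[1:] - values[:n])
    # Backward recursion: (v_t - V(x_t)) = delta_t V + gamma c_t (v_{t+1} - V(x_{t+1}))
    acc = 0.0
    corrections = np.zeros(n)
    for t in reversed(range(n)):
        acc = deltas[t] + gamma * clipped_cs[t] * acc
        corrections[t] = acc
    return values[:n] + corrections
```

Note that when $\pi = \mu$ (on-policy) and $\bar{\rho}, \bar{c} \geq 1$, every weight equals 1 and the targets reduce to the ordinary $n$-step Bellman targets.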

#### Latest Papers

- **An Introduction of mini-AlphaStar** (2021-04-14): Ruo-Ze Liu, Wenhai Wang, Yanjie Shen, Zhiqi Li, Yang Yu, Tong Lu
- **Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm** (2021-02-18): Sajad Khodadadian, Zaiwei Chen, Siva Theja Maguluri
- **A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants** (2021-02-02): Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam
- **TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game** (2020-11-27): Lei Han, Jiechao Xiong, Peng Sun, Xinghai Sun, Meng Fang, Qingwei Guo, Qiaobo Chen, Tengfei Shi, Hongsheng Yu, Zhengyou Zhang
- **Adaptive Discretization for Continuous Control using Particle Filtering Policy Network** (2020-03-16): Pei Xu, Ioannis Karamouzas
- **A Self-Tuning Actor-Critic Algorithm** (2020-02-28): Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, Satinder Singh
- **Finite-Sample Analysis of Contractive Stochastic Approximation Using Smooth Convex Envelopes** (2020-02-03): Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam
- **IMPACT: Importance Weighted Asynchronous Architectures with Clipped Target Networks** (2019-11-30): Michael Luo, Jiahao Yao, Richard Liaw, Eric Liang, Ion Stoica
- **TorchBeast: A PyTorch Platform for Distributed RL** (2019-10-08): Heinrich Küttler, Nantas Nardelli, Thibaut Lavril, Marco Selvatici, Viswanath Sivakumar, Tim Rocktäschel, Edward Grefenstette
- **Off-Policy Actor-Critic with Shared Experience Replay** (2019-09-25): Simon Schmitt, Matteo Hessel, Karen Simonyan
- **Importance Resampling for Off-policy Prediction** (2019-06-11): Matthew Schlegel, Wesley Chung, Daniel Graves, Jian Qian, Martha White
- **Towards Combining On-Off-Policy Methods for Real-World Applications** (2019-04-24): Kai-Chun Hu, Chen-Huan Pi, Ting Han Wei, I-Chen Wu, Stone Cheng, Yi-Wei Dai, Wei-Yuan Ye
- **AlphaStar: An Evolutionary Computation Perspective** (2019-02-05): Kai Arulkumaran, Antoine Cully, Julian Togelius
- **IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures** (2018-02-05): Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, Koray Kavukcuoglu

#### Tasks

| Task | Papers | Share |
| --- | --- | --- |
| Starcraft II | 3 | 23.08% |
| Atari Games | 3 | 23.08% |
| Starcraft | 2 | 15.38% |
| Continuous Control | 2 | 15.38% |
| OpenAI Gym | 2 | 15.38% |
| Imitation Learning | 1 | 7.69% |
