On-Policy TD Control

Sarsa Lambda

Sarsa($\lambda$) extends eligibility traces to action-value methods. It has the same update rule as TD($\lambda$), but uses the action-value form of the TD error:

$$ \delta_{t} = R_{t+1} + \gamma\hat{q}\left(S_{t+1}, A_{t+1}, \mathbf{w}_{t}\right) - \hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right) $$

and the action-value form of the eligibility trace:

$$ \mathbf{z}_{-1} = \mathbf{0} $$

$$ \mathbf{z}_{t} = \gamma\lambda\mathbf{z}_{t-1} + \nabla\hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right), \quad 0 \leq t \leq T $$
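The update rule combining these pieces can be sketched in code. Below is a minimal, hypothetical example of semi-gradient Sarsa(λ) with linear function approximation on a toy 5-state chain MDP (the environment, one-hot features, and ε-greedy policy are illustrative assumptions, not from the source):

```python
import numpy as np

# Toy setup (assumptions): 5-state chain, actions 0 = left / 1 = right,
# reward +1 for moving right out of the last state, one-hot features.
N_STATES, N_ACTIONS = 5, 2
GAMMA, LAM, ALPHA, EPS = 0.9, 0.8, 0.1, 0.1
rng = np.random.default_rng(0)

def features(s, a):
    """One-hot feature vector x(s, a); with linear q-hat, grad q-hat = x."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def step(s, a):
    """Chain dynamics: moving right from the last state ends the episode."""
    if a == 1:
        if s == N_STATES - 1:
            return None, 1.0          # terminal transition, reward +1
        return s + 1, 0.0
    return max(s - 1, 0), 0.0

def q(w, s, a):
    return w @ features(s, a)

def policy(w, s):
    """Epsilon-greedy action selection."""
    if rng.random() < EPS:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([q(w, s, a) for a in range(N_ACTIONS)]))

w = np.zeros(N_STATES * N_ACTIONS)
for episode in range(200):
    s, a = 0, policy(w, 0)
    z = np.zeros_like(w)              # eligibility trace z_{-1} = 0
    while s is not None:
        s_next, r = step(s, a)
        delta = r - q(w, s, a)        # TD error, terminal part so far
        z = GAMMA * LAM * z + features(s, a)   # z_t = gamma*lambda*z_{t-1} + grad
        if s_next is not None:
            a_next = policy(w, s_next)
            delta += GAMMA * q(w, s_next, a_next)  # bootstrap on (S_{t+1}, A_{t+1})
            s, a = s_next, a_next
        else:
            s = None                  # episode ends; no bootstrap term
        w += ALPHA * delta * z        # w_{t+1} = w_t + alpha * delta_t * z_t
```

After training, the learned action values should prefer moving right toward the rewarding terminal state. Note that with one-hot features this reduces to tabular Sarsa(λ) with accumulating traces.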

Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
