Off-Policy TD Control

Q-Learning

Q-Learning is an off-policy temporal-difference (TD) control algorithm, defined by the one-step update:

$$Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right) + \alpha\left[R_{t+1} + \gamma\max_{a}Q\left(S_{t+1}, a\right) - Q\left(S_{t}, A_{t}\right)\right] $$

The learned action-value function $Q$ directly approximates $q_{*}$, the optimal action-value function, independent of the policy being followed.

Source: Sutton and Barto, Reinforcement Learning: An Introduction, 2nd Edition
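
Below is a minimal sketch of tabular Q-learning in Python. The environment (a tiny deterministic random-walk MDP), the `step` helper, and the hyperparameters are all illustrative assumptions rather than anything from the source; only the update line mirrors the equation above.

```python
import numpy as np

# Hypothetical 1-D random-walk MDP (illustrative, not from the source):
# states 0..4, actions 0 (left) / 1 (right), reward +1 for reaching state 4,
# episodes end at either edge.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def step(s, a):
    """Deterministic transition; returns (next_state, reward, done)."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next in (0, n_states - 1)
    return s_next, reward, done

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

for episode in range(500):
    s, done = n_states // 2, False
    while not done:
        # Epsilon-greedy behavior policy (exploration).
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Off-policy TD target: greedy max over next-state actions,
        # regardless of what the behavior policy will actually do next.
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q)  # the greedy policy should prefer action 1 (right) in interior states
```

The off-policy character is visible in the target: it bootstraps from $\max_{a} Q(S_{t+1}, a)$, the greedy value, even though the trajectory itself is generated by an epsilon-greedy behavior policy.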

Tasks


Task                                 Papers    Share
Reinforcement Learning (RL)             217   35.40%
Decision Making                          48    7.83%
Multi-agent Reinforcement Learning       28    4.57%
Management                               27    4.40%
Offline RL                               25    4.08%
Atari Games                              16    2.61%
OpenAI Gym                               13    2.12%
Autonomous Driving                       11    1.79%
Imitation Learning                       11    1.79%
