Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning

We present DrQ-v2, a model-free reinforcement learning (RL) algorithm for visual continuous control. DrQ-v2 builds on DrQ, an off-policy actor-critic approach that uses data augmentation to learn directly from pixels. We introduce several improvements that yield state-of-the-art results on the DeepMind Control Suite. Notably, DrQ-v2 solves complex humanoid locomotion tasks directly from pixel observations, a feat previously unattained by model-free RL. DrQ-v2 is conceptually simple, easy to implement, and has a significantly smaller computational footprint than prior work, with the majority of tasks taking just 8 hours to train on a single GPU. Finally, we publicly release DrQ-v2's implementation to provide RL practitioners with a strong and computationally efficient baseline.
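For concreteness, the data augmentation the abstract refers to is a random shift of the pixel observation: each image is padded at the borders and then cropped back to its original size at a random offset. The sketch below is illustrative, not the authors' exact code; the helper name `random_shift` and the standalone usage are assumptions, and the pad size of 4 follows the paper's setting.

```python
# Minimal sketch of DrQ-style random-shift augmentation, assuming PyTorch.
import torch
import torch.nn.functional as F

def random_shift(obs: torch.Tensor, pad: int = 4) -> torch.Tensor:
    """Pad each image by `pad` pixels (replicating the border) and take a
    random crop of the original size, shifting the content by up to +/-pad."""
    n, c, h, w = obs.shape
    padded = F.pad(obs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(obs)
    for i in range(n):
        # Independent random offset per image in the batch.
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

# Example: augment a batch of stacked 84x84 RGB frames twice, as
# data-augmented actor-critic updates typically do.
obs = torch.rand(32, 9, 84, 84)  # 3 stacked RGB frames per observation
aug_a, aug_b = random_shift(obs), random_shift(obs)
```

The released implementation realizes the same shift with a single vectorized bilinear `grid_sample` call rather than a Python loop; the loop here only keeps the sketch short.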

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^5 frames) | DDPG (DrQ-v2) | Walker (mean normalized return) | 14.18±8.68 | #6 |
| | | | Quadruped (mean normalized return) | 25.07±7.80 | #2 |
| | | | Jaco (mean normalized return) | 15.33±4.29 | #3 |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^6 frames) | DDPG (DrQ-v2) | Walker (mean normalized return) | 14.18±8.68 | #6 |
| | | | Quadruped (mean normalized return) | 25.07±7.80 | #4 |
| | | | Jaco (mean normalized return) | 15.33±4.29 | #5 |
| Unsupervised Reinforcement Learning | URLB (pixels, 2*10^6 frames) | DDPG (DrQ-v2) | Walker (mean normalized return) | 14.18±8.68 | #7 |
| | | | Quadruped (mean normalized return) | 25.07±7.80 | #5 |
| | | | Jaco (mean normalized return) | 15.33±4.29 | #6 |
| Unsupervised Reinforcement Learning | URLB (pixels, 5*10^5 frames) | DDPG (DrQ-v2) | Walker (mean normalized return) | 14.18±8.68 | #6 |
| | | | Quadruped (mean normalized return) | 25.07±7.80 | #4 |
| | | | Jaco (mean normalized return) | 15.33±4.29 | #5 |
| Unsupervised Reinforcement Learning | URLB (states, 10^5 frames) | DDPG (DrQ-v2) | Walker (mean normalized return) | 73.68±31.29 | #7 |
| | | | Quadruped (mean normalized return) | 28.33±9.01 | #7 |
| | | | Jaco (mean normalized return) | 49.14±8.22 | #6 |
| Unsupervised Reinforcement Learning | URLB (states, 10^6 frames) | DDPG (DrQ-v2) | Walker (mean normalized return) | 73.68±31.29 | #7 |
| | | | Quadruped (mean normalized return) | 28.33±9.01 | #9 |
| | | | Jaco (mean normalized return) | 49.14±8.22 | #7 |
| Unsupervised Reinforcement Learning | URLB (states, 2*10^6 frames) | DDPG (DrQ-v2) | Walker (mean normalized return) | 73.68±31.29 | #5 |
| | | | Quadruped (mean normalized return) | 22.63±8.29 | #9 |
| | | | Jaco (mean normalized return) | 49.14±8.22 | #6 |
| Unsupervised Reinforcement Learning | URLB (states, 5*10^5 frames) | DDPG (DrQ-v2) | Walker (mean normalized return) | 73.68±31.29 | #7 |
| | | | Quadruped (mean normalized return) | 28.33±9.01 | #9 |
| | | | Jaco (mean normalized return) | 49.14±8.22 | #7 |
