CrossNorm: Normalization for Off-Policy TD Reinforcement Learning

14 Feb 2019  ·  Aditya Bhatt, Max Argus, Artemij Amiranashvili, Thomas Brox ·

Off-policy temporal difference (TD) methods are a powerful class of reinforcement learning (RL) algorithms. Intriguingly, deep off-policy TD algorithms are not commonly used in combination with feature normalization techniques, despite positive effects of normalization in other domains... We show that naive application of existing normalization techniques is indeed not effective, but that well-designed normalization improves optimization stability and removes the necessity of target networks. In particular, we introduce a normalization based on a mixture of on- and off-policy transitions, which we call cross-normalization. It can be regarded as an extension of batch normalization that re-centers data for two different distributions, as present in off-policy learning. Applied to DDPG and TD3, cross-normalization improves over the state of the art across a range of MuJoCo benchmark tasks. read more

PDF Abstract
No code implementations yet. Submit your code now



Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.