Hellinger Distance Constrained Regression

1 Jan 2021 · Egor Rotinov

This paper introduces an off-policy reinforcement learning method that uses the Hellinger distance between the sampling policy and the current policy as a constraint. Twice the squared Hellinger distance is greater than or equal to the squared total variation distance and less than or equal to the Kullback-Leibler divergence, so the resulting lower bound on the expected discounted return of the new policy is tighter than the KL-based one. In addition, the Hellinger distance is bounded above by 1, which yields a policy-independent lower bound on the expected discounted return. HDCR can be trained with Experience Replay, a common setting in distributed RL where trajectories are collected with different policies and learning from this data is centralized. HDCR performs comparably to or better than the Advantage-weighted Behavior Model and Advantage-Weighted Regression on MuJoCo tasks, using offline datasets collected by random agents and datasets obtained during the first iterations of online training of the HDCR agent.
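The inequality chain invoked by the abstract can be checked numerically. The sketch below is illustrative only and assumes the common normalization H^2(p, q) = 1 - sum_i sqrt(p_i * q_i) for discrete distributions; the example distributions and variable names are not taken from the paper.

import numpy as np

# Two discrete distributions standing in for a sampling policy and a current policy
# at a fixed state (illustrative values, not from the paper).
p = np.array([0.5, 0.3, 0.2])   # sampling policy
q = np.array([0.4, 0.4, 0.2])   # current policy

# Squared Hellinger distance: H^2(p, q) = 1 - sum_i sqrt(p_i * q_i).
hellinger_sq = 1.0 - np.sum(np.sqrt(p * q))

# Total variation distance: TV(p, q) = 0.5 * sum_i |p_i - q_i|.
tv = 0.5 * np.sum(np.abs(p - q))

# Kullback-Leibler divergence: KL(p || q) = sum_i p_i * log(p_i / q_i).
kl = np.sum(p * np.log(p / q))

# The chain used in the abstract: TV^2 <= 2 H^2 <= KL, and H <= 1,
# which is what gives a policy-independent bound.
assert tv ** 2 <= 2 * hellinger_sq <= kl
assert np.sqrt(hellinger_sq) <= 1.0
print(f"TV^2 = {tv**2:.4f}, 2H^2 = {2*hellinger_sq:.4f}, KL = {kl:.4f}")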
