Markov Chain Monte Carlo Policy Optimization
Discovering near-optimal policies is central to applying reinforcement learning (RL) in many real-world scenarios; this task is termed policy optimization. Viewed from the perspective of variational inference, the representational power of the policy network lets us approximate the posterior over actions conditioned on the states, with entropy or KL regularization. In practice, however, policy optimization may yield suboptimal policy estimates due to the amortization gap: a single shared policy network cannot fit the optimal action distribution at every state. Inspired by Markov Chain Monte Carlo (MCMC) techniques, we propose a new policy optimization method that, instead of optimizing policy parameters or policy distributions directly, incorporates gradient-based feedback in various ways. Empirical evaluation on many continuous control benchmarks verifies the performance improvement of the proposed method.
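The abstract does not spell out the algorithm, but the core idea (refining an amortized policy's action proposal with gradient-informed MCMC steps) can be sketched as follows. This is a minimal illustration, not the paper's method: the toy critic `q_value`, the Metropolis-adjusted Langevin (MALA) refinement, and all hyperparameters are assumptions, targeting the soft-optimal action distribution p(a|s) ∝ exp(Q(s,a)/α).

```python
import numpy as np

def q_value(state, action):
    # Toy critic: quadratic bowl whose optimum depends on the state.
    target = np.tanh(state)
    return -np.sum((action - target) ** 2)

def grad_q(state, action):
    # Gradient of the toy critic with respect to the action.
    target = np.tanh(state)
    return -2.0 * (action - target)

def mala_refine(state, action0, alpha=0.1, step=0.05, n_steps=50, rng=None):
    """Refine the amortized proposal `action0` toward p(a|s) ∝ exp(Q/alpha)."""
    rng = rng or np.random.default_rng(0)
    a = action0.copy()
    log_p = lambda x: q_value(state, x) / alpha
    for _ in range(n_steps):
        # Langevin proposal: a gradient-ascent step on Q plus Gaussian noise.
        mean = a + step * grad_q(state, a) / alpha
        prop = mean + np.sqrt(2 * step) * rng.standard_normal(a.shape)
        # Metropolis-Hastings correction keeps the chain unbiased.
        mean_rev = prop + step * grad_q(state, prop) / alpha
        log_fwd = -np.sum((prop - mean) ** 2) / (4 * step)
        log_rev = -np.sum((a - mean_rev) ** 2) / (4 * step)
        if np.log(rng.uniform()) < log_p(prop) - log_p(a) + log_rev - log_fwd:
            a = prop
    return a

state = np.array([1.5, -0.5])
proposal = np.zeros(2)           # stand-in for the policy network's output
refined = mala_refine(state, proposal)
# The refined action scores higher under Q than the raw amortized proposal,
# illustrating how MCMC can close part of the amortization gap at test time.
print(q_value(state, refined) > q_value(state, proposal))
```

The gradient of the critic plays the role of the "gradient-based feedback" mentioned in the abstract; the MH correction step is one standard way to keep the refined samples consistent with the target distribution.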