As the most successful variant of and improvement over Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO) has been widely applied across various domains, offering several advantages: efficient data utilization, easy implementation, and good parallelism. In this paper, we propose another powerful variant: a first-order gradient reinforcement learning algorithm called Policy Optimization with Penalized Point Probability Distance (POP3D), whose penalty term is a lower bound of the squared total variation divergence.
Firstly, we discuss the shortcomings of several commonly used algorithms, which partly motivate our method. Secondly, we show how POP3D overcomes these shortcomings. Thirdly, we examine its mechanism from the perspective of the solution manifold. Finally, we make quantitative comparisons with several state-of-the-art algorithms on common benchmarks. Simulation results show that POP3D is highly competitive with PPO. In addition, our code is released at https://github.com/paperwithcode/pop3d.