Policy Optimization With Penalized Point Probability Distance: An Alternative To Proximal Policy Optimization

As the most successful variant of and improvement on Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO) has been widely applied across various domains, offering efficient data utilization, easy implementation, and good parallelism. In this paper, a first-order gradient reinforcement learning algorithm called Policy Optimization with Penalized Point Probability Distance (POP3D) is proposed as another powerful variant; its penalty term is a lower bound of the squared total variation divergence...
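The abstract's idea can be sketched as a surrogate objective in the PPO family. The following is a minimal, hypothetical illustration only: it assumes the "penalized point probability distance" is the squared difference between the new and old policies' probabilities at the sampled action, subtracted from the standard importance-weighted advantage term with a penalty coefficient `beta`. The function name and signature are not from the paper.

```python
import numpy as np

def pop3d_surrogate(new_probs, old_probs, advantages, beta=1.0):
    """Hypothetical sketch of a POP3D-style surrogate objective (to be maximized).

    Assumes the penalty is the squared difference of the two policies'
    probabilities at each sampled action (the "point probability distance"),
    a quantity bounded above by the squared total variation divergence.
    """
    ratio = new_probs / old_probs               # importance-sampling ratio
    penalty = (new_probs - old_probs) ** 2      # point probability distance
    return np.mean(ratio * advantages - beta * penalty)

# When the two policies agree, the penalty vanishes and the objective
# reduces to the mean advantage.
p = np.array([0.5, 0.5])
loss = pop3d_surrogate(p, p, np.array([1.0, 2.0]))
```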
