Proximal Policy Optimization

Introduced by Schulman et al. in Proximal Policy Optimization Algorithms

Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while using only first-order optimization.

Let $r_{t}\left(\theta\right)$ denote the probability ratio $r_{t}\left(\theta\right) = \frac{\pi_{\theta}\left(a_{t}\mid{s_{t}}\right)}{\pi_{\theta_{old}}\left(a_{t}\mid{s_{t}}\right)}$, so $r_{t}\left(\theta_{old}\right) = 1$. TRPO maximizes a “surrogate” objective:

$$L^{\text{CPI}}\left({\theta}\right) = \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t}\mid{s_{t}}\right)}{\pi_{\theta_{old}}\left(a_{t}\mid{s_{t}}\right)}\hat{A}_{t}\right] = \hat{\mathbb{E}}_{t}\left[r_{t}\left(\theta\right)\hat{A}_{t}\right]$$

Here, CPI refers to conservative policy iteration. Without a constraint, maximizing $L^{CPI}$ would lead to an excessively large policy update; hence, PPO modifies the objective to penalize changes to the policy that move $r_{t}\left(\theta\right)$ away from 1:

$$L^{\text{CLIP}}\left({\theta}\right) = \hat{\mathbb{E}}_{t}\left[\min\left(r_{t}\left(\theta\right)\hat{A}_{t}, \text{clip}\left(r_{t}\left(\theta\right), 1-\epsilon, 1+\epsilon\right)\hat{A}_{t}\right)\right]$$

where $\epsilon$ is a hyperparameter, say, $\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{CPI}$. The second term, $\text{clip}\left(r_{t}\left(\theta\right), 1-\epsilon, 1+\epsilon\right)\hat{A}_{t}$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_{t}$ outside of the interval $\left[1 - \epsilon, 1 + \epsilon\right]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.
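
The effect of the clipping can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation; the function names (`probability_ratio`, `clipped_term`, `clipped_surrogate`) and the choice of log-probabilities as inputs are assumptions made for illustration.

```python
import math


def probability_ratio(logp_new, logp_old):
    """r_t(theta), computed from log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t)."""
    return math.exp(logp_new - logp_old)


def clipped_term(ratio, advantage, epsilon=0.2):
    """Per-timestep term min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(unclipped, clipped_ratio * advantage)


def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Empirical mean of the clipped terms over a batch of timesteps."""
    terms = [
        clipped_term(probability_ratio(n, o), a, epsilon)
        for n, o, a in zip(logp_new, logp_old, advantages)
    ]
    return sum(terms) / len(terms)


# Four illustrative cases (hypothetical numbers): the ratio lies above or
# below the interval, with a positive or negative advantage estimate.
for ratio, advantage in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    print(ratio, advantage, clipped_term(ratio, advantage))
```

With $\epsilon = 0.2$, the pair $(r_{t}, \hat{A}_{t}) = (1.5, +1)$ contributes about $1.2$ instead of $1.5$, and $(0.5, -1)$ contributes about $-0.8$ instead of $-0.5$: in both cases the ratio has already moved in the direction that improves the objective, so the extra gain is ignored. The pairs $(0.5, +1)$ and $(1.5, -1)$ keep their unclipped values $0.5$ and $-1.5$, because there the change in ratio makes the objective worse.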

One detail to note is that when we apply PPO for a network where we have shared parameters for actor and critic functions, we typically add to the objective function an error term on value estimation and an entropy term to encourage exploration.
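
In the PPO paper this combined objective takes the form of the clipped surrogate minus $c_{1}$ times a squared-error value loss plus $c_{2}$ times an entropy bonus, all maximized together. The sketch below shows one way to assemble it for a discrete action space. The function names and the coefficient values `vf_coef=0.5` and `ent_coef=0.01` are illustrative assumptions, not values prescribed by the text above.

```python
import math


def entropy(probs):
    """Shannon entropy of one discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def combined_objective(clip_terms, values, value_targets, action_probs,
                       vf_coef=0.5, ent_coef=0.01):
    """Mean of: clipped term - vf_coef * squared value error + ent_coef * entropy.

    clip_terms:    per-timestep clipped surrogate terms (assumed precomputed)
    values:        critic predictions V(s_t)
    value_targets: regression targets for the critic
    action_probs:  one action distribution pi(. | s_t) per timestep
    """
    n = len(clip_terms)
    l_clip = sum(clip_terms) / n
    l_vf = sum((v - t) ** 2 for v, t in zip(values, value_targets)) / n
    s = sum(entropy(p) for p in action_probs) / n
    # The objective is maximized, so the value error enters with a minus sign
    # and the entropy bonus with a plus sign.
    return l_clip - vf_coef * l_vf + ent_coef * s


objective = combined_objective(
    clip_terms=[1.0, 0.0],
    values=[1.0, 2.0],
    value_targets=[0.0, 2.0],
    action_probs=[[0.5, 0.5], [1.0, 0.0]],
)
print(objective)
```

In practice the negated objective is usually handed to a gradient-based optimizer as a loss; the signs above follow the maximization convention used in the text.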

Proximal Policy Optimization Algorithms. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. 2017-07-20.