## Trust Region Policy Optimization

Introduced by Schulman et al. in Trust Region Policy Optimization

Trust Region Policy Optimization, or TRPO, is a policy gradient method in reinforcement learning that avoids parameter updates that change the policy too much by enforcing a KL divergence constraint on the size of the policy update at each iteration.

Consider off-policy reinforcement learning, where the behavior policy $\beta$ used to collect trajectories on rollout workers differs from the policy $\pi$ being optimized. The off-policy objective function measures the total advantage over the state visitation distribution and actions, and the mismatch between the training data distribution and the true policy state distribution is compensated with an importance sampling estimator:

$$J\left(\theta\right) = \sum_{s\in{S}}p^{\pi_{\theta_{old}}}\sum_{a\in\mathcal{A}}\left(\pi_{\theta}\left(a\mid{s}\right)\hat{A}_{\theta_{old}}\left(s, a\right)\right)$$

$$J\left(\theta\right) = \sum_{s\in{S}}p^{\pi_{\theta_{old}}}\sum_{a\in\mathcal{A}}\left(\beta\left(a\mid{s}\right)\frac{\pi_{\theta}\left(a\mid{s}\right)}{\beta\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right)$$

$$J\left(\theta\right) = \mathbb{E}_{s\sim{p}^{\pi_{\theta_{old}}}, a\sim{\beta}} \left(\frac{\pi_{\theta}\left(a\mid{s}\right)}{\beta\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right)$$
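The importance sampling step above can be checked numerically. The following is a minimal sketch for a single state with three discrete actions; the probability and advantage values, and the helper names `exact_objective` and `importance_sampled_objective`, are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np


def exact_objective(pi, adv):
    """Sum over actions of pi(a|s) * A(s, a) for one fixed state."""
    return float(np.sum(pi * adv))


def importance_sampled_objective(pi, beta, adv, num_samples, rng):
    """Monte Carlo estimate using actions drawn from the behavior policy beta."""
    actions = rng.choice(len(beta), size=num_samples, p=beta)
    ratios = pi[actions] / beta[actions]  # importance weights pi / beta
    return float(np.mean(ratios * adv[actions]))


# Hypothetical target policy, behavior policy, and advantage estimates.
pi = np.array([0.7, 0.2, 0.1])
beta = np.array([0.3, 0.3, 0.4])
adv = np.array([1.0, -0.5, 2.0])

rng = np.random.default_rng(0)
exact = exact_objective(pi, adv)
estimate = importance_sampled_objective(pi, beta, adv, 200_000, rng)
print(exact, estimate)
```

With enough samples the weighted estimate should approach the exact sum, even though no action was drawn from $\pi_{\theta}$ itself. The sketch assumes $\beta$ assigns nonzero probability to every action that $\pi_{\theta}$ can take.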

When training on-policy, the policy used to collect data is, in theory, the same as the policy being optimized. However, when rollout workers and optimizers run asynchronously in parallel, the behavior policy can become stale. TRPO accounts for this subtle difference: it labels the behavior policy as $\pi_{\theta_{old}}\left(a\mid{s}\right)$, and the objective function becomes:

$$J\left(\theta\right) = \mathbb{E}_{s\sim{p}^{\pi_{\theta_{old}}}, a\sim{\pi_{\theta_{old}}}} \left(\frac{\pi_{\theta}\left(a\mid{s}\right)}{\pi_{\theta_{old}}\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right)$$
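In practice this expectation is usually estimated from a batch of sampled state-action pairs. The sketch below assumes log-probabilities under both policies and advantage estimates are already available as arrays; the function name `surrogate_objective` and the sample values are hypothetical.

```python
import numpy as np


def surrogate_objective(logp_new, logp_old, advantages):
    """Sample mean of (pi_theta / pi_theta_old) * advantage.

    Working with log-probabilities and exponentiating the difference is a
    common way to form the ratio; it is an implementation choice assumed here.
    """
    ratios = np.exp(logp_new - logp_old)
    return float(np.mean(ratios * advantages))


# Hypothetical batch of two sampled (state, action) pairs.
logp_old = np.log(np.array([0.5, 0.25]))
logp_new = np.log(np.array([0.25, 0.5]))
advantages = np.array([1.0, 2.0])

value = surrogate_objective(logp_new, logp_old, advantages)
baseline = surrogate_objective(logp_old, logp_old, advantages)
print(value, baseline)
```

When $\theta = \theta_{old}$ every ratio equals 1, so the estimate reduces to the mean advantage of the batch. Shifting probability toward the action with the larger advantage raises the estimate above that baseline.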

TRPO aims to maximize the objective function $J\left(\theta\right)$ subject to a trust region constraint, which requires the KL divergence between the old and new policies, averaged over states, to stay within a threshold $\delta$:

$$\mathbb{E}_{s\sim{p}^{\pi_{\theta_{old}}}} \left[D_{KL}\left(\pi_{\theta_{old}}\left(\cdot\mid{s}\right)\,\|\,\pi_{\theta}\left(\cdot\mid{s}\right)\right)\right] \leq \delta$$
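For discrete action spaces the constraint can be evaluated directly from the action probabilities at sampled states. The following is a minimal sketch, assuming strictly positive probabilities; `mean_kl`, `within_trust_region`, and the example values of the policies and $\delta$ are illustrative, not prescribed by the method.

```python
import numpy as np


def mean_kl(old_probs, new_probs):
    """Average over states of KL(old(.|s) || new(.|s)).

    Both arrays have shape (num_states, num_actions) and are assumed to
    contain strictly positive probabilities, so the logarithms are finite.
    """
    per_state = np.sum(old_probs * (np.log(old_probs) - np.log(new_probs)), axis=1)
    return float(np.mean(per_state))


def within_trust_region(old_probs, new_probs, delta):
    """True when the candidate policy satisfies the KL constraint."""
    return mean_kl(old_probs, new_probs) <= delta


# Hypothetical policies over two sampled states and two actions.
old_probs = np.array([[0.5, 0.5], [0.9, 0.1]])
small_step = np.array([[0.52, 0.48], [0.88, 0.12]])
large_step = np.array([[0.25, 0.75], [0.5, 0.5]])
delta = 0.01

print(within_trust_region(old_probs, small_step, delta))
print(within_trust_region(old_probs, large_step, delta))
```

The old policy is the first argument of the divergence, matching the order in the constraint. A candidate update that fails the check would be rejected or shrunk before being accepted.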

#### Latest Papers

| Paper | Authors | Date |
| --- | --- | --- |
| Optimization Issues in KL-Constrained Approximate Policy Iteration | Nevena Lazić, Botao Hao, Yasin Abbasi-Yadkori, Dale Schuurmans, Csaba Szepesvári | 2021-02-11 |
| A review of motion planning algorithms for intelligent robotics | Chengmin Zhou, Bingding Huang, Pasi Fränti | 2021-02-04 |
| Truly Deterministic Policy Optimization | Anonymous | 2021-01-01 |
| Tonic: A Deep Reinforcement Learning Library for Fast Prototyping and Benchmarking | Fabio Pardo | 2020-11-15 |
| Control with adaptive Q-learning | João Pedro Araújo, Mário A. T. Figueiredo, Miguel Ayala Botto | 2020-11-03 |
| Multi-Agent Trust Region Policy Optimization | Hepeng Li, Haibo He | 2020-10-15 |
| Faded-Experience Trust Region Policy Optimization for Model-Free Power Allocation in Interference Channel | Mohammad G. Khoshkholgh, Halim Yanikomeroglu | 2020-08-04 |
| Lagrangian Duality in Reinforcement Learning | Pranay Pasula | 2020-07-20 |
| Optimistic Distributionally Robust Policy Optimization | Jun Song, Chaoyue Zhao | 2020-06-14 |
| Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO | Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry | 2020-05-25 |
| Mirror Descent Policy Optimization | | 2020-05-20 |
| Implementation Matters in Deep RL: A Case Study on PPO and TRPO | Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry | 2020-05-01 |
| Analyzing Policy Distillation on Multi-Task Learning and Meta-Reinforcement Learning in Meta-World | Nathan Blair, Victor Chan, Adarsh Karnati | 2020-02-08 |
| Risk-Averse Trust Region Optimization for Reward-Volatility Reduction | Lorenzo Bisi, Luca Sabbioni, Edoardo Vittori, Matteo Papini, Marcello Restelli | 2019-12-06 |
| Neural Trust Region/Proximal Policy Optimization Attains Globally Optimal Policy | Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang | 2019-12-01 |
| Learning Reward Machines for Partially Observable Reinforcement Learning | Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, Sheila McIlraith | 2019-12-01 |
| Multi-step Greedy Reinforcement Learning Algorithms | | 2019-10-07 |
| Revisit Policy Optimization in Matrix Form | Sitao Luan, Xiao-Wen Chang, Doina Precup | 2019-09-19 |
| Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs | Lior Shani, Yonathan Efroni, Shie Mannor | 2019-09-06 |
| Hindsight Trust Region Policy Optimization | Hanbo Zhang, Site Bai, Xuguang Lan, David Hsu, Nanning Zheng | 2019-07-29 |
| Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy | Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang | 2019-06-25 |
| SUPERVISED POLICY UPDATE | Quan Vuong, Yiming Zhang, Keith W. Ross | 2019-05-01 |
| Policy Optimization via Stochastic Recursive Gradient Algorithm | Huizhuo Yuan, Chris Junchi Li, Yuhao Tang, Yuren Zhou | 2019-05-01 |
| Discretizing Continuous Action Space for On-Policy Optimization | Yunhao Tang, Shipra Agrawal | 2019-01-29 |
| On-Policy Trust Region Policy Optimisation with Replay Buffers | Dmitry Kangin, Nicolas Pugeault | 2019-01-18 |
| Multi-objective Model-based Policy Search for Data-efficient Learning with Sparse Rewards | Rituraj Kaushik, Konstantinos Chatzilygeroudis, Jean-Baptiste Mouret | 2018-06-25 |
| Supervised Policy Update for Deep Reinforcement Learning | Quan Vuong, Yiming Zhang, Keith W. Ross | 2018-05-29 |
| Variational Inference for Policy Gradient | Tianbing Xu | 2018-02-21 |
| Sample Efficient Deep Reinforcement Learning for Dialogue Systems with Large Action Spaces | Gellért Weisz, Paweł Budzianowski, Pei-Hao Su, Milica Gašić | 2018-02-11 |
| Pretraining Deep Actor-Critic Reinforcement Learning Algorithms With Expert Demonstrations | Xiaoqin Zhang, Huimin Ma | 2018-01-31 |
| Multi-task Learning with Gradient Guided Policy Specialization | Wenhao Yu, C. Karen Liu, Greg Turk | 2017-09-23 |
| Trust-PCL: An Off-Policy Trust Region Method for Continuous Control | Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans | 2017-07-06 |
| Parameter Space Noise for Exploration | Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, Marcin Andrychowicz | 2017-06-06 |
| A unified view of entropy-regularized Markov decision processes | Gergely Neu, Anders Jonsson, Vicenç Gómez | 2017-05-22 |
| Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks | Chelsea Finn, Pieter Abbeel, Sergey Levine | 2017-03-09 |
| Sample Efficient Actor-Critic with Experience Replay | Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, Nando de Freitas | 2016-11-03 |
| Trust Region Policy Optimization | John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel | 2015-02-19 |