# Policy Gradient Methods

## A3C

Introduced by Mnih et al. in *Asynchronous Methods for Deep Reinforcement Learning*.

A3C (Asynchronous Advantage Actor-Critic) is a policy gradient algorithm in reinforcement learning that maintains a policy $\pi\left(a_{t}\mid{s}_{t}; \theta\right)$ and an estimate of the value function $V\left(s_{t}; \theta_{v}\right)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{\text{max}}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\nabla_{\theta{'}}\log\pi\left(a_{t}\mid{s_{t}}; \theta{'}\right)A\left(s_{t}, a_{t}; \theta, \theta_{v}\right)$, where $A\left(s_{t}, a_{t}; \theta, \theta_{v}\right)$ is an estimate of the advantage function given by:

$$\sum^{k-1}_{i=0}\gamma^{i}r_{t+i} + \gamma^{k}V\left(s_{t+k}; \theta_{v}\right) - V\left(s_{t}; \theta_{v}\right)$$

where $k$ can vary from state to state and is upper-bounded by $t_{\text{max}}$.
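The advantage estimate above can be computed by sweeping backwards over a rollout of at most $t_{\text{max}}$ steps, accumulating the discounted return from a bootstrapped tail value. A minimal NumPy sketch (the function name and signature are illustrative, not from the paper):

```python
import numpy as np

def nstep_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """k-step advantage estimates, A(s_i, a_i) = R_i - V(s_i).

    rewards:         r_t ... r_{t+k-1} collected over at most t_max steps
    values:          critic outputs V(s_t) ... V(s_{t+k-1})
    bootstrap_value: V(s_{t+k}), or 0.0 if a terminal state was reached
    """
    R = bootstrap_value
    advantages = np.zeros(len(rewards))
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R       # discounted k-step return from step i
        advantages[i] = R - values[i]    # subtract the baseline V(s_i)
    return advantages
```

Note that earlier steps in the rollout automatically use larger $k$ (more real rewards before bootstrapping), which is exactly the varying-$k$ mix of $n$-step returns described above.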

The critic in A3C learns the value function while multiple actors are trained in parallel, each periodically synchronizing with the global parameters. Gradients are accumulated over several steps before being applied, which stabilizes training; the scheme can be viewed as a parallelized form of stochastic gradient descent.
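The worker loop can be sketched with plain threads: each worker copies the global parameters, accumulates a gradient over $t_{\text{max}}$ steps, applies it to the shared parameters, and re-syncs. This is a structural sketch only; the random vectors stand in for the real policy and value gradients, and all names are illustrative:

```python
import threading
import numpy as np

global_theta = np.zeros(4)   # shared global parameters (a single weight vector here)
lock = threading.Lock()

def worker(n_updates, t_max=5, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    local_theta = global_theta.copy()        # sync with the global parameters
    for _ in range(n_updates):
        grad = np.zeros_like(local_theta)
        for _ in range(t_max):
            # Stand-in for the per-step policy/value gradients; accumulated
            # locally and applied as one update, as in A3C.
            grad += rng.normal(size=local_theta.shape)
        with lock:
            global_theta[:] -= lr * grad     # apply the accumulated gradient
            local_theta = global_theta.copy()  # re-sync after every update

threads = [threading.Thread(target=worker, args=(10,), kwargs={"seed": s})
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The original algorithm applies updates lock-free (Hogwild!-style); the lock here simply keeps the sketch deterministic-safe with NumPy's non-atomic in-place ops.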

Note that while the parameters $\theta$ of the policy and $\theta_{v}$ of the value function are shown as separate for generality, in practice some of the parameters are always shared. Typically, a convolutional neural network is used with one softmax output for the policy $\pi\left(a_{t}\mid{s}_{t}; \theta\right)$ and one linear output for the value function $V\left(s_{t}; \theta_{v}\right)$, with all non-output layers shared.
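The shared-trunk architecture can be illustrated with a tiny NumPy model. The convolutional trunk from the paper is replaced by a single dense layer for brevity, and the class and parameter names are assumptions for this sketch:

```python
import numpy as np

class ActorCriticNet:
    """Shared non-output layer feeding a softmax policy head and a
    linear value head (a toy stand-in for the conv net in A3C)."""

    def __init__(self, obs_dim, n_actions, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W_shared = rng.normal(scale=0.1, size=(obs_dim, hidden))  # shared layer
        self.W_pi = rng.normal(scale=0.1, size=(hidden, n_actions))    # policy head
        self.W_v = rng.normal(scale=0.1, size=(hidden, 1))             # value head

    def forward(self, obs):
        h = np.tanh(obs @ self.W_shared)        # shared representation
        logits = h @ self.W_pi
        policy = np.exp(logits - logits.max())
        policy /= policy.sum()                  # softmax output: pi(a|s)
        value = float(h @ self.W_v)             # linear output: V(s)
        return policy, value
```

Because the trunk is shared, one backward pass through `W_shared` serves both the policy-gradient term and the value-function loss.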

### Latest Papers

| Paper | Authors | Date |
| --- | --- | --- |
| Visual Explanation using Attention Mechanism in Actor-Critic-based Deep Reinforcement Learning | Hidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura | 2021-03-06 |
| A review of motion planning algorithms for intelligent robotics | Chengmin Zhou, Bingding Huang, Pasi Fränti | 2021-02-04 |
| Asynchronous Advantage Actor Critic: Non-asymptotic Analysis and Linear Speedup | Han Shen, Kaiqing Zhang, Mingyi Hong, Tianyi Chen | 2020-12-31 |
| Dynamic Scheduling for Stochastic Edge-Cloud Computing Environments using A3C learning and Residual Recurrent Neural Networks | Shreshth Tuli, Shashikant Ilager, Kotagiri Ramamohanarao, Rajkumar Buyya | 2020-09-01 |
| Lagrangian Duality in Reinforcement Learning | Pranay Pasula | 2020-07-20 |
| Adaptive Discretization for Continuous Control using Particle Filtering Policy Network | Pei Xu, Ioannis Karamouzas | 2020-03-16 |
| Explore and Exploit with Heterotic Line Bundle Models | Magdalena Larfors, Robin Schneider | 2020-03-10 |
| Fully Asynchronous Policy Evaluation in Distributed Reinforcement Learning over Networks | Xingyu Sha, Jia-Qi Zhang, Keyou You, Kaiqing Zhang, Tamer Başar | 2020-03-01 |
| A Visual Communication Map for Multi-Agent Deep Reinforcement Learning | Ngoc Duy Nguyen, Thanh Thi Nguyen, Doug Creighton, Saeid Nahavandi | 2020-02-27 |
| Intelligent Roundabout Insertion using Deep Reinforcement Learning | Alessandro Paolo Capasso, Giulio Bacchiani, Daniele Molinari | 2020-01-03 |
| Intelligent Coordination among Multiple Traffic Intersections Using Multi-Agent Reinforcement Learning | Ujwal Padam Tewari, Vishal Bidawatka, Varsha Raveendran, Vinay Sudhakaran, Shreedhar Kodate Shreeshail, Jayanth Prakash Kulkarni | 2019-12-09 |
| Adversary A3C for Robust Reinforcement Learning | Zhaoyuan Gu, Zhenzhong Jia, Howie Choset | 2019-12-01 |
| Learning Reward Machines for Partially Observable Reinforcement Learning | Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, Sheila McIlraith | 2019-12-01 |
| VUSFA: Variational Universal Successor Features Approximator to Improve Transfer DRL for Target Driven Visual Navigation | Shamane Siriwardhana, Rivindu Weerasakera, Denys J. C. Matthies, Suranga Nanayakkara | 2019-08-18 |
| Incremental Reinforcement Learning --- a New Continuous Reinforcement Learning Frame Based on Stochastic Differential Equation methods | Tianhao Chen, Limei Cheng, Yang Liu, Wenchuan Jia, Shugen Ma | 2019-08-08 |
| Terminal Prediction as an Auxiliary Task for Deep Reinforcement Learning | Bilal Kartal, Pablo Hernandez-Leal, Matthew E. Taylor | 2019-07-24 |
| Agent Modeling as Auxiliary Task for Deep Reinforcement Learning | Pablo Hernandez-Leal, Bilal Kartal, Matthew E. Taylor | 2019-07-22 |
| Jointly Pre-training with Supervised, Autoencoder, and Value Losses for Deep Reinforcement Learning | Gabriel V. de la Cruz Jr., Yunshu Du, Matthew E. Taylor | 2019-04-03 |
| Combinational Q-Learning for Dou Di Zhu | Yang You, Liangwei Li, Baisong Guo, Weiming Wang, Cewu Lu | 2019-01-24 |
| Using Monte Carlo Tree Search as a Demonstrator within Asynchronous Deep RL | Bilal Kartal, Pablo Hernandez-Leal, Matthew E. Taylor | 2018-11-30 |
| Single-Agent Policy Tree Search With Guarantees | Laurent Orseau, Levi H. S. Lelis, Tor Lattimore, Théophane Weber | 2018-11-27 |
| Gradient Band-based Adversarial Training for Generalized Attack Immunity of A3C Path Finding | Tong Chen, Wenjia Niu, Yingxiao Xiang, Xiaoxuan Bai, Jiqiang Liu, Zhen Han, Gang Li | 2018-07-18 |
| Crawling in Rogue's dungeons with (partitioned) A3C | Andrea Asperti, Daniele Cortesi, Francesco Sovrano | 2018-04-23 |
| A Brandom-ian view of Reinforcement Learning towards strong-AI | Atrisha Sarkar | 2018-03-07 |
| Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning | Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O. Stanley, Jeff Clune | 2017-12-18 |
| Natural Value Approximators: Learning when to Trust Past Estimates | Zhongwen Xu, Joseph Modayil, Hado P. Van Hasselt, Andre Barreto, David Silver, Tom Schaul | 2017-12-01 |
| Teaching a Machine to Read Maps with Deep Reinforcement Learning | Gino Brunner, Oliver Richter, Yuyi Wang, Roger Wattenhofer | 2017-11-20 |
| Improving Search through A3C Reinforcement Learning based Conversational Agent | Milan Aggarwal, Aarushi Arora, Shagun Sodhani, Balaji Krishnamurthy | 2017-09-17 |
| DARLA: Improving Zero-Shot Transfer in Reinforcement Learning | Irina Higgins, Arka Pal, Andrei A. Rusu, Loic Matthey, Christopher P. Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, Alexander Lerchner | 2017-07-26 |
| Noisy Networks for Exploration | Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg | 2017-06-30 |
| Learning to Factor Policies and Action-Value Functions: Factored Action Space Representations for Deep Reinforcement learning | Sahil Sharma, Aravind Suresh, Rahul Ramesh, Balaraman Ravindran | 2017-05-20 |
| Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning | Nat Dilokthanakul, Christos Kaplanis, Nick Pawlowski, Murray Shanahan | 2017-05-18 |
| Equivalence Between Policy Gradients and Soft Q-Learning | John Schulman, Xi Chen, Pieter Abbeel | 2017-04-21 |
| The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning | Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, Remi Munos | 2017-04-15 |
| Tactics of Adversarial Attack on Deep Reinforcement Learning Agents | Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, Min Sun | 2017-03-08 |
| Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU | Mohammad Babaeizadeh, Iuri Frosio, Stephen Tyree, Jason Clemons, Jan Kautz | 2016-11-18 |
| Asynchronous Methods for Deep Reinforcement Learning | Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu | 2016-02-04 |