Methods > Reinforcement Learning > Policy Gradient Methods

A3C (Asynchronous Advantage Actor-Critic) is a policy gradient algorithm in reinforcement learning that maintains a policy $\pi\left(a_{t}\mid s_{t}; \theta\right)$ and an estimate of the value function $V\left(s_{t}; \theta_{v}\right)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{\text{max}}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\nabla_{\theta'}\log\pi\left(a_{t}\mid s_{t}; \theta'\right)A\left(s_{t}, a_{t}; \theta, \theta_{v}\right)$, where $A\left(s_{t}, a_{t}; \theta, \theta_{v}\right)$ is an estimate of the advantage function given by:

$$\sum^{k-1}_{i=0}\gamma^{i}r_{t+i} + \gamma^{k}V\left(s_{t+k}; \theta_{v}\right) - V\left(s_{t}; \theta_{v}\right)$$

where $k$ can vary from state to state and is upper-bounded by $t_{\text{max}}$.
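The advantage estimates above can be computed in a single backward pass over a rollout, since the discounted reward sum for each step contains the sum for the step after it. Below is a minimal sketch; the function name and signature are illustrative, not from the paper:

```python
def nstep_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """n-step advantage estimates for a rollout of up to t_max steps.

    rewards:         r_t, ..., r_{t+k-1} collected during the rollout
    values:          V(s_t), ..., V(s_{t+k-1}) predicted by the critic
    bootstrap_value: V(s_{t+k}), or 0.0 if the rollout hit a terminal state
    """
    R = bootstrap_value
    advantages = [0.0] * len(rewards)
    # Walk backwards so R accumulates the discounted n-step return
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R
        advantages[i] = R - values[i]
    return advantages
```

Each entry is exactly the expression above: the discounted reward sum plus the bootstrapped value of the last state, minus the critic's estimate for the current state.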

The critic in A3C learns the value function while multiple actor-learners are trained in parallel, each periodically synchronizing with the global parameters. Gradients are accumulated locally over each rollout before being applied to the shared parameters, which stabilizes training and resembles a parallelized form of stochastic gradient descent.
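The sync/accumulate/apply cycle of the parallel workers can be illustrated with a toy sketch. Everything here is a stand-in: the "gradient" is a dummy that pulls parameters toward a fixed target rather than the result of backpropagation, and updates are applied lock-free in the Hogwild style the paper uses:

```python
import threading

# Shared (global) parameter vector, updated lock-free by several workers.
global_theta = [0.0, 0.0]

def worker(n_updates, lr=0.01):
    for _ in range(n_updates):
        # 1. Sync: copy the current global parameters into local ones
        local_theta = list(global_theta)
        # 2. Roll out up to t_max steps and accumulate gradients.
        #    (Stand-in gradient: pull parameters toward [1.0, 1.0].)
        grad = [t - 1.0 for t in local_theta]
        # 3. Apply the accumulated gradient to the *global* parameters
        for j, g in enumerate(grad):
            global_theta[j] -= lr * g

threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the workers finish, `global_theta` has been driven close to the target by many small, independently computed updates, which is the essence of the asynchronous scheme.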

Note that while the parameters $\theta$ of the policy and $\theta_{v}$ of the value function are shown as separate for generality, in practice some of them are always shared. Typically a convolutional neural network is used that has one softmax output for the policy $\pi\left(a_{t}\mid s_{t}; \theta\right)$ and one linear output for the value function $V\left(s_{t}; \theta_{v}\right)$, with all non-output layers shared.
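That head structure can be sketched with a single fully-connected layer standing in for the convolutional trunk; all sizes, names, and the ReLU choice are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: 8 input features, 16 shared hidden units, 4 actions
n_features, n_hidden, n_actions = 8, 16, 4
W_shared = rng.normal(scale=0.1, size=(n_features, n_hidden))  # shared trunk
W_policy = rng.normal(scale=0.1, size=(n_hidden, n_actions))   # policy head
W_value = rng.normal(scale=0.1, size=(n_hidden, 1))            # value head

def forward(state):
    h = np.maximum(0.0, state @ W_shared)   # shared non-output layer (ReLU)
    logits = h @ W_policy
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # softmax output: pi(a | s)
    value = (h @ W_value).item()            # linear output: V(s)
    return probs, value

probs, value = forward(rng.normal(size=n_features))
```

Both outputs are computed from the same hidden activations `h`, so gradients from the policy and value losses both flow into the shared trunk.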

Source: Asynchronous Methods for Deep Reinforcement Learning

Latest Papers

Visual Explanation using Attention Mechanism in Actor-Critic-based Deep Reinforcement Learning
Hidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura
A review of motion planning algorithms for intelligent robotics
Chengmin Zhou, Bingding Huang, Pasi Fränti
Asynchronous Advantage Actor Critic: Non-asymptotic Analysis and Linear Speedup
Han Shen, Kaiqing Zhang, Mingyi Hong, Tianyi Chen
Dynamic Scheduling for Stochastic Edge-Cloud Computing Environments using A3C learning and Residual Recurrent Neural Networks
Shreshth Tuli, Shashikant Ilager, Kotagiri Ramamohanarao, Rajkumar Buyya
Lagrangian Duality in Reinforcement Learning
Pranay Pasula
Adaptive Discretization for Continuous Control using Particle Filtering Policy Network
Pei Xu, Ioannis Karamouzas
Explore and Exploit with Heterotic Line Bundle Models
Magdalena Larfors, Robin Schneider
Fully Asynchronous Policy Evaluation in Distributed Reinforcement Learning over Networks
Xingyu Sha, Jia-Qi Zhang, Keyou You, Kaiqing Zhang, Tamer Başar
A Visual Communication Map for Multi-Agent Deep Reinforcement Learning
Ngoc Duy Nguyen, Thanh Thi Nguyen, Doug Creighton, Saeid Nahavandi
Intelligent Roundabout Insertion using Deep Reinforcement Learning
Alessandro Paolo Capasso, Giulio Bacchiani, Daniele Molinari
Intelligent Coordination among Multiple Traffic Intersections Using Multi-Agent Reinforcement Learning
Ujwal Padam Tewari, Vishal Bidawatka, Varsha Raveendran, Vinay Sudhakaran, Shreedhar Kodate Shreeshail, Jayanth Prakash Kulkarni
Adversary A3C for Robust Reinforcement Learning
Zhaoyuan Gu, Zhenzhong Jia, Howie Choset
Learning Reward Machines for Partially Observable Reinforcement Learning
Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, Sheila McIlraith
VUSFA: Variational Universal Successor Features Approximator to Improve Transfer DRL for Target Driven Visual Navigation
Shamane Siriwardhana, Rivindu Weerasakera, Denys J. C. Matthies, Suranga Nanayakkara
Incremental Reinforcement Learning --- a New Continuous Reinforcement Learning Frame Based on Stochastic Differential Equation methods
Tianhao Chen, Limei Cheng, Yang Liu, Wenchuan Jia, Shugen Ma
Terminal Prediction as an Auxiliary Task for Deep Reinforcement Learning
Bilal Kartal, Pablo Hernandez-Leal, Matthew E. Taylor
Agent Modeling as Auxiliary Task for Deep Reinforcement Learning
Pablo Hernandez-Leal, Bilal Kartal, Matthew E. Taylor
Jointly Pre-training with Supervised, Autoencoder, and Value Losses for Deep Reinforcement Learning
Gabriel V. de la Cruz Jr., Yunshu Du, Matthew E. Taylor
Combinational Q-Learning for Dou Di Zhu
Yang You, Liangwei Li, Baisong Guo, Weiming Wang, Cewu Lu
Using Monte Carlo Tree Search as a Demonstrator within Asynchronous Deep RL
Bilal Kartal, Pablo Hernandez-Leal, Matthew E. Taylor
Single-Agent Policy Tree Search With Guarantees
Laurent Orseau, Levi H. S. Lelis, Tor Lattimore, Théophane Weber
Gradient Band-based Adversarial Training for Generalized Attack Immunity of A3C Path Finding
Tong Chen, Wenjia Niu, Yingxiao Xiang, Xiaoxuan Bai, Jiqiang Liu, Zhen Han, Gang Li
Crawling in Rogue's dungeons with (partitioned) A3C
Andrea Asperti, Daniele Cortesi, Francesco Sovrano
A Brandom-ian view of Reinforcement Learning towards strong-AI
Atrisha Sarkar
Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning
Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O. Stanley, Jeff Clune
Natural Value Approximators: Learning when to Trust Past Estimates
Zhongwen Xu, Joseph Modayil, Hado P. Van Hasselt, Andre Barreto, David Silver, Tom Schaul
Teaching a Machine to Read Maps with Deep Reinforcement Learning
Gino Brunner, Oliver Richter, Yuyi Wang, Roger Wattenhofer
Improving Search through A3C Reinforcement Learning based Conversational Agent
Milan Aggarwal, Aarushi Arora, Shagun Sodhani, Balaji Krishnamurthy
DARLA: Improving Zero-Shot Transfer in Reinforcement Learning
Irina Higgins, Arka Pal, Andrei A. Rusu, Loic Matthey, Christopher P. Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, Alexander Lerchner
Noisy Networks for Exploration
Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg
Learning to Factor Policies and Action-Value Functions: Factored Action Space Representations for Deep Reinforcement Learning
Sahil Sharma, Aravind Suresh, Rahul Ramesh, Balaraman Ravindran
Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning
Nat Dilokthanakul, Christos Kaplanis, Nick Pawlowski, Murray Shanahan
Equivalence Between Policy Gradients and Soft Q-Learning
John Schulman, Xi Chen, Pieter Abbeel
The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning
Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, Remi Munos
Tactics of Adversarial Attack on Deep Reinforcement Learning Agents
Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, Min Sun
Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU
Mohammad Babaeizadeh, Iuri Frosio, Stephen Tyree, Jason Clemons, Jan Kautz
Asynchronous Methods for Deep Reinforcement Learning
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu