The Remarkable Effectiveness of Combining Policy and Value Networks in A*-based Deep RL for AI Planning

29 Sep 2021 · Dieqiao Feng, Carla P. Gomes, Bart Selman

Despite the tremendous success of traditional backtrack-style combinatorial search methods in NP-complete domains such as SAT and CSP, and of deep reinforcement learning (RL) in two-player games such as Go, PSPACE-hard AI planning has remained out of reach for current AI planning systems. Even carefully designed domain-specific solvers fail quickly on hard instances due to the exponential combinatorial search space. Recent work on deep-learning-guided search, which combines traditional search methods such as A* and MCTS with heuristic predictions from deep neural networks, has shown promising progress. These methods can solve a significant number of hard planning instances that are beyond the reach of specialized solvers. To better understand why these approaches work, we study the interplay of the policy and value networks in A*-based deep RL and show the surprising effectiveness of the policy network, further enhanced by the value network, as a guiding heuristic for A*. To further understand this phenomenon, we study the cost distributions of deep planners and find that planning instances can have heavy-tailed runtime distributions, with tails on both the right-hand and left-hand sides. In particular, for the first time, we show the existence of left heavy tails and propose a theoretical model that explains their appearance. We provide extensive experimental data supporting our model. The experiments show the critical role of the policy network as a powerful heuristic guiding A*, which can lead to left tails with polynomial scaling by avoiding the exploration of exponential-size subtrees early in the search. Our results also demonstrate the importance of random restart strategies, which are widely used in traditional combinatorial solvers, for helping deep reinforcement learning and deep AI planning systems avoid left and right heavy tails.
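
The abstract describes the mechanism only at a high level. Below is a minimal sketch of how a policy network and a value network can guide an A*-style best-first search, wrapped in the kind of randomized-restart loop the abstract argues for. The callables `policy_net`, `value_net`, `successors`, and `is_goal`, as well as the budget and restart parameters, are hypothetical placeholders for a concrete planning domain and trained models; this is not the authors' implementation.

```python
# Hedged sketch: A*-style search where the value network supplies the heuristic
# h(n) and the policy network prunes expansion to its top-k actions, wrapped in
# randomized restarts. All domain- and model-specific pieces are placeholders.
import heapq
import random
from typing import Callable, Hashable, List, Optional, Tuple

State = Hashable


def guided_astar(
    start: State,
    is_goal: Callable[[State], bool],
    successors: Callable[[State], List[Tuple[State, float]]],  # (next_state, step_cost)
    policy_net: Callable[[State], List[float]],  # action prior, aligned with successors()
    value_net: Callable[[State], float],         # learned cost-to-go estimate h(n)
    node_budget: int,
    top_k: int = 4,                               # expand only the k most probable actions
) -> Optional[List[State]]:
    """Best-first search with f(n) = g(n) + value_net(n).

    Restricting expansion to the policy network's top-k actions is what keeps
    the search from committing early to exponential-size subtrees."""
    frontier = [(value_net(start), random.random(), start, 0.0, [start])]
    best_g = {start: 0.0}
    expanded = 0
    while frontier and expanded < node_budget:
        _f, _tie, state, g, path = heapq.heappop(frontier)
        expanded += 1
        if is_goal(state):
            return path
        children = successors(state)
        priors = policy_net(state)  # assumed to be ordered like `children`
        ranked = sorted(zip(priors, children), key=lambda pc: -pc[0])[:top_k]
        for _, (child, step_cost) in ranked:
            new_g = g + step_cost
            if new_g < best_g.get(child, float("inf")):
                best_g[child] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + value_net(child), random.random(), child, new_g, path + [child]),
                )
    return None  # budget exhausted: candidate for a restart


def solve_with_restarts(start, is_goal, successors, policy_net, value_net,
                        node_budget=10_000, max_restarts=20, seed=0):
    """Randomized restarts: rerun the budgeted search with a fresh seed so that
    heap tie-breaking differs across runs, cutting off long (right-tail) runs."""
    for attempt in range(max_restarts):
        random.seed(seed + attempt)
        plan = guided_astar(start, is_goal, successors, policy_net, value_net, node_budget)
        if plan is not None:
            return plan
    return None
```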
