Faster Non-asymptotic Convergence for Double Q-learning

NeurIPS 2021 · Lin Zhao, Huaqing Xiong, Yingbin Liang

Double Q-learning (van Hasselt, 2010) has gained significant success in practice due to its effectiveness in overcoming the overestimation issue of Q-learning. However, the theoretical understanding of double Q-learning is rather limited. The only existing finite-time analysis was recently established in (Xiong et al., 2020), where the polynomial learning rate adopted in the analysis typically yields a slower convergence rate. This paper tackles the more challenging case of a constant learning rate, and develops new analytical tools that improve the existing convergence rate by orders of magnitude. Specifically, we show that synchronous double Q-learning attains an $\epsilon$-accurate global optimum with a time complexity of $\tilde{\Omega}\left(\frac{\ln D}{(1-\gamma)^7\epsilon^2} \right)$, and the asynchronous algorithm achieves a time complexity of $\tilde{\Omega}\left(\frac{L}{(1-\gamma)^7\epsilon^2} \right)$, where $D$ is the cardinality of the state-action space, $\gamma$ is the discount factor, and $L$ is a parameter related to the sampling strategy for asynchronous double Q-learning. These results improve the existing convergence rate by orders of magnitude in terms of its dependence on all major parameters $(\epsilon, 1-\gamma, D, L)$. This paper presents a substantial step toward the full understanding of the fast convergence of double Q-learning.
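For readers unfamiliar with the algorithm being analyzed, the following is a minimal sketch of synchronous double Q-learning with a constant learning rate, the setting studied in the paper. It is illustrative only, not the paper's exact procedure: the function name `synchronous_double_q`, the generative-model interface `env_sample(s, a)` (returning a sampled reward and next state), and all parameter names are hypothetical choices for this sketch.

```python
import numpy as np

def synchronous_double_q(env_sample, num_states, num_actions, gamma,
                         alpha, num_iters, rng=None):
    """Sketch of synchronous double Q-learning with constant step size alpha.

    Two estimators Q^A and Q^B are maintained; each update selects the
    greedy action with one estimator and evaluates it with the other,
    which is what mitigates Q-learning's overestimation bias.
    """
    rng = rng or np.random.default_rng()
    q_a = np.zeros((num_states, num_actions))
    q_b = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        # Synchronous setting: every (s, a) pair receives a fresh sample
        # from the generative model at every iteration.
        for s in range(num_states):
            for a in range(num_actions):
                r, s_next = env_sample(s, a)  # assumed sampling interface
                if rng.random() < 0.5:
                    # Update Q^A: action chosen by Q^A, value from Q^B.
                    a_star = int(np.argmax(q_a[s_next]))
                    target = r + gamma * q_b[s_next, a_star]
                    q_a[s, a] += alpha * (target - q_a[s, a])
                else:
                    # Update Q^B: action chosen by Q^B, value from Q^A.
                    b_star = int(np.argmax(q_b[s_next]))
                    target = r + gamma * q_a[s_next, b_star]
                    q_b[s, a] += alpha * (target - q_b[s, a])
    return q_a, q_b
```

A constant step size, as opposed to the polynomial learning rate of the earlier analysis, is the regime the paper's improved bounds cover; in general, the choice of $\alpha$ trades off convergence speed against the size of the residual error, which the paper's analysis tracks to reach $\epsilon$-accuracy.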
