Stochastic exploration is key to the success of the Deep Q-network (DQN) algorithm. However, most existing stochastic exploration approaches either explore actions heuristically regardless of their Q-values or couple the sampling with Q-values, which inevitably introduces bias into the learning process. In this paper, we propose a novel preference-guided $\epsilon$-greedy exploration algorithm that efficiently facilitates exploration for DQN without introducing additional bias. Specifically, we design a dual architecture consisting of two branches: one is a copy of the DQN, namely the Q-branch; the other, which we call the preference branch, learns the action preference that the DQN implicitly follows. We theoretically prove that the policy improvement theorem holds for the preference-guided $\epsilon$-greedy policy and experimentally show that the inferred action preference distribution aligns with the landscape of the corresponding Q-values. Intuitively, preference-guided $\epsilon$-greedy exploration motivates the DQN agent to take diverse actions, so that actions with larger Q-values are sampled more frequently while those with smaller Q-values still have a chance to be explored, thus encouraging exploration. We comprehensively evaluate the proposed method by benchmarking it against well-known DQN variants in nine different environments. Extensive results confirm the superiority of our method in terms of both performance and convergence speed. A demonstration video and the source code are available at \url{https://github.com/OscarHuangWind/Preference-Guided-DQN-Atari}.
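
To make the idea concrete, below is a minimal sketch of how a dual-branch network and preference-guided $\epsilon$-greedy action selection could look in PyTorch. All class and function names, layer sizes, and the use of a softmax over the preference branch are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
import torch
import torch.nn as nn


class DualBranchDQN(nn.Module):
    """Sketch of a dual-branch architecture: a shared encoder feeding a
    Q-branch (action values) and a preference branch (action distribution).
    Shapes assume 4x84x84 Atari frames; all sizes are illustrative."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.q_branch = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, num_actions)
        )
        self.pref_branch = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, num_actions)
        )

    def forward(self, obs: torch.Tensor):
        z = self.encoder(obs)
        q_values = self.q_branch(z)                                # Q(s, a)
        preference = torch.softmax(self.pref_branch(z), dim=-1)    # pi_pref(a | s)
        return q_values, preference


def preference_guided_epsilon_greedy(q_values, preference, epsilon: float) -> int:
    """With probability 1 - epsilon act greedily w.r.t. Q; with probability
    epsilon sample from the learned preference distribution instead of the
    uniform distribution used by vanilla epsilon-greedy."""
    if torch.rand(1).item() < epsilon:
        return int(torch.multinomial(preference, num_samples=1).item())
    return int(q_values.argmax(dim=-1).item())
```

In this sketch, the only change relative to vanilla $\epsilon$-greedy is that exploratory actions are drawn from the preference branch rather than uniformly, so actions with larger Q-values tend to be explored more often while low-value actions retain nonzero probability.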
