no code implementations • 21 Feb 2022 • Shuqing Shi, Xiaobin Wang, Zhiyou Yang, Fan Zhang, Hong Qu
This algorithm achieves a total regret bound of $\tilde{\mathcal{O}}(D\sqrt{SAT})$ over a time horizon $T$ with $S$ states, $A$ actions, and diameter $D$.
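To give a feel for how this bound scales, the short sketch below evaluates only the leading term $D\sqrt{SAT}$. It drops the constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$, so the numbers show growth rates and are not actual regret values. The function name `regret_bound_scale` and the problem sizes are hypothetical, chosen for illustration.

```python
import math


def regret_bound_scale(D, S, A, T):
    """Leading term D * sqrt(S * A * T) of the bound.

    Constants and log factors hidden by the tilde-O notation are ignored,
    so this is a scaling illustration, not a regret prediction.
    """
    return D * math.sqrt(S * A * T)


# Hypothetical problem sizes, not taken from the paper.
D, S, A = 4.0, 10, 5

for T in (10_000, 40_000):
    bound = regret_bound_scale(D, S, A, T)
    # Total regret grows like sqrt(T), so regret per step shrinks like 1/sqrt(T).
    print(f"T={T}: leading term={bound:.1f}, per-step={bound / T:.4f}")
```

Under these assumptions, quadrupling $T$ doubles the leading term, so the per-step term is halved. This is the sense in which a $\sqrt{T}$ bound is sublinear.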
Thompson Sampling