Optimistic Policy Optimization with General Function Approximations
Although policy optimization with neural networks has a track record of achieving state-of-the-art results in reinforcement learning on various domains, the theoretical understanding of its computational and sample efficiency remains restricted to linear function approximations with finite-dimensional feature representations. This gap hinders the design of principled, effective, and efficient algorithms. To this end, we propose an optimistic policy optimization algorithm that allows general function approximations while incorporating exploration. In the episodic setting, we establish a $\sqrt{T}$-regret bound that scales polynomially in the eluder dimension of the general model class, where $T$ is the number of steps taken by the agent. In particular, we specialize this regret bound to two nonparametric model classes: one based on reproducing kernel Hilbert spaces and another based on overparameterized neural networks.