no code implementations • 4 Aug 2023 • Jonatha Anselmi, Bruno Gaujal, Louis-Sébastien Rebuffi
While reinforcement learning in Partially Observable Markov Decision Processes (POMDPs) is prohibitively expensive in general, we show that our algorithm has a regret that depends only sub-linearly on the maximal number of jobs in the network, $S$.
no code implementations • 21 Feb 2023 • Jonatha Anselmi, Bruno Gaujal, Louis-Sébastien Rebuffi
In our main result, however, we exploit the structure of our MDPs to show that the regret of a slightly tweaked version of the classical learning algorithm {\sc Ucrl2} is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{E_2AT})$, where $E_2$ is related to the weighted second moment of the stationary measure of a reference policy.
no code implementations • 13 Jan 2023 • Romain Cravic, Nicolas Gast, Bruno Gaujal
We propose the first model-free algorithm that achieves low regret performance for decentralized learning in two-player zero-sum tabular stochastic games with infinite-horizon average-reward objective.
no code implementations • 16 Jun 2021 • Nicolas Gast, Bruno Gaujal, Kimang Khun
While the regret bound and runtime of vanilla implementations of PSRL and UCRL2 are exponential in the number of bandits, we show that the episodic regret of MB-PSRL and MB-UCRL2 is $\tilde{O}(S\sqrt{nK})$, where $K$ is the number of episodes, $n$ is the number of bandits and $S$ is the number of states of each bandit (the exact bound in $S$, $n$ and $K$ is given in the paper).
no code implementations • 16 Dec 2020 • Nicolas Gast, Bruno Gaujal, Chen Yan
In this paper we show that, under the same conditions, the convergence rate is exponential in the number of bandits, unless the fixed point is singular (to be defined later).
Performance • Optimization and Control • Probability
no code implementations • 9 Mar 2013 • Pierre Coucheney, Bruno Gaujal, Panayotis Mertikopoulos
Starting from a heuristic learning scheme for N-person games, we derive a new class of continuous-time learning dynamics consisting of a replicator-like drift adjusted by a penalty term that renders the boundary of the game's strategy space repelling.
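To make the "replicator drift plus boundary-repelling penalty" idea concrete, here is a minimal numerical sketch. It is not the paper's exact dynamics: the penalty function, the entropic choice, and the step sizes below are illustrative assumptions. An entropic penalty is used because its gradient diverges as any strategy weight approaches zero, which is one way to render the boundary of the simplex repelling.

```python
import numpy as np

def replicator_penalty_step(x, payoff, eps=0.05, dt=0.01):
    """One Euler step of replicator-like dynamics with an entropic
    penalty (illustrative choice, not the paper's exact scheme).
    The penalty gradient log(x_i) - sum_j x_j log(x_j) blows up near
    the boundary of the simplex, pushing trajectories back inside."""
    v = payoff @ x                   # payoff of each pure strategy
    avg = x @ v                      # population-average payoff
    drift = x * (v - avg)            # standard replicator drift
    ent = np.log(x) - x @ np.log(x)  # entropic penalty gradient
    dx = drift - eps * x * ent       # penalty-adjusted dynamics
    x_new = x + dt * dx
    return x_new / x_new.sum()       # renormalize after discretization

# Rock-paper-scissors payoffs: plain replicator dynamics cycles here,
# while the penalized dynamics stays in the interior of the simplex.
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])
x = np.array([0.8, 0.15, 0.05])
for _ in range(1000):
    x = replicator_penalty_step(x, A)
```

After 1000 steps the state remains a fully mixed strategy: every component stays strictly positive even though the initial point is close to a face of the simplex.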