no code implementations • 17 Mar 2021 • Lin Chen, Bruno Scherrer, Peter L. Bartlett
In this regime, for any $q\in[\gamma^{2}, 1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$ and it requires $\Omega\left(\frac{d}{\gamma^{2}\left(q-\gamma^{2}\right)\varepsilon^{2}}\exp\left(\Theta\left(d\gamma^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$.
no code implementations • NeurIPS 2020 • Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Remi Munos, Matthieu Geist
Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance.
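The core object behind such KL-regularized methods can be illustrated with a minimal sketch (assumed notation, not the paper's exact algorithm): the new policy maximizes the expected Q-value minus a temperature-scaled KL divergence to the previous policy, which has a closed form proportional to `pi_old * exp(Q / tau)`.

```python
import numpy as np

# KL-regularized greedy step: pi_new = argmax_pi <pi, Q(s,.)> - tau * KL(pi || pi_old),
# whose closed form is pi_new proportional to pi_old * exp(Q / tau).
def kl_regularized_step(pi_old, q, tau):
    logits = np.log(pi_old) + q / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)

pi = np.full((3, 2), 0.5)                           # uniform policy, 3 states, 2 actions
q = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # toy Q-values, made up
pi = kl_regularized_step(pi, q, tau=0.5)            # sharpened towards high-Q actions
```

As `tau` shrinks the update approaches the plain greedy step; as it grows the policy stays anchored to `pi_old`.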
no code implementations • 31 Mar 2020 • Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist
Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance.
no code implementations • 21 Oct 2019 • Nino Vieillard, Bruno Scherrer, Olivier Pietquin, Matthieu Geist
We adapt the concept of momentum from optimization to reinforcement learning.
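One way such a momentum idea can look in practice (an assumed form for illustration, not the paper's exact scheme): instead of acting greedily on the latest Q estimate, act greedily on a running average of all past Q estimates.

```python
import numpy as np

# Momentum-flavored value iteration on a toy MDP (MDP made up for illustration).
# P[a, s, s'] is the transition matrix for action a; R[s, a] is the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, n_states = 0.9, 2

q = np.zeros_like(R)
h = np.zeros_like(R)
for k in range(1, 500):
    v = q[np.arange(n_states), h.argmax(axis=1)]   # greedy w.r.t. the average h
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # Bellman backup
    h += (q - h) / k                               # running (Cesaro) average of Q's
```

Here the greedy policy is read off the averaged estimate `h` rather than the noisy latest `q`, which is the momentum flavor.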
no code implementations • 31 Jan 2019 • Matthieu Geist, Bruno Scherrer, Olivier Pietquin
Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence.
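As a concrete instance of the entropy-regularized case, here is a minimal "soft" value iteration sketch (toy MDP and temperature made up for illustration): the max in the Bellman backup is replaced by a temperature-scaled log-sum-exp, and the resulting policy is a softmax over Q-values.

```python
import numpy as np

# Soft (entropy-regularized) value iteration on a toy MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, tau = 0.9, 0.1

v = np.zeros(2)
for _ in range(500):
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    v = tau * np.log(np.exp(q / tau).sum(axis=1))   # soft max (log-sum-exp)
pi = np.exp((q - v[:, None]) / tau)                 # softmax policy
```

As `tau` goes to 0 the log-sum-exp reduces to the hard max and the softmax policy becomes greedy.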
no code implementations • NeurIPS 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control.
no code implementations • 25 Sep 2018 • Matthieu Geist, Bruno Scherrer
Anderson acceleration is an old and simple method for accelerating the computation of a fixed point.
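The basic Anderson scheme can be sketched in a few lines (a generic textbook form, not necessarily the variant studied in the paper): combine the last few iterates with weights that minimize the norm of the combined residual, subject to the weights summing to one.

```python
import numpy as np

def anderson(g, x0, m=5, iters=50, tol=1e-10):
    """Anderson acceleration for the fixed point x = g(x) (basic sketch)."""
    xs = [np.asarray(x0, float)]
    gs = []
    for _ in range(iters):
        gx = g(xs[-1])
        if np.linalg.norm(gx - xs[-1]) < tol:
            return gx
        gs.append(gx)
        # Residuals f_i = g(x_i) - x_i over the last m iterates.
        F = np.array([gi - xi for gi, xi in zip(gs[-m:], xs[-m:])])
        # Minimize ||alpha @ F|| s.t. sum(alpha) = 1, via a bordered
        # normal-equations system (lstsq tolerates near-singular F F^T).
        k = F.shape[0]
        A = np.block([[F @ F.T, np.ones((k, 1))],
                      [np.ones((1, k)), np.zeros((1, 1))]])
        b = np.zeros(k + 1); b[-1] = 1.0
        alpha = np.linalg.lstsq(A, b, rcond=None)[0][:k]
        xs.append(alpha @ np.array(gs[-m:]))
    return xs[-1]

# Fixed point of the contracting affine map g(x) = 0.5 x + 1, i.e. x* = 2.
x_star = anderson(lambda x: 0.5 * x + np.ones_like(x), np.zeros(3))
```

On affine maps like this one, the extrapolation lands on the fixed point after only a couple of iterations, whereas plain fixed-point iteration converges geometrically.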
no code implementations • 6 Sep 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success.
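An h-step lookahead policy of the kind discussed above can be sketched as follows (toy MDP made up for illustration): apply h Bellman optimality backups on top of a terminal value estimate, and act with the maximizing first action.

```python
import numpy as np

# h-step lookahead policy on a toy MDP. P[a, s, s'] are transitions, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

def lookahead_policy(v0, h):
    v = v0
    for _ in range(h - 1):                          # h-1 inner optimality backups
        v = (R + gamma * np.einsum('ast,t->sa', P, v)).max(axis=1)
    q = R + gamma * np.einsum('ast,t->sa', P, v)    # final backup keeps Q per action
    return q.argmax(axis=1)

policy = lookahead_policy(np.zeros(2), h=3)
```

With h = 1 this is the usual one-step greedy policy; larger h trades more computation per decision for a policy that is greedy with respect to a more accurate value.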
no code implementations • ICML 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation.
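The two alternating steps can be sketched in tabular form (toy MDP made up for illustration): evaluation solves a linear system for the current policy's value, and improvement acts greedily with respect to the resulting Q-values.

```python
import numpy as np

# Tabular Policy Iteration on a toy 2-state, 2-action MDP.
# P[a, s, s'] is the transition matrix for action a; R[s, a] is the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

policy = np.zeros(n_states, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
    P_pi = P[policy, np.arange(n_states)]
    r_pi = R[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: greedy with respect to Q(s, a).
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                              # policy is stable, hence optimal
    policy = new_policy
```

The loop terminates in finitely many steps because there are finitely many deterministic policies and each improvement step strictly increases the value.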
no code implementations • 21 May 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control.
no code implementations • 10 Feb 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation.
no code implementations • 13 May 2014 • Manel Tagorti, Bruno Scherrer
We consider LSTD($\lambda$), the least-squares temporal-difference algorithm with eligibility traces proposed by Boyan (2002).
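The algorithm can be sketched on a toy problem (the chain and features below are made up for illustration): accumulate an eligibility trace of features, build the LSTD matrix and vector along one trajectory, then solve a single linear system for the value-function weights.

```python
import numpy as np

# LSTD(lambda) with tabular (one-hot) features on a deterministic two-state
# chain s0 -> s1 -> s0 -> ..., with reward 1 in s0 and 0 in s1.
gamma, lam, T = 0.9, 0.5, 2000
phi = np.eye(2)                         # one-hot features
A = np.zeros((2, 2))
b = np.zeros(2)
z = np.zeros(2)
s = 0
for t in range(T):
    s_next = 1 - s
    r = 1.0 if s == 0 else 0.0
    z = gamma * lam * z + phi[s]        # eligibility trace update
    A += np.outer(z, phi[s] - gamma * phi[s_next])
    b += z * r
    s = s_next
theta = np.linalg.solve(A, b)           # value estimates for (s0, s1)
```

On this chain the true values are $V(s_0) = 1/(1-\gamma^2)$ and $V(s_1) = \gamma/(1-\gamma^2)$, and with full-rank tabular features the LSTD($\lambda$) solution recovers them (up to finite-trajectory error) for any $\lambda$.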
no code implementations • 12 May 2014 • Bruno Scherrer
2) PSDP$_\infty$ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API.
no code implementations • NeurIPS 2013 • Victor Gabillon, Mohammad Ghavamzadeh, Bruno Scherrer
A close look at the literature on this game shows that ADP algorithms, which have been (almost) entirely based on approximating the value function, have performed poorly in Tetris. In contrast, methods that search directly in the space of policies by learning the policy parameters with a black-box optimizer, such as the cross-entropy (CE) method, have achieved the best reported results.
no code implementations • 6 Jun 2013 • Bruno Scherrer, Matthieu Geist
Local Policy Search is a popular reinforcement learning approach for handling large state spaces.
no code implementations • 3 Jun 2013 • Bruno Scherrer
We then describe an algorithm, Non-Stationary Direct Policy Iteration (NSDPI), that can either be seen as 1) an adaptation of Policy Search by Dynamic Programming (Bagnell et al., 2003) to the infinite-horizon setting or 2) a simplified version of the Non-Stationary PI with growing period of Scherrer and Lesner (2012).
no code implementations • NeurIPS 2013 • Bruno Scherrer
We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage.
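The two switching rules differ only in the improvement step, as the following tabular sketch shows (toy MDP made up for illustration): Howard's PI updates every state whose advantage is positive, while Simplex-PI updates only the single state with the largest advantage.

```python
import numpy as np

# Howard's PI vs Simplex-PI on a toy MDP. P[a, s, s'] transitions, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, S = 0.9, 2

def evaluate(policy):
    P_pi = P[policy, np.arange(S)]
    r_pi = R[np.arange(S), policy]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def iterate(rule):
    policy = np.zeros(S, dtype=int)
    for _ in range(100):
        v = evaluate(policy)
        adv = R + gamma * np.einsum('ast,t->sa', P, v) - v[:, None]  # advantage
        if adv.max() <= 1e-12:
            return policy, v                  # no positive advantage: optimal
        if rule == 'howard':
            better = adv.max(axis=1) > 1e-12  # switch all improvable states
            policy = np.where(better, adv.argmax(axis=1), policy)
        else:                                 # simplex: switch one state only
            s = adv.max(axis=1).argmax()
            policy = policy.copy()
            policy[s] = adv[s].argmax()
    return policy, v

p_howard, _ = iterate('howard')
p_simplex, _ = iterate('simplex')
```

Both rules reach the same optimal policy; they differ in how many iterations they may take, which is exactly what the complexity bounds in the paper compare.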
no code implementations • 20 Apr 2013 • Boris Lesner, Bruno Scherrer
For this algorithm we provide an error propagation analysis in the form of a performance bound of the resulting policies that can improve the usual performance bound by a factor $O(1-\gamma)$, which is significant when the discount factor $\gamma$ is close to 1.
no code implementations • 15 Apr 2013 • Matthieu Geist, Bruno Scherrer
In the framework of Markov Decision Processes, off-policy learning is the problem of learning a linear approximation of the value function of some fixed policy from a single trajectory, possibly generated by some other policy.
no code implementations • NeurIPS 2012 • Bruno Scherrer, Boris Lesner
We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy.
no code implementations • 14 May 2012 • Bruno Scherrer, Victor Gabillon, Mohammad Ghavamzadeh, Matthieu Geist
Modified Policy Iteration (MPI) is a dynamic programming (DP) algorithm that includes the two celebrated Policy Iteration and Value Iteration methods as special cases.
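The MPI scheme can be sketched in tabular form (toy MDP made up for illustration): a greedy improvement step followed by m applications of the Bellman operator $T_\pi$ as a partial evaluation, so that m = 1 recovers Value Iteration and m → ∞ recovers Policy Iteration.

```python
import numpy as np

# Tabular Modified Policy Iteration. P[a, s, s'] transitions, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, S = 0.9, 2

def mpi(m, iters=200):
    v = np.zeros(S)
    for _ in range(iters):
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        policy = q.argmax(axis=1)                  # greedy improvement
        for _ in range(m):                         # m-step partial evaluation
            r_pi = R[np.arange(S), policy]
            P_pi = P[policy, np.arange(S)]
            v = r_pi + gamma * P_pi @ v            # one application of T_pi
    return policy, v

policy, v = mpi(m=5)
```

Intermediate values of m trade the cheap but slow backups of Value Iteration against the expensive but fast exact solves of Policy Iteration.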
no code implementations • NeurIPS 2008 • Marek Petrik, Bruno Scherrer
We thus propose another justification: when the rewards are received only sporadically (as it is the case in Tetris), we can derive tighter bounds, which support a significant performance increase with a decrease in the discount factor.