no code implementations • 17 Mar 2021 • Lin Chen, Bruno Scherrer, Peter L. Bartlett
In this regime, for any $q\in[\gamma^{2}, 1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$ and it requires $\Omega\left(\frac{d}{\gamma^{2}\left(q-\gamma^{2}\right)\varepsilon^{2}}\exp\left(\Theta\left(d\gamma^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$.
no code implementations • NeurIPS 2020 • Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Remi Munos, Matthieu Geist
Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance.
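The core object behind such KL-regularized methods can be illustrated with a minimal sketch (assumed notation, not the paper's exact algorithm): the new policy maximizes the expected Q-value minus a temperature-scaled KL divergence to the previous policy, which has a closed form proportional to `pi_old * exp(Q / tau)`.

```python
import numpy as np

# KL-regularized greedy step: pi_new = argmax_pi <pi, Q(s,.)> - tau * KL(pi || pi_old),
# whose closed form is pi_new proportional to pi_old * exp(Q / tau).
def kl_regularized_step(pi_old, q, tau):
    logits = np.log(pi_old) + q / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)

pi = np.full((3, 2), 0.5)                           # uniform policy, 3 states, 2 actions
q = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # toy Q-values, made up
pi = kl_regularized_step(pi, q, tau=0.5)            # sharpened towards high-Q actions
```

As `tau` shrinks the update approaches the plain greedy step; as it grows the policy stays anchored to `pi_old`.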
no code implementations • 31 Mar 2020 • Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist
Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance.
no code implementations • 21 Oct 2019 • Nino Vieillard, Bruno Scherrer, Olivier Pietquin, Matthieu Geist
We adapt the concept of momentum from optimization to reinforcement learning.
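One way such a momentum idea can look in practice (an assumed form for illustration, not the paper's exact scheme): instead of acting greedily on the latest Q estimate, act greedily on a running average of all past Q estimates.

```python
import numpy as np

# Momentum-flavored value iteration on a toy MDP (MDP made up for illustration).
# P[a, s, s'] is the transition matrix for action a; R[s, a] is the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, n_states = 0.9, 2

q = np.zeros_like(R)
h = np.zeros_like(R)
for k in range(1, 500):
    v = q[np.arange(n_states), h.argmax(axis=1)]   # greedy w.r.t. the average h
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # Bellman backup
    h += (q - h) / k                               # running (Cesaro) average of Q's
```

Here the greedy policy is read off the averaged estimate `h` rather than the noisy latest `q`, which is the momentum flavor.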
no code implementations • 31 Jan 2019 • Matthieu Geist, Bruno Scherrer, Olivier Pietquin
Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence.
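As a concrete instance of the entropy-regularized case, here is a minimal "soft" value iteration sketch (toy MDP and temperature made up for illustration): the max in the Bellman backup is replaced by a temperature-scaled log-sum-exp, and the resulting policy is a softmax over Q-values.

```python
import numpy as np

# Soft (entropy-regularized) value iteration on a toy MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, tau = 0.9, 0.1

v = np.zeros(2)
for _ in range(500):
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    v = tau * np.log(np.exp(q / tau).sum(axis=1))   # soft max (log-sum-exp)
pi = np.exp((q - v[:, None]) / tau)                 # softmax policy
```

As `tau` goes to 0 the log-sum-exp reduces to the hard max and the softmax policy becomes greedy.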
no code implementations • NeurIPS 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control.
no code implementations • 25 Sep 2018 • Matthieu Geist, Bruno Scherrer
Anderson acceleration is an old and simple method for accelerating the computation of a fixed point.
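The basic Anderson scheme can be sketched in a few lines (a generic textbook form, not necessarily the variant studied in the paper): combine the last few iterates with weights that minimize the norm of the combined residual, subject to the weights summing to one.

```python
import numpy as np

def anderson(g, x0, m=5, iters=50, tol=1e-10):
    """Anderson acceleration for the fixed point x = g(x) (basic sketch)."""
    xs = [np.asarray(x0, float)]
    gs = []
    for _ in range(iters):
        gx = g(xs[-1])
        if np.linalg.norm(gx - xs[-1]) < tol:
            return gx
        gs.append(gx)
        # Residuals f_i = g(x_i) - x_i over the last m iterates.
        F = np.array([gi - xi for gi, xi in zip(gs[-m:], xs[-m:])])
        # Minimize ||alpha @ F|| s.t. sum(alpha) = 1, via a bordered
        # normal-equations system (lstsq tolerates near-singular F F^T).
        k = F.shape[0]
        A = np.block([[F @ F.T, np.ones((k, 1))],
                      [np.ones((1, k)), np.zeros((1, 1))]])
        b = np.zeros(k + 1); b[-1] = 1.0
        alpha = np.linalg.lstsq(A, b, rcond=None)[0][:k]
        xs.append(alpha @ np.array(gs[-m:]))
    return xs[-1]

# Fixed point of the contracting affine map g(x) = 0.5 x + 1, i.e. x* = 2.
x_star = anderson(lambda x: 0.5 * x + np.ones_like(x), np.zeros(3))
```

On affine maps like this one, the extrapolation lands on the fixed point after only a couple of iterations, whereas plain fixed-point iteration converges geometrically.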
no code implementations • 6 Sep 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success.
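An h-step lookahead policy of the kind discussed above can be sketched as follows (toy MDP made up for illustration): apply h Bellman optimality backups on top of a terminal value estimate, and act with the maximizing first action.

```python
import numpy as np

# h-step lookahead policy on a toy MDP. P[a, s, s'] are transitions, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

def lookahead_policy(v0, h):
    v = v0
    for _ in range(h - 1):                          # h-1 inner optimality backups
        v = (R + gamma * np.einsum('ast,t->sa', P, v)).max(axis=1)
    q = R + gamma * np.einsum('ast,t->sa', P, v)    # final backup keeps Q per action
    return q.argmax(axis=1)

policy = lookahead_policy(np.zeros(2), h=3)
```

With h = 1 this is the usual one-step greedy policy; larger h trades more computation per decision for a policy that is greedy with respect to a more accurate value.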
no code implementations • ICML 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation.
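The two alternating steps can be sketched in tabular form (toy MDP made up for illustration): evaluation solves a linear system for the current policy's value, and improvement acts greedily with respect to the resulting Q-values.

```python
import numpy as np

# Tabular Policy Iteration on a toy 2-state, 2-action MDP.
# P[a, s, s'] is the transition matrix for action a; R[s, a] is the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

policy = np.zeros(n_states, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
    P_pi = P[policy, np.arange(n_states)]
    r_pi = R[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: greedy with respect to Q(s, a).
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                              # policy is stable, hence optimal
    policy = new_policy
```

The loop terminates in finitely many steps because there are finitely many deterministic policies and each improvement step strictly increases the value.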
no code implementations • 21 May 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control.
no code implementations • 10 Feb 2018 • Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation.
no code implementations • 13 May 2014 • Manel Tagorti, Bruno Scherrer
We consider LSTD($\lambda$), the least-squares temporal-difference algorithm with eligibility traces proposed by Boyan (2002).
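The algorithm can be sketched on a toy problem (the chain and features below are made up for illustration): accumulate an eligibility trace of features, build the LSTD matrix and vector along one trajectory, then solve a single linear system for the value-function weights.

```python
import numpy as np

# LSTD(lambda) with tabular (one-hot) features on a deterministic two-state
# chain s0 -> s1 -> s0 -> ..., with reward 1 in s0 and 0 in s1.
gamma, lam, T = 0.9, 0.5, 2000
phi = np.eye(2)                         # one-hot features
A = np.zeros((2, 2))
b = np.zeros(2)
z = np.zeros(2)
s = 0
for t in range(T):
    s_next = 1 - s
    r = 1.0 if s == 0 else 0.0
    z = gamma * lam * z + phi[s]        # eligibility trace update
    A += np.outer(z, phi[s] - gamma * phi[s_next])
    b += z * r
    s = s_next
theta = np.linalg.solve(A, b)           # value estimates for (s0, s1)
```

On this chain the true values are $V(s_0) = 1/(1-\gamma^2)$ and $V(s_1) = \gamma/(1-\gamma^2)$, and with full-rank tabular features the LSTD($\lambda$) solution recovers them (up to finite-trajectory error) for any $\lambda$.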
no code implementations • 12 May 2014 • Bruno Scherrer
2) PSDP$_\infty$ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API.
no code implementations • NeurIPS 2013 • Victor Gabillon, Mohammad Ghavamzadeh, Bruno Scherrer
A close look at the literature on this game shows that ADP algorithms, which have been (almost) entirely based on approximating the value function, have performed poorly in Tetris. In contrast, methods that search directly in the space of policies by learning the policy parameters with a black-box optimizer, such as the cross-entropy (CE) method, have achieved the best reported results.
no code implementations • 6 Jun 2013 • Bruno Scherrer, Matthieu Geist
Local Policy Search is a popular reinforcement learning approach for handling large state spaces.
no code implementations • 3 Jun 2013 • Bruno Scherrer
We then describe an algorithm, Non-Stationary Direct Policy Iteration (NSDPI), that can either be seen as 1) an adaptation of Policy Search by Dynamic Programming (Bagnell et al., 2003) to the infinite-horizon setting or 2) a simplified version of the Non-Stationary PI with growing period of Scherrer and Lesner (2012).
no code implementations • NeurIPS 2013 • Bruno Scherrer
We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage.
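The two switching rules differ only in the improvement step, as the following tabular sketch shows (toy MDP made up for illustration): Howard's PI updates every state whose advantage is positive, while Simplex-PI updates only the single state with the largest advantage.

```python
import numpy as np

# Howard's PI vs Simplex-PI on a toy MDP. P[a, s, s'] transitions, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, S = 0.9, 2

def evaluate(policy):
    P_pi = P[policy, np.arange(S)]
    r_pi = R[np.arange(S), policy]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def iterate(rule):
    policy = np.zeros(S, dtype=int)
    for _ in range(100):
        v = evaluate(policy)
        adv = R + gamma * np.einsum('ast,t->sa', P, v) - v[:, None]  # advantage
        if adv.max() <= 1e-12:
            return policy, v                  # no positive advantage: optimal
        if rule == 'howard':
            better = adv.max(axis=1) > 1e-12  # switch all improvable states
            policy = np.where(better, adv.argmax(axis=1), policy)
        else:                                 # simplex: switch one state only
            s = adv.max(axis=1).argmax()
            policy = policy.copy()
            policy[s] = adv[s].argmax()
    return policy, v

p_howard, _ = iterate('howard')
p_simplex, _ = iterate('simplex')
```

Both rules reach the same optimal policy; they differ in how many iterations they may take, which is exactly what the complexity bounds in the paper compare.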
no code implementations • 20 Apr 2013 • Boris Lesner, Bruno Scherrer
For this algorithm we provide an error propagation analysis in the form of a performance bound of the resulting policies that can improve the usual performance bound by a factor $O(1-\gamma)$, which is significant when the discount factor $\gamma$ is close to 1.
no code implementations • 15 Apr 2013 • Matthieu Geist, Bruno Scherrer
In the framework of Markov Decision Processes, off-policy learning is the problem of learning a linear approximation of the value function of some fixed policy from a single trajectory, possibly generated by some other policy.
no code implementations • NeurIPS 2012 • Bruno Scherrer, Boris Lesner
We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy.
no code implementations • 14 May 2012 • Bruno Scherrer, Victor Gabillon, Mohammad Ghavamzadeh, Matthieu Geist
Modified Policy Iteration (MPI) is a dynamic programming (DP) algorithm that includes the two celebrated Policy Iteration and Value Iteration methods as special cases.
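The MPI scheme can be sketched in tabular form (toy MDP made up for illustration): a greedy improvement step followed by m applications of the Bellman operator $T_\pi$ as a partial evaluation, so that m = 1 recovers Value Iteration and m → ∞ recovers Policy Iteration.

```python
import numpy as np

# Tabular Modified Policy Iteration. P[a, s, s'] transitions, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, S = 0.9, 2

def mpi(m, iters=200):
    v = np.zeros(S)
    for _ in range(iters):
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        policy = q.argmax(axis=1)                  # greedy improvement
        for _ in range(m):                         # m-step partial evaluation
            r_pi = R[np.arange(S), policy]
            P_pi = P[policy, np.arange(S)]
            v = r_pi + gamma * P_pi @ v            # one application of T_pi
    return policy, v

policy, v = mpi(m=5)
```

Intermediate values of m trade the cheap but slow backups of Value Iteration against the expensive but fast exact solves of Policy Iteration.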
no code implementations • NeurIPS 2008 • Marek Petrik, Bruno Scherrer
We thus propose another justification: when the rewards are received only sporadically (as it is the case in Tetris), we can derive tighter bounds, which support a significant performance increase with a decrease in the discount factor.