Search Results for author: Remi Munos

Found 52 papers, 23 papers with code

Fast computation of Nash Equilibria in Imperfect Information Games

no code implementations ICML 2020 Remi Munos, Julien Perolat, Jean-Baptiste Lespiau, Mark Rowland, Bart De Vylder, Marc Lanctot, Finbarr Timbers, Daniel Hennes, Shayegan Omidshafiei, Audrunas Gruslys, Mohammad Gheshlaghi Azar, Edward Lockhart, Karl Tuyls

We introduce and analyze a class of algorithms, called Mirror Ascent against an Improved Opponent (MAIO), for computing Nash equilibria in two-player zero-sum games, both in normal form and in sequential imperfect information form.
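MAIO's improved-opponent construction is specified in the paper; as a hedged illustration of the mirror-ascent family it belongs to, here is a plain multiplicative-weights self-play sketch on a normal-form zero-sum game (matching pennies). The payoff matrix, step size, and iteration count are illustrative, not the paper's:

```python
import numpy as np

# Entropic mirror ascent = multiplicative-weights updates for both players
# of a 2x2 zero-sum game; time-averaged strategies approach the Nash
# equilibrium (1/2, 1/2). This is a generic sketch, not the MAIO algorithm.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # row player's payoff matrix

eta, T = 0.1, 5000
x = np.array([0.9, 0.1])  # row strategy, deliberately off-equilibrium
y = np.array([0.2, 0.8])  # column strategy
x_avg = np.zeros(2)
y_avg = np.zeros(2)

for _ in range(T):
    x_avg += x / T
    y_avg += y / T
    gx = A @ y        # row player's payoff gradient
    gy = A.T @ x      # column player's (to be descended)
    x = x * np.exp(eta * gx)
    x /= x.sum()
    y = y * np.exp(-eta * gy)
    y /= y.sum()

print(x_avg, y_avg)  # both averages close to [0.5, 0.5]
```

The last iterates of plain multiplicative weights cycle in zero-sum games; it is the time averages that converge, which is one motivation for the improved-opponent modifications studied in the paper.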

Human Alignment of Large Language Models through Online Preference Optimisation

no code implementations 13 Mar 2024 Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot

Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm.
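The abstract mentions generating data with a mixture of the online and reference policies. One common construction in this line of work is a geometric mixture of the two; the sketch below illustrates that for categorical policies, with `beta` as a hypothetical mixing parameter rather than the paper's exact scheme:

```python
import numpy as np

def geometric_mixture(pi_online, pi_ref, beta):
    """Geometric mixture of two categorical policies, renormalized.

    beta=0 recovers the online policy, beta=1 the reference policy.
    Illustrative sketch; the exact mixture used by IPO-MD / Nash-MD is
    specified in the paper.
    """
    mix = pi_online ** (1.0 - beta) * pi_ref ** beta
    return mix / mix.sum()

pi_online = np.array([0.7, 0.2, 0.1])
pi_ref = np.array([0.3, 0.3, 0.4])
print(geometric_mixture(pi_online, pi_ref, 0.5))
```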

Fast Rates for Maximum Entropy Exploration

1 code implementation 14 Mar 2023 Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Yunhao Tang, Michal Valko, Pierre Menard

Finally, we apply developed regularization techniques to reduce sample complexity of visitation entropy maximization to $\widetilde{\mathcal{O}}(H^2SA/\varepsilon^2)$, yielding a statistical separation between maximum entropy exploration and reward-free exploration.

Reinforcement Learning (RL)

Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees

1 code implementation 28 Sep 2022 Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Mark Rowland, Michal Valko, Pierre Menard

We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon $H$ with $S$ states, and $A$ actions.

Reinforcement Learning (RL)

Learning in two-player zero-sum partially observable Markov games with perfect recall

no code implementations NeurIPS 2021 Tadashi Kozuno, Pierre Ménard, Remi Munos, Michal Valko

We study the problem of learning a Nash equilibrium (NE) in an extensive game with imperfect information (EGII) through self-play.

Navigating the Landscape of Multiplayer Games

no code implementations 4 May 2020 Shayegan Omidshafiei, Karl Tuyls, Wojciech M. Czarnecki, Francisco C. Santos, Mark Rowland, Jerome Connor, Daniel Hennes, Paul Muller, Julien Perolat, Bart De Vylder, Audrunas Gruslys, Remi Munos

Multiplayer games have long been used as testbeds in artificial intelligence research, aptly referred to as the Drosophila of artificial intelligence.

Planning in entropy-regularized Markov decision processes and games

1 code implementation NeurIPS 2019 Jean-Bastien Grill, Omar Darwiche Domingues, Pierre Menard, Remi Munos, Michal Valko

We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the underlying decision process.

Multiagent Evaluation under Incomplete Information

1 code implementation NeurIPS 2019 Mark Rowland, Shayegan Omidshafiei, Karl Tuyls, Julien Perolat, Michal Valko, Georgios Piliouras, Remi Munos

This paper investigates the evaluation of learned multiagent strategies in the incomplete information setting, which plays a critical role in ranking and training of agents.

Recurrent Experience Replay in Distributed Reinforcement Learning

3 code implementations ICLR 2019 Steven Kapturowski, Georg Ostrovski, Will Dabney, John Quan, Remi Munos

Using a single network architecture and fixed set of hyperparameters, the resulting agent, Recurrent Replay Distributed DQN, quadruples the previous state of the art on Atari-57, and surpasses the state of the art on DMLab-30.

Atari Games reinforcement-learning +1

α-Rank: Multi-Agent Evaluation by Evolution

1 code implementation 4 Mar 2019 Shayegan Omidshafiei, Christos Papadimitriou, Georgios Piliouras, Karl Tuyls, Mark Rowland, Jean-Baptiste Lespiau, Wojciech M. Czarnecki, Marc Lanctot, Julien Perolat, Remi Munos

We introduce {\alpha}-Rank, a principled evolutionary dynamics methodology, for the evaluation and ranking of agents in large-scale multi-agent interactions, grounded in a novel dynamical game-theoretic solution concept called Markov-Conley chains (MCCs).

Mathematical Proofs

The Termination Critic

no code implementations 26 Feb 2019 Anna Harutyunyan, Will Dabney, Diana Borsa, Nicolas Heess, Remi Munos, Doina Precup

In this work, we consider the problem of autonomously discovering behavioral abstractions, or options, for reinforcement learning agents.

Maximum a Posteriori Policy Optimisation

3 code implementations ICLR 2018 Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, Martin Riedmiller

We introduce a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective.

Continuous Control reinforcement-learning +1

Low-pass Recurrent Neural Networks - A memory architecture for longer-term correlation discovery

no code implementations 13 May 2018 Thomas Stepleton, Razvan Pascanu, Will Dabney, Siddhant M. Jayakumar, Hubert Soyer, Remi Munos

Reinforcement learning (RL) agents performing complex tasks must be able to remember observations and actions across sizable time intervals.

Reinforcement Learning (RL)

A Study on Overfitting in Deep Reinforcement Learning

1 code implementation 18 Apr 2018 Chiyuan Zhang, Oriol Vinyals, Remi Munos, Samy Bengio

We conclude with a general discussion on overfitting in RL and a study of the generalization behaviors from the perspective of inductive bias.

Inductive Bias reinforcement-learning +1

The Uncertainty Bellman Equation and Exploration

1 code implementation ICML 2018 Brendan O'Donoghue, Ian Osband, Remi Munos, Volodymyr Mnih

In this paper we consider a similar \textit{uncertainty} Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps.
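As a rough illustration of the recursion the abstract describes, here is a minimal sketch on a deterministic chain of fixed horizon, where the uncertainty at each step is the local uncertainty plus the uncertainty propagated from the next step. The per-step values and the deterministic, undiscounted setting are illustrative simplifications; the paper treats the general case:

```python
# Backward uncertainty-Bellman-style recursion on a deterministic chain:
# u[h] = local_uncertainty[h] + u[h+1], with no uncertainty beyond the
# horizon. Illustrative numbers only.
local_uncertainty = [1.0, 0.5, 0.25]
H = len(local_uncertainty)

u = [0.0] * (H + 1)          # u[H] = 0: nothing beyond the horizon
for h in reversed(range(H)):
    u[h] = local_uncertainty[h] + u[h + 1]

print(u[0])  # 1.75: total uncertainty propagated to the initial step
```

This is exactly the sense in which the equation extends exploratory benefit beyond individual time-steps: early states inherit the uncertainty of everything reachable after them.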

Noisy Networks for Exploration

15 code implementations ICLR 2018 Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg

We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration.
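A minimal sketch of the core mechanism, a linear layer whose weights carry learnable parametric noise, is below. This uses independent Gaussian noise for clarity; the paper also describes a cheaper factorised variant, and the `sigma` parameters would be trained alongside the weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_linear(x, W, b, sigma_W, sigma_b, rng):
    """NoisyNet-style layer: y = (W + sigma_W*eps_W) x + (b + sigma_b*eps_b).

    eps is resampled on each forward pass, so the induced policy is
    stochastic; setting sigma to zero recovers an ordinary linear layer.
    """
    eps_W = rng.standard_normal(W.shape)
    eps_b = rng.standard_normal(b.shape)
    return (W + sigma_W * eps_W) @ x + (b + sigma_b * eps_b)

W = rng.standard_normal((4, 3))
b = np.zeros(4)
x = np.ones(3)
y = noisy_linear(x, W, b, 0.1 * np.ones_like(W), 0.1 * np.ones_like(b), rng)
print(y.shape)  # (4,)
```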

Atari Games Efficient Exploration +2

The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning

no code implementations ICLR 2018 Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, Remi Munos

Our first contribution is a new policy evaluation algorithm called Distributional Retrace, which brings multi-step off-policy updates to the distributional reinforcement learning setting.

Atari Games Distributional Reinforcement Learning +1

Automated Curriculum Learning for Neural Networks

no code implementations ICML 2017 Alex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, Koray Kavukcuoglu

We introduce a method for automatically selecting the path, or syllabus, that a neural network follows through a curriculum so as to maximise learning efficiency.

Count-Based Exploration with Neural Density Models

1 code implementation ICML 2017 Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, Remi Munos

This pseudo-count was used to generate an exploration bonus for a DQN agent and combined with a mixed Monte Carlo update was sufficient to achieve state of the art on the Atari 2600 game Montezuma's Revenge.
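The pseudo-count construction referenced here derives a count from the density model's probability of a state before and after observing it. A sketch of that formula plus a count-based bonus follows; the bonus scale and offset are illustrative constants, not the paper's tuned values:

```python
def pseudo_count(rho, rho_prime):
    """Pseudo-count induced by a density model: rho is the model's
    probability of the state before observing it, rho_prime after one
    more observation of it (pseudo-count construction of this line of work).
    """
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def exploration_bonus(n_hat, beta=0.05):
    """Count-based bonus added to the reward; beta and the 0.01 offset
    are illustrative, not the paper's values."""
    return beta / (n_hat + 0.01) ** 0.5

n_hat = pseudo_count(0.1, 0.2)
print(n_hat)  # 0.8: the state behaves as if seen ~0.8 times
print(exploration_bonus(n_hat))
```

As the model sees a state more often, `rho_prime - rho` shrinks, the pseudo-count grows, and the bonus decays, which is the mechanism driving the Montezuma's Revenge result.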

Montezuma's Revenge

Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning

no code implementations NeurIPS 2016 Jean-Bastien Grill, Michal Valko, Remi Munos

We study the sampling-based planning problem in Markov decision processes (MDPs) that we can access only through a generative model, usually referred to as Monte-Carlo planning.

Learning to reinforcement learn

8 code implementations 17 Nov 2016 Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, Matt Botvinick

We unpack these points in a series of seven proof-of-concept experiments, each of which examines a key aspect of deep meta-RL.

Meta-Learning Meta Reinforcement Learning +2

Combining policy gradient and Q-learning

no code implementations 5 Nov 2016 Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, Volodymyr Mnih

Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting.
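For readers unfamiliar with the policy-gradient half of the combination, here is a minimal REINFORCE sketch on a two-armed bandit with deterministic rewards. It shows only the vanilla policy-gradient update; the paper's contribution is combining such an update with Q-learning, which this sketch does not do, and all constants are illustrative:

```python
import numpy as np

# Vanilla REINFORCE on a two-armed bandit: softmax policy over logits,
# update theta += alpha * reward * grad log pi(a).
rng = np.random.default_rng(0)
theta = np.zeros(2)              # policy logits
rewards = np.array([1.0, 0.0])   # arm 0 is strictly better
alpha = 0.1

for _ in range(500):
    p = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=p)
    r = rewards[a]
    grad_log = -p                # gradient of log softmax w.r.t. logits
    grad_log[a] += 1.0
    theta += alpha * r * grad_log

p = np.exp(theta) / np.exp(theta).sum()
print(p)  # nearly all probability mass on the rewarding arm
```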

Atari Games Q-Learning

Sample Efficient Actor-Critic with Experience Replay

8 code implementations 3 Nov 2016 Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, Nando de Freitas

This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems.

Continuous Control reinforcement-learning +1

Memory-Efficient Backpropagation Through Time

2 code implementations NeurIPS 2016 Audrūnas Gruslys, Remi Munos, Ivo Danihelka, Marc Lanctot, Alex Graves

We propose a novel approach to reduce memory consumption of the backpropagation through time (BPTT) algorithm when training recurrent neural networks (RNNs).
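The core memory-saving idea can be illustrated with a checkpoint-and-recompute sketch: store only every k-th hidden state during the forward pass, then recompute the states inside a segment when the backward pass needs them. The paper's contribution is a dynamic program that places checkpoints optimally under a memory budget; here k is fixed and the RNN is a toy:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.5
xs = rng.standard_normal((20, 4))

def step(h, x):
    """One toy recurrent step."""
    return np.tanh(W @ h + x)

# Baseline: full forward pass storing every hidden state (O(T) memory).
h = np.zeros(4)
full = []
for x in xs:
    h = step(h, x)
    full.append(h)

# Checkpointed pass: keep only every k-th state (O(T/k) memory).
k = 5
h = np.zeros(4)
checkpoints = {0: h}
for t, x in enumerate(xs):
    h = step(h, x)
    if (t + 1) % k == 0:
        checkpoints[t + 1] = h

def recompute(t):
    """Recompute the hidden state after step t from the nearest checkpoint,
    trading extra forward computation for memory."""
    start = (t // k) * k
    h = checkpoints[start]
    for s in range(start, t + 1):
        h = step(h, xs[s])
    return h

print(np.allclose(recompute(13), full[13]))  # True
```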

Q($λ$) with Off-Policy Corrections

no code implementations 16 Feb 2016 Anna Harutyunyan, Marc G. Bellemare, Tom Stepleton, Remi Munos

We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities.
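A sketch of a Q-function-corrected multi-step target in this spirit is below: discounted TD errors computed against the current Q-function are accumulated with λ-weights on top of the starting estimate. The trajectory values are illustrative, and this is a simplified rendering of the operator, not the paper's full analysis:

```python
def q_lambda_target(q_sa, rewards, v_next, gamma, lam):
    """Multi-step target where off-policy returns are corrected with the
    current Q-function: accumulate TD errors
        delta_t = r_t + gamma * V(s_{t+1}) - Q(s_t, a_t)
    weighted by (gamma * lam)^t on top of the starting estimate Q(s_0, a_0).
    V(s) stands for the expectation of Q under the target policy.
    """
    target = q_sa[0]
    w = 1.0
    for t in range(len(rewards)):
        delta = rewards[t] + gamma * v_next[t] - q_sa[t]
        target += w * delta
        w *= gamma * lam
    return target

q_sa = [1.0, 0.8]      # Q(s_t, a_t) along a length-2 trajectory
rewards = [0.5, 1.0]
v_next = [0.9, 0.0]    # E_pi Q(s_{t+1}, .); terminal state contributes 0
print(q_lambda_target(q_sa, rewards, v_next, gamma=0.9, lam=0.8))  # 1.454
```

Note that, unlike importance-sampling corrections, no transition probabilities or behaviour-policy ratios appear: the correction lives entirely in the reward/Q terms, which is the point of the approach.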

Black-box optimization of noisy functions with unknown smoothness

no code implementations NeurIPS 2015 Jean-bastien Grill, Michal Valko, Remi Munos

We study the problem of black-box optimization of a function $f$ of any dimension, given function evaluations perturbed by noise.

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

no code implementations 17 Sep 2015 Assaf Hallak, Aviv Tamar, Remi Munos, Shie Mannor

We consider the off-policy evaluation problem in Markov decision processes with function approximation.

Off-policy evaluation

Efficient learning by implicit exploration in bandit problems with side observations

no code implementations NeurIPS 2014 Tomáš Kocák, Gergely Neu, Michal Valko, Remi Munos

As the predictions of our first algorithm cannot be always computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism.

Combinatorial Optimization

Optimistic Planning in Markov Decision Processes Using a Generative Model

no code implementations NeurIPS 2014 Balázs Szörényi, Gunnar Kedenburg, Remi Munos

We consider the problem of online planning in a Markov decision process with discounted rewards for any given initial state.

Bounded Regret for Finite-Armed Structured Bandits

no code implementations NeurIPS 2014 Tor Lattimore, Remi Munos

We study a new type of K-armed bandit problem where the expected return of one arm may depend on the returns of other arms.

Active Regression by Stratification

no code implementations NeurIPS 2014 Sivan Sabato, Remi Munos

We propose a new active learning algorithm for parametric linear regression with random design.

Active Learning General Classification +1

On Minimax Optimal Offline Policy Evaluation

no code implementations 12 Sep 2014 Lihong Li, Remi Munos, Csaba Szepesvari

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy.

Multi-Armed Bandits Off-policy evaluation

Bandit Algorithms for Tree Search

no code implementations 9 Aug 2014 Pierre-Arnaud Coquelin, Remi Munos

Then, we introduce and analyze a Bandit Algorithm for Smooth Trees (BAST) which takes into account actual smoothness of the rewards for performing efficient "cuts" of sub-optimal branches with high confidence.

Efficient Exploration Game of Go

Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem

no code implementations 12 Dec 2013 Masrour Zoghi, Shimon Whiteson, Remi Munos, Maarten de Rijke

This paper proposes a new method for the K-armed dueling bandit problem, a variation on the regular K-armed bandit problem that offers only relative feedback about pairs of arms.

Information Retrieval Retrieval

Finite-Time Analysis of Kernelised Contextual Bandits

no code implementations 26 Sep 2013 Michal Valko, Nathaniel Korda, Remi Munos, Ilias Flaounas, Nelo Cristianini

For contextual bandits, the related algorithm GP-UCB turns out to be a special case of our algorithm, and our finite-time analysis improves the regret bound of GP-UCB for the agnostic case, both in the terms of the kernel-dependent quantity and the RKHS norm of the reward function.

Multi-Armed Bandits

Thompson Sampling for 1-Dimensional Exponential Family Bandits

no code implementations NeurIPS 2013 Nathaniel Korda, Emilie Kaufmann, Remi Munos

Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case.
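The Bernoulli case mentioned here is the classical Beta-Bernoulli instance of Thompson Sampling, sketched below; the arm means, horizon, and seed are illustrative:

```python
import numpy as np

# Beta-Bernoulli Thompson Sampling: sample one value per arm from its Beta
# posterior, play the arm with the largest sample, update the posterior.
rng = np.random.default_rng(0)
true_means = [0.9, 0.1]            # illustrative Bernoulli arm means
alpha = np.ones(2)                  # Beta posterior: successes + 1
beta = np.ones(2)                   # Beta posterior: failures + 1
pulls = np.zeros(2, dtype=int)

for _ in range(500):
    samples = rng.beta(alpha, beta)  # one posterior draw per arm
    a = int(np.argmax(samples))      # play the arm that looks best
    r = rng.random() < true_means[a] # Bernoulli reward
    alpha[a] += r
    beta[a] += 1 - r
    pulls[a] += 1

print(pulls)  # the high-mean arm accumulates most of the pulls
```

The paper's contribution is extending guarantees of this kind beyond the Bernoulli case to general 1-dimensional exponential-family reward distributions.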

Thompson Sampling
