no code implementations • 22 Dec 2023 • Huizhen Yu, Yi Wan, Richard S. Sutton
In this paper, we study asynchronous stochastic approximation algorithms without communication delays.
no code implementations • 2 Oct 2023 • Kenny Young, Richard S. Sutton
Discovering useful temporal abstractions, in the form of options, is widely thought to be key to applying reinforcement learning and planning to increasingly complex domains.
no code implementations • 27 Jun 2023 • Kristopher De Asis, Eric Graves, Richard S. Sutton
Importance sampling is a central idea underlying off-policy prediction in reinforcement learning.
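As a hedged illustration of this idea (not the paper's specific method), the sketch below applies the per-step importance sampling ratio to an off-policy linear TD(0) update; the function and variable names are ours.

```python
# Illustrative sketch only: per-decision importance sampling in an
# off-policy linear TD(0) update; names and step-sizes are assumptions.
import numpy as np

def off_policy_td0_step(w, x, r, x_next, pi_prob, b_prob, alpha=0.1, gamma=0.99):
    """pi_prob and b_prob are the target- and behaviour-policy probabilities
    of the action actually taken; x and x_next are feature vectors."""
    rho = pi_prob / b_prob                                  # importance sampling ratio
    delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)    # TD error
    return w + alpha * rho * delta * x                      # ratio reweights the update
```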
1 code implementation • 23 Jun 2023 • Shibhansh Dohare, J. Fernando Hernandez-Garcia, Parash Rahman, A. Rupam Mahmood, Richard S. Sutton
If deep-learning systems are applied in a continual learning setting, then it is well known that they may fail to remember earlier examples.
no code implementations • 30 Sep 2022 • Yi Wan, Richard S. Sutton
We show that two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik & Sutton 2021a) and RVI Q-learning (Abounadi, Bertsekas & Borkar 2001), converge in weakly communicating MDPs.
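For context, a minimal tabular sketch of a Differential Q-learning-style update is given below; this is our reading of the algorithm, and the variable names and step-sizes are ours.

```python
# Minimal tabular sketch of a Differential Q-learning-style update; the
# reward-rate estimate r_bar plays the role that discounting plays elsewhere.
def differential_q_step(Q, r_bar, s, a, r, s_next, actions, alpha=0.1, eta=1.0):
    """Q: dict mapping (state, action) -> value; actions: available actions."""
    delta = r - r_bar + max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * delta          # off-policy action-value update
    r_bar += eta * alpha * delta        # update of the average-reward estimate
    return Q, r_bar
```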
no code implementations • 23 Aug 2022 • Richard S. Sutton, Michael Bowling, Patrick M. Pilarski
Herein we describe our approach to artificial intelligence research, which we call the Alberta Plan.
no code implementations • 4 Jul 2022 • Tian Tian, Kenny Young, Richard S. Sutton
However, Asynchronous VI still requires a maximization over the entire action space, making it impractical for domains with a large action space.
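To make that cost concrete, here is an illustrative single-state asynchronous value-iteration backup (our notation): even though only one state is updated, the max still sweeps the whole action space.

```python
# Illustrative single-state asynchronous VI backup (our notation):
# one state is updated in place, but the max ranges over every action.
def async_vi_backup(V, s, P, R, gamma=0.95):
    """P[s][a]: list of (prob, next_state); R[s][a]: expected reward."""
    V[s] = max(
        R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
        for a in range(len(P[s]))       # full sweep over the action space
    )
    return V
```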
no code implementations • 25 May 2022 • Yi Wan, Richard S. Sutton
In a variant of the classic four-room domain, we show that 1) a higher objective value is typically associated with fewer elementary planning operations used by the option-value iteration algorithm to obtain a near-optimal value function, 2) our algorithm achieves an objective value that matches that achieved by two human-designed options, 3) the amount of computation used by option-value iteration with options discovered by our algorithm matches that with the human-designed options, and 4) the options produced by our algorithm also make intuitive sense: they seem to move to and terminate at the entrances of rooms.
no code implementations • 26 Feb 2022 • Richard S. Sutton
It is time to recognize and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent.
no code implementations • 20 Feb 2022 • Richard S. Sutton
The history of meta-learning methods based on gradient descent is reviewed, focusing primarily on methods that adapt step-size (learning rate) meta-parameters.
no code implementations • 7 Feb 2022 • Richard S. Sutton, Marlos C. Machado, G. Zacharias Holland, David Szepesvari, Finbarr Timbers, Brian Tanner, Adam White
Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process.
no code implementations • 30 Dec 2021 • Amir Samani, Richard S. Sutton
Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data.
no code implementations • NeurIPS 2021 • Yi Wan, Abhishek Naik, Richard S. Sutton
We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs.
no code implementations • 10 Sep 2021 • Sina Ghiassian, Richard S. Sutton
In the Rooms task, the product of importance sampling ratios can be as large as $2^{14}$ and can sometimes be two.
1 code implementation • 13 Aug 2021 • Shibhansh Dohare, Richard S. Sutton, A. Rupam Mahmood
The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former.
2 code implementations • 2 Jun 2021 • Sina Ghiassian, Richard S. Sutton
In the middle tier, the five Gradient-TD algorithms and Off-policy TD($\lambda$) were more sensitive to the bootstrapping parameter.
no code implementations • 17 Apr 2021 • Katya Kudashkina, Yi Wan, Abhishek Naik, Richard S. Sutton
Our algorithms and experiments are the first to treat MBRL with expectation models in a general setting.
1 code implementation • 15 Feb 2021 • Dylan R. Ashley, Sina Ghiassian, Richard S. Sutton
Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs); however, it continues to be a poorly understood phenomenon.
1 code implementation • 8 Jan 2021 • Shangtong Zhang, Yi Wan, Richard S. Sutton, Shimon Whiteson
We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function.
no code implementations • 1 Jan 2021 • Kristopher De Asis, Alan Chan, Yi Wan, Richard S. Sutton
Our emphasis is on the first approach in this work, detailing an incremental policy gradient update which neither waits until the end of the episode, nor relies on learning estimates of the return.
no code implementations • 28 Oct 2020 • Kenny Young, Richard S. Sutton
We demonstrate analytically and experimentally that such pathological behaviours can impact a wide range of RL and dynamic programming algorithms; such behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions like neural networks.
no code implementations • 27 Aug 2020 • Katya Kudashkina, Patrick M. Pilarski, Richard S. Sutton
In this article we argue for the domain of voice document editing and for the methods of model-based reinforcement learning.
no code implementations • 26 Aug 2020 • Alan Chan, Kris de Asis, Richard S. Sutton
In this work, we explore the use of \textit{inverse policy evaluation}, the process of solving for a likely policy given a value function, for deriving behavior from a value function.
1 code implementation • 29 Jun 2020 • Yi Wan, Abhishek Naik, Richard S. Sutton
We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset.
no code implementations • 9 Dec 2019 • J. Fernando Hernandez-Garcia, Richard S. Sutton
Sparse representations have been shown to be useful in deep reinforcement learning for mitigating catastrophic interference and improving the performance of agents in terms of cumulative reward.
no code implementations • 4 Oct 2019 • Abhishek Naik, Roshan Shariff, Niko Yasui, Hengshuai Yao, Richard S. Sutton
Discounted reinforcement learning is fundamentally incompatible with function approximation for control in continuing tasks.
no code implementations • 9 Sep 2019 • Kristopher De Asis, Alan Chan, Silviu Pitis, Richard S. Sutton, Daniel Graves
We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a $\textit{fixed}$ number of future time steps.
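As a hedged illustration of the idea, a tabular fixed-horizon TD(0)-style update might keep one value table per horizon, each bootstrapping from the table one horizon shorter; the notation and step-sizes below are ours, not the paper's.

```python
# Hedged sketch of a tabular fixed-horizon TD(0)-style update: V[h][s]
# estimates the sum of the next h rewards; the one-step table bootstraps from nothing.
def fixed_horizon_td_step(V, s, r, s_next, H, alpha=0.1, gamma=1.0):
    for h in range(1, H + 1):
        bootstrap = V[h - 1][s_next] if h > 1 else 0.0   # value over h-1 remaining steps
        target = r + gamma * bootstrap
        V[h][s] += alpha * (target - V[h][s])
    return V
```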
no code implementations • 2 Apr 2019 • Yi Wan, Zaheer Abbas, Adam White, Martha White, Richard S. Sutton
In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.
no code implementations • 8 Mar 2019 • Alex Kearney, Vivek Veeriah, Jaden Travnik, Patrick M. Pilarski, Richard S. Sutton
In this paper, we examine an instance of meta-learning in which feature relevance is learned by adapting step size parameters of stochastic gradient descent---building on a variety of prior work in stochastic approximation, machine learning, and artificial neural networks.
1 code implementation • 1 Mar 2019 • Xiang Gu, Sina Ghiassian, Richard S. Sutton
ETD was proposed mainly to address convergence issues of conventional temporal-difference (TD) learning under off-policy training, but it is different from conventional TD learning even under on-policy training.
1 code implementation • 22 Jan 2019 • J. Fernando Hernandez-Garcia, Richard S. Sutton
Our results show that (1) using off-policy correction can have an adverse effect on the performance of Sarsa and $Q(\sigma)$; (2) increasing the backup length $n$ consistently improved performance across all the different algorithms; and (3) the performance of Sarsa and $Q$-learning was more robust to the effect of the target network update frequency than the performance of Tree Backup, $Q(\sigma)$, and Retrace in this particular task.
no code implementations • 6 Nov 2018 • Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, Adam White
The ability to learn behavior-contingent predictions online and off-policy has long been advocated as a key capability of predictive-knowledge learning systems but remained an open algorithmic challenge for decades.
no code implementations • 20 Sep 2018 • Kristopher De Asis, Brendan Bennett, Richard S. Sutton
Temporal difference (TD) learning is an important approach in reinforcement learning, as it combines ideas from dynamic programming and Monte Carlo methods in a way that allows for online and incremental model-free learning.
no code implementations • 5 Jul 2018 • Kristopher De Asis, Richard S. Sutton
Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme.
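For illustration, the n-step TD target below interpolates between one-step TD (n = 1) and Monte Carlo (n spanning the whole episode); the helper is ours, not the paper's code.

```python
# Illustrative n-step TD target: sum the next n (discounted) rewards, then
# bootstrap from the value estimate of the state reached after n steps.
def n_step_target(rewards, V, s_n, n, gamma=0.99):
    """rewards: the n rewards following the updated state; s_n: state after n steps."""
    G = sum(gamma ** i * r for i, r in enumerate(rewards[:n]))
    return G + gamma ** n * V[s_n]
```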
no code implementations • ICLR 2018 • Kenny J. Young, Richard S. Sutton, Shuo Yang
We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful.
no code implementations • 18 May 2018 • Sina Ghiassian, Huizhen Yu, Banafsheh Rafiee, Richard S. Sutton
We apply neural nets with ReLU gates in online reinforcement learning.
no code implementations • 10 Apr 2018 • Alex Kearney, Vivek Veeriah, Jaden B. Travnik, Richard S. Sutton, Patrick M. Pilarski
In this paper, we introduce a method for adapting the step-sizes of temporal difference (TD) learning.
no code implementations • 16 Feb 2018 • Jaden B. Travnik, Kory W. Mathewson, Richard S. Sutton, Patrick M. Pilarski
The relationship between a reinforcement learning (RL) agent and an asynchronous environment is often ignored.
no code implementations • 25 Jan 2018 • Craig Sherstan, Brendan Bennett, Kenny Young, Dylan R. Ashley, Adam White, Martha White, Richard S. Sutton
This paper investigates estimating the variance of a temporal-difference learning agent's update target.
4 code implementations • 4 Dec 2017 • Shangtong Zhang, Richard S. Sutton
Experience replay has recently been widely used in various deep reinforcement learning (RL) algorithms; in this paper, we rethink its utility.
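For readers unfamiliar with the mechanism, a minimal, purely illustrative uniform replay buffer looks like the following; it is not the implementation studied in the paper.

```python
# Minimal illustrative uniform replay buffer; capacity and batch size are
# arbitrary, and eviction is first-in-first-out.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are evicted

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```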
no code implementations • 10 Nov 2017 • Patrick M. Pilarski, Richard S. Sutton, Kory W. Mathewson, Craig Sherstan, Adam S. R. Parker, Ann L. Edwards
This work presents an overarching perspective on the role that machine intelligence can play in enhancing human abilities, especially those that have been diminished due to injury or illness.
no code implementations • 11 May 2017 • Sina Ghiassian, Banafsheh Rafiee, Richard S. Sutton
In this paper we present the first empirical study of the emphatic temporal-difference learning algorithm (ETD), comparing it with conventional temporal-difference learning, in particular, with linear TD(0), on on-policy and off-policy variations of the Mountain Car problem.
no code implementations • 10 May 2017 • Adam White, Richard S. Sutton
This document should serve as a quick reference for and guide to the implementation of linear GQ($\lambda$), a gradient-based off-policy temporal-difference learning algorithm.
1 code implementation • 9 May 2017 • Jaeyoung Lee, Richard S. Sutton
Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem.
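A compact tabular sketch of that loop is given below (our notation; the evaluation step is truncated to a fixed number of sweeps for brevity).

```python
# Illustrative tabular policy iteration: alternate (truncated) policy
# evaluation with greedy policy improvement until the policy is stable.
import numpy as np

def policy_iteration(P, R, gamma=0.95, eval_sweeps=100):
    """P[s][a]: list of (prob, next_state); R[s][a]: expected reward."""
    n_states, n_actions = len(P), len(P[0])
    pi = [0] * n_states
    V = np.zeros(n_states)
    while True:
        for _ in range(eval_sweeps):             # policy evaluation sweeps
            for s in range(n_states):
                a = pi[s]
                V[s] = R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
        new_pi = [
            max(range(n_actions),
                key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            for s in range(n_states)
        ]                                        # greedy policy improvement
        if new_pi == pi:
            return pi, V
        pi = new_pi
```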
no code implementations • 14 Apr 2017 • Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton
As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of $\lambda$ and the unique invariant probability measure of the state-trace process.
no code implementations • 3 Mar 2017 • Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, Richard S. Sutton
These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance.
1 code implementation • 9 Feb 2017 • Ashique Rupam Mahmood, Huizhen Yu, Richard S. Sutton
We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner.
no code implementations • 9 Dec 2016 • Vivek Veeriah, Shangtong Zhang, Richard S. Sutton
In this paper, we introduce a new incremental learning algorithm called crossprop, which learns the incoming weights of hidden units using the meta-gradient descent approach previously introduced by Sutton (1992) and Schraudolph (1999) for learning step-sizes.
no code implementations • 9 Jun 2016 • Vivek Veeriah, Patrick M. Pilarski, Richard S. Sutton
The primary objective of the current work is to demonstrate that a learning agent can reduce the amount of explicit feedback required for adapting to the user's preferences pertaining to a task by learning to perceive a value of its behavior from the human user, particularly from the user's facial expressions---we call this face valuing.
1 code implementation • 13 Dec 2015 • Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, Richard S. Sutton
Our results suggest that the true online methods indeed dominate the regular methods.
no code implementations • 19 Aug 2015 • Hado van Hasselt, Richard S. Sutton
If predictions are made at a high rate or span over a large amount of time, substantial computation can be required to store all relevant observations and to update all predictions when the outcome is finally observed.
no code implementations • 25 Jul 2015 • Richard S. Sutton
This document is a guide to the implementation of true online emphatic TD($\lambda$), a model-free temporal-difference algorithm for learning to make long-term predictions which combines the emphasis idea (Sutton, Mahmood & White 2015) and the true-online idea (van Seijen & Sutton 2014).
no code implementations • 6 Jul 2015 • A. Rupam Mahmood, Huizhen Yu, Martha White, Richard S. Sutton
Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps.
no code implementations • 1 Jul 2015 • Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Richard S. Sutton
Our results confirm the strength of true online TD($\lambda$): 1) for sparse feature vectors, the computational overhead with respect to TD($\lambda$) is minimal; for non-sparse features the computation time is at most twice that of TD($\lambda$), 2) across all domains/representations the learning speed of true online TD($\lambda$) is often better, but never worse than that of TD($\lambda$), and 3) true online TD($\lambda$) is easier to use, because it does not require choosing between trace types, and it is generally more stable with respect to the step-size.
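For reference, one true online TD($\lambda$) update with dutch-style traces can be sketched as follows; this follows the standard pseudocode rendered in our notation and is illustrative only.

```python
# Compact sketch of one true online TD(lambda) update with dutch-style traces.
import numpy as np

def true_online_td_step(w, z, v_old, x, r, x_next, alpha=0.05, gamma=1.0, lam=0.9):
    v, v_next = np.dot(w, x), np.dot(w, x_next)
    delta = r + gamma * v_next - v                                       # TD error
    z = gamma * lam * z + (1 - alpha * gamma * lam * np.dot(z, x)) * x   # dutch trace
    w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
    return w, z, v_next              # v_next becomes v_old on the next step
```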
no code implementations • NeurIPS 2004 • Richard S. Sutton, Brian Tanner
We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions.
no code implementations • 14 Mar 2015 • Richard S. Sutton, A. Rupam Mahmood, Martha White
In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps.
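A hedged sketch of how such emphasis can enter a single emphatic TD($\lambda$)-style update is shown below; this is our reading, and the variable names, interest setting, and step-sizes are ours.

```python
# Hedged sketch of an emphatic TD(lambda)-style update: the followon trace F
# and emphasis M reweight updates; rho is the importance sampling ratio.
import numpy as np

def etd_step(w, e, F, x, r, x_next, rho, rho_prev, interest=1.0,
             alpha=0.05, gamma=1.0, lam=0.9):
    F = rho_prev * gamma * F + interest               # followon trace
    M = lam * interest + (1 - lam) * F                # emphasis for this step
    e = rho * (gamma * lam * e + M * x)               # emphatic eligibility trace
    delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)
    w = w + alpha * delta * e
    return w, e, F
```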
no code implementations • NeurIPS 2014 • A. Rupam Mahmood, Hado P. Van Hasselt, Richard S. Sutton
Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD($\lambda$).
no code implementations • NeurIPS 2014 • Hengshuai Yao, Csaba Szepesvari, Richard S. Sutton, Joseph Modayil, Shalabh Bhatnagar
We prove that the UOM of an option can construct a traditional option model given a reward function, and the option-conditional return is computed directly by a single dot-product of the UOM with the reward function.
no code implementations • 18 Sep 2013 • Ann L. Edwards, Alexandra Kearney, Michael Rory Dawson, Richard S. Sutton, Patrick M. Pilarski
In the present work, we explore the use of temporal-difference learning and GVFs to predict when users will switch their control influence between the different motor functions of a robot arm.
no code implementations • 13 Jun 2012 • Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, Michael P. Bowling
Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions.
1 code implementation • 22 May 2012 • Thomas Degris, Martha White, Richard S. Sutton
Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning.
no code implementations • 6 Dec 2011 • Joseph Modayil, Adam White, Richard S. Sutton
The term "nexting" has been used by psychologists to refer to the propensity of people and many other animals to continually predict what will happen next in an immediate, local, and personal sense.
no code implementations • NeurIPS 2009 • Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks.
no code implementations • NeurIPS 2009 • Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári
We extend the Dyna planning architecture for policy evaluation and control in two significant aspects.
no code implementations • NeurIPS 2008 • Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, target policy, and exciting behavior policy, and whose complexity scales linearly in the number of parameters.
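In the spirit of that algorithm, a gradient-TD-style update can be sketched with a secondary weight vector that tracks the expected TD update; this is our illustrative rendering, with off-policy importance corrections omitted for brevity, and is not the paper's exact pseudocode.

```python
# Hedged gradient-TD-style sketch (importance corrections omitted): u tracks
# the expected TD update, and w follows a corrected, stable direction.
import numpy as np

def gtd_style_step(w, u, x, r, x_next, alpha=0.01, beta=0.05, gamma=0.99):
    delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)    # TD error
    u = u + beta * (delta * x - u)                           # track E[delta * x]
    w = w + alpha * (x - gamma * x_next) * np.dot(x, u)      # main weight update
    return w, u
```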
no code implementations • NeurIPS 2008 • Elliot A. Ludvig, Richard S. Sutton, Eric Verbeek, E. J. Kehoe
For trace conditioning, with no contiguity between stimulus and reward, these long-latency temporal elements are vital to learning adaptively timed responses.
1 code implementation • Artificial Intelligence 1999 • Richard S. Sutton, Doina Precup, Satinder Singh
In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning.
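To illustrate that interchangeability, here is a sketch of an SMDP-style Q-learning update over options (our notation): after an option runs for k steps and accrues discounted return G, it is updated just as a primitive action would be.

```python
# Illustrative SMDP-style Q-learning update over options (our notation).
def smdp_q_update(Q, s, o, G, k, s_next, options, alpha=0.1, gamma=0.99):
    """G: discounted return accumulated during the k steps the option o ran."""
    target = G + gamma ** k * max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] += alpha * (target - Q[(s, o)])
    return Q
```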
1 code implementation • Machine Learning 1988 • Richard S. Sutton
This article introduces a class of incremental learning procedures specialized for prediction, that is, for using past experience with an incompletely known system to predict its future behavior.
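As a hedged sketch of such a procedure, linear TD($\lambda$) with accumulating eligibility traces updates earlier predictions from the current temporal difference; the rendering and parameter values below are ours.

```python
# Compact sketch of linear TD(lambda) with accumulating eligibility traces;
# step-size and trace parameters are illustrative.
import numpy as np

def td_lambda_step(w, e, x, r, x_next, alpha=0.05, gamma=1.0, lam=0.9):
    delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)   # temporal-difference error
    e = gamma * lam * e + x                                 # decay traces, add current features
    w = w + alpha * delta * e                               # credit earlier predictions
    return w, e
```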