Search Results for author: Richard S. Sutton

Found 68 papers, 15 papers with code

A Note on Stability in Asynchronous Stochastic Approximation without Communication Delays

no code implementations22 Dec 2023 Huizhen Yu, Yi Wan, Richard S. Sutton

In this paper, we study asynchronous stochastic approximation algorithms without communication delays.

reinforcement-learning

Iterative Option Discovery for Planning, by Planning

no code implementations2 Oct 2023 Kenny Young, Richard S. Sutton

Discovering useful temporal abstractions, in the form of options, is widely thought to be key to applying reinforcement learning and planning to increasingly complex domains.

Value-aware Importance Weighting for Off-policy Reinforcement Learning

no code implementations27 Jun 2023 Kristopher De Asis, Eric Graves, Richard S. Sutton

Importance sampling is a central idea underlying off-policy prediction in reinforcement learning.

reinforcement-learning
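
For context, conventional off-policy prediction scales each TD update by the importance-sampling ratio of target to behavior action probabilities. A minimal tabular sketch of that standard baseline (the function name and setting are illustrative; this is not the paper's value-aware weighting):

```python
import numpy as np

def is_weighted_td0(V, s, a, r, s_next, pi, b, alpha=0.1, gamma=0.99):
    """One off-policy TD(0) update with an ordinary importance-sampling ratio.

    pi, b: target and behavior policies as (n_states, n_actions) arrays of
    action probabilities; V: tabular state-value estimates.
    """
    rho = pi[s, a] / b[s, a]            # importance-sampling ratio
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rho * td_error      # the ratio scales the whole update
    return V
```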

Maintaining Plasticity in Deep Continual Learning

1 code implementation23 Jun 2023 Shibhansh Dohare, J. Fernando Hernandez-Garcia, Parash Rahman, A. Rupam Mahmood, Richard S. Sutton

If deep-learning systems are applied in a continual learning setting, then it is well known that they may fail to remember earlier examples.

Binary Classification Continual Learning +1

On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly Communicating MDPs

no code implementations30 Sep 2022 Yi Wan, Richard S. Sutton

We show that two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik & Sutton, 2021a) and RVI Q-learning (Abounadi, Bertsekas & Borkar, 2001), converge in weakly communicating MDPs.

Q-Learning

The Alberta Plan for AI Research

no code implementations23 Aug 2022 Richard S. Sutton, Michael Bowling, Patrick M. Pilarski

Herein we describe our approach to artificial intelligence research, which we call the Alberta Plan.

Doubly-Asynchronous Value Iteration: Making Value Iteration Asynchronous in Actions

no code implementations4 Jul 2022 Tian Tian, Kenny Young, Richard S. Sutton

However, Asynchronous VI still requires a maximization over the entire action space, making it impractical for domains with a large action space.

Toward Discovering Options that Achieve Faster Planning

no code implementations25 May 2022 Yi Wan, Richard S. Sutton

In a variant of the classic four-room domain, we show that 1) a higher objective value is typically associated with fewer elementary planning operations used by the option-value iteration algorithm to obtain a near-optimal value function, 2) our algorithm achieves an objective value that matches the one achieved by two human-designed options, 3) the amount of computation used by option-value iteration with the options discovered by our algorithm matches that with the human-designed options, and 4) the options produced by our algorithm also make intuitive sense: they appear to move to and terminate at the entrances of rooms.

The Quest for a Common Model of the Intelligent Decision Maker

no code implementations26 Feb 2022 Richard S. Sutton

It is time to recognize and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent.

Decision Making

A History of Meta-gradient: Gradient Methods for Meta-learning

no code implementations20 Feb 2022 Richard S. Sutton

The history of meta-learning methods based on gradient descent is reviewed, focusing primarily on methods that adapt step-size (learning rate) meta-parameters.

Meta-Learning

Learning Agent State Online with Recurrent Generate-and-Test

no code implementations30 Dec 2021 Amir Samani, Richard S. Sutton

Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data.

Average-Reward Learning and Planning with Options

no code implementations NeurIPS 2021 Yi Wan, Abhishek Naik, Richard S. Sutton

We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs.

reinforcement-learning Reinforcement Learning (RL)

An Empirical Comparison of Off-policy Prediction Learning Algorithms in the Four Rooms Environment

no code implementations10 Sep 2021 Sina Ghiassian, Richard S. Sutton

In the Rooms task, the product of importance sampling ratios can be as large as $2^{14}$ and can sometimes be two.

Continual Backprop: Stochastic Gradient Descent with Persistent Randomness

1 code implementation13 Aug 2021 Shibhansh Dohare, Richard S. Sutton, A. Rupam Mahmood

The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent, and second, initialization with small random weights, where the latter is essential to the effectiveness of the former.

Continual Learning Reinforcement Learning (RL)
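
A rough sketch of the generate-and-test idea this line of work builds on: alongside ordinary SGD, occasionally re-initialize the least-useful hidden units so the benefits of small random initialization persist. The utility measure and replacement rate below are placeholders, not the paper's exact definitions:

```python
import numpy as np

def reinit_least_used_units(W_in, b_in, W_out, utility, replacement_rate=1e-4,
                            rng=None):
    """Re-initialize a small fraction of hidden units chosen by a utility score.

    W_in:  (n_hidden, n_inputs) incoming weights of one hidden layer
    b_in:  (n_hidden,) biases of that layer
    W_out: (n_outputs, n_hidden) outgoing weights
    utility: per-unit usefulness estimate maintained elsewhere (placeholder)
    """
    rng = rng or np.random.default_rng()
    n_hidden = W_in.shape[0]
    n_replace = max(1, int(replacement_rate * n_hidden))
    worst = np.argsort(utility)[:n_replace]          # least-useful units
    W_in[worst] = rng.uniform(-0.01, 0.01, (n_replace, W_in.shape[1]))
    b_in[worst] = 0.0
    W_out[:, worst] = 0.0        # new units start with no effect on the output
    return worst
```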

An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task

2 code implementations2 Jun 2021 Sina Ghiassian, Richard S. Sutton

In the middle tier, the five Gradient-TD algorithms and Off-policy TD($\lambda$) were more sensitive to the bootstrapping parameter.

Planning with Expectation Models for Control

no code implementations17 Apr 2021 Katya Kudashkina, Yi Wan, Abhishek Naik, Richard S. Sutton

Our algorithms and experiments are the first to treat MBRL with expectation models in a general setting.

Model-based Reinforcement Learning

Does the Adam Optimizer Exacerbate Catastrophic Forgetting?

1 code implementation15 Feb 2021 Dylan R. Ashley, Sina Ghiassian, Richard S. Sutton

Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs); however, it continues to be a poorly understood phenomenon.

reinforcement-learning Reinforcement Learning (RL)

Average-Reward Off-Policy Policy Evaluation with Function Approximation

1 code implementation8 Jan 2021 Shangtong Zhang, Yi Wan, Richard S. Sutton, Shimon Whiteson

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function.

Incremental Policy Gradients for Online Reinforcement Learning Control

no code implementations1 Jan 2021 Kristopher De Asis, Alan Chan, Yi Wan, Richard S. Sutton

Our emphasis is on the first approach in this work, detailing an incremental policy gradient update which neither waits until the end of the episode, nor relies on learning estimates of the return.

Policy Gradient Methods reinforcement-learning +1

Understanding the Pathologies of Approximate Policy Evaluation when Combined with Greedification in Reinforcement Learning

no code implementations28 Oct 2020 Kenny Young, Richard S. Sutton

We demonstrate analytically and experimentally that such pathological behaviours can impact a wide range of RL and dynamic programming algorithms; such behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions like neural networks.

Reinforcement Learning (RL)

Inverse Policy Evaluation for Value-based Sequential Decision-making

no code implementations26 Aug 2020 Alan Chan, Kris de Asis, Richard S. Sutton

In this work, we explore the use of \textit{inverse policy evaluation}, the process of solving for a likely policy given a value function, for deriving behavior from a value function.

Decision Making Q-Learning

Learning and Planning in Average-Reward Markov Decision Processes

1 code implementation29 Jun 2020 Yi Wan, Abhishek Naik, Richard S. Sutton

We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset.
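
As a concrete illustration of point 1, here is a hedged sketch of one tabular Differential Q-learning step as I read it from this line of work: the TD error is defined relative to an estimated reward rate, and that same error also updates the rate estimate (no discounting, no reference state). Treat it as an illustration, not a verified reproduction of the paper's algorithm:

```python
import numpy as np

def differential_q_step(Q, r_bar, s, a, r, s_next, alpha=0.1, eta=1.0):
    """One tabular Differential Q-learning update (illustrative sketch).

    Q: (n_states, n_actions) action-value table; r_bar: reward-rate estimate.
    """
    delta = r - r_bar + np.max(Q[s_next]) - Q[s, a]   # differential TD error
    Q[s, a] += alpha * delta
    r_bar += eta * alpha * delta        # update the reward-rate estimate
    return Q, r_bar
```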

Learning Sparse Representations Incrementally in Deep Reinforcement Learning

no code implementations9 Dec 2019 J. Fernando Hernandez-Garcia, Richard S. Sutton

Sparse representations have been shown to be useful in deep reinforcement learning for mitigating catastrophic interference and improving the performance of agents in terms of cumulative reward.

reinforcement-learning Reinforcement Learning (RL)

Discounted Reinforcement Learning Is Not an Optimization Problem

no code implementations4 Oct 2019 Abhishek Naik, Roshan Shariff, Niko Yasui, Hengshuai Yao, Richard S. Sutton

Discounted reinforcement learning is fundamentally incompatible with function approximation for control in continuing tasks.

Misconceptions reinforcement-learning +1

Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

no code implementations9 Sep 2019 Kristopher De Asis, Alan Chan, Silviu Pitis, Richard S. Sutton, Daniel Graves

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a $\textit{fixed}$ number of future time steps.

Q-Learning reinforcement-learning +1
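
A minimal tabular sketch of the idea, assuming the on-policy prediction setting: keep one value table per horizon h, with each horizon bootstrapping from the (h-1)-step estimate at the next state and the zero-step values fixed at zero.

```python
import numpy as np

def fixed_horizon_td_step(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """One-step fixed-horizon TD update for every horizon (illustrative sketch).

    V: (H + 1, n_states) array; V[h, s] estimates the sum of rewards over the
    next h steps from state s, and V[0, :] stays at zero.
    """
    H = V.shape[0] - 1
    for h in range(1, H + 1):
        target = r + gamma * V[h - 1, s_next]   # bootstrap off the shorter horizon
        V[h, s] += alpha * (target - V[h, s])
    return V
```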

Planning with Expectation Models

no code implementations2 Apr 2019 Yi Wan, Zaheer Abbas, Adam White, Martha White, Richard S. Sutton

In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.

Model-based Reinforcement Learning
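
The equivalence in point 1 is just linearity of expectation: if the value function is linear in the state features, $v(s) = \mathbf{w}^\top \boldsymbol{\phi}(s)$, then, writing $\hat{\boldsymbol{\phi}}(s)$ for the expected next feature vector returned by the expectation model,

$\mathbb{E}[v(S_{t+1}) \mid S_t = s] = \mathbb{E}[\mathbf{w}^\top \boldsymbol{\phi}(S_{t+1}) \mid S_t = s] = \mathbf{w}^\top \mathbb{E}[\boldsymbol{\phi}(S_{t+1}) \mid S_t = s] = \mathbf{w}^\top \hat{\boldsymbol{\phi}}(s),$

so backing up through the expected next state gives the same value as averaging the value over the full next-state distribution.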

Learning Feature Relevance Through Step Size Adaptation in Temporal-Difference Learning

no code implementations8 Mar 2019 Alex Kearney, Vivek Veeriah, Jaden Travnik, Patrick M. Pilarski, Richard S. Sutton

In this paper, we examine an instance of meta-learning in which feature relevance is learned by adapting step size parameters of stochastic gradient descent---building on a variety of prior work in stochastic approximation, machine learning, and artificial neural networks.

Meta-Learning Representation Learning

Should All Temporal Difference Learning Use Emphasis?

1 code implementation1 Mar 2019 Xiang Gu, Sina Ghiassian, Richard S. Sutton

ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training, but it differs from conventional TD learning even under on-policy training.

Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target

1 code implementation22 Jan 2019 J. Fernando Hernandez-Garcia, Richard S. Sutton

Our results show that (1) using off-policy correction can have an adverse effect on the performance of Sarsa and $Q(\sigma)$; (2) increasing the backup length $n$ consistently improved performance across all the different algorithms; and (3) the performance of Sarsa and $Q$-learning was more robust to the effect of the target network update frequency than the performance of Tree Backup, $Q(\sigma)$, and Retrace in this particular task.

Q-Learning reinforcement-learning +1
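
For reference, the uncorrected n-step target that the backup-length comparison is built around looks roughly like this (a generic sketch, not the paper's exact experimental code):

```python
import numpy as np

def n_step_q_target(rewards, q_target_next, gamma=0.99):
    """Uncorrected n-step return for a DQN-style agent.

    rewards: the next n rewards [r_1, ..., r_n]
    q_target_next: target-network action values at the state n steps ahead
    """
    n = len(rewards)
    partial_return = sum(gamma**i * r for i, r in enumerate(rewards))
    return partial_return + gamma**n * np.max(q_target_next)
```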

Online Off-policy Prediction

no code implementations6 Nov 2018 Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, Adam White

The ability to learn behavior-contingent predictions online and off-policy has long been advocated as a key capability of predictive-knowledge learning systems but remained an open algorithmic challenge for decades.

Predicting Periodicity with Temporal Difference Learning

no code implementations20 Sep 2018 Kristopher De Asis, Brendan Bennett, Richard S. Sutton

Temporal difference (TD) learning is an important approach in reinforcement learning, as it combines ideas from dynamic programming and Monte Carlo methods in a way that allows for online and incremental model-free learning.

Decision Making

Per-decision Multi-step Temporal Difference Learning with Control Variates

no code implementations5 Jul 2018 Kristopher De Asis, Richard S. Sutton

Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme.
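
The n-step return underlying these intermediate algorithms interpolates between the two extremes; a minimal sketch (the per-decision control-variate correction the paper studies is omitted here):

```python
def n_step_return(rewards, v_bootstrap, gamma=1.0):
    """n-step return: the next n rewards plus a bootstrapped value estimate.

    n = 1 recovers one-step TD targets; letting n reach the end of the episode
    (with v_bootstrap = 0) recovers the Monte Carlo return.
    """
    G = v_bootstrap
    for r in reversed(rewards):      # rewards = [r_1, ..., r_n]
        G = r + gamma * G
    return G
```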

Integrating Episodic Memory into a Reinforcement Learning Agent using Reservoir Sampling

no code implementations ICLR 2018 Kenny J. Young, Richard S. Sutton, Shuo Yang

We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful.

reinforcement-learning Reinforcement Learning (RL)

A Deeper Look at Experience Replay

4 code implementations4 Dec 2017 Shangtong Zhang, Richard S. Sutton

Experience replay has recently been widely used in various deep reinforcement learning (RL) algorithms; in this paper, we rethink the utility of experience replay.

Atari Games reinforcement-learning +1
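
A minimal uniform replay buffer of the kind whose capacity and sampling the paper examines (a generic sketch, not the authors' exact setup):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of transitions with uniform random sampling."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```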

Communicative Capital for Prosthetic Agents

no code implementations10 Nov 2017 Patrick M. Pilarski, Richard S. Sutton, Kory W. Mathewson, Craig Sherstan, Adam S. R. Parker, Ann L. Edwards

This work presents an overarching perspective on the role that machine intelligence can play in enhancing human abilities, especially those that have been diminished due to injury or illness.

A First Empirical Study of Emphatic Temporal Difference Learning

no code implementations11 May 2017 Sina Ghiassian, Banafsheh Rafiee, Richard S. Sutton

In this paper we present the first empirical study of the emphatic temporal-difference learning algorithm (ETD), comparing it with conventional temporal-difference learning, in particular, with linear TD(0), on on-policy and off-policy variations of the Mountain Car problem.

GQ($λ$) Quick Reference and Implementation Guide

no code implementations10 May 2017 Adam White, Richard S. Sutton

This document should serve as a quick reference for and guide to the implementation of linear GQ($\lambda$), a gradient-based off-policy temporal-difference learning algorithm.

Policy Iterations for Reinforcement Learning Problems in Continuous Time and Space -- Fundamental Theory and Methods

1 code implementation9 May 2017 Jaeyoung Lee, Richard S. Sutton

Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem.

Decision Making Q-Learning +1

On Generalized Bellman Equations and Temporal-Difference Learning

no code implementations14 Apr 2017 Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton

As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of $\lambda$ and the unique invariant probability measure of the state-trace process.

Multi-step Reinforcement Learning: A Unifying Algorithm

no code implementations3 Mar 2017 Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, Richard S. Sutton

These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance.

Q-Learning reinforcement-learning +1

Multi-step Off-policy Learning Without Importance Sampling Ratios

1 code implementation9 Feb 2017 Ashique Rupam Mahmood, Huizhen Yu, Richard S. Sutton

We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner.

Learning Representations by Stochastic Meta-Gradient Descent in Neural Networks

no code implementations9 Dec 2016 Vivek Veeriah, Shangtong Zhang, Richard S. Sutton

In this paper, we introduce a new incremental learning algorithm called crossprop, which learns the incoming weights of hidden units based on the meta-gradient descent approach that was previously introduced by Sutton (1992) and Schraudolph (1999) for learning step-sizes.

Incremental Learning

Face valuing: Training user interfaces with facial expressions and reinforcement learning

no code implementations9 Jun 2016 Vivek Veeriah, Patrick M. Pilarski, Richard S. Sutton

The primary objective of the current work is to demonstrate that a learning agent can reduce the amount of explicit feedback required for adapting to the user's preferences pertaining to a task by learning to perceive a value of its behavior from the human user, particularly from the user's facial expressions---we call this face valuing.

BIG-bench Machine Learning reinforcement-learning +1

True Online Temporal-Difference Learning

1 code implementation13 Dec 2015 Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, Richard S. Sutton

Our results suggest that the true online methods indeed dominate the regular methods.

Atari Games

Learning to Predict Independent of Span

no code implementations19 Aug 2015 Hado van Hasselt, Richard S. Sutton

If predictions are made at a high rate or span over a large amount of time, substantial computation can be required to store all relevant observations and to update all predictions when the outcome is finally observed.

True Online Emphatic TD($λ$): Quick Reference and Implementation Guide

no code implementations25 Jul 2015 Richard S. Sutton

This document is a guide to the implementation of true online emphatic TD($\lambda$), a model-free temporal-difference algorithm for learning to make long-term predictions which combines the emphasis idea (Sutton, Mahmood & White 2015) and the true-online idea (van Seijen & Sutton 2014).

Emphatic Temporal-Difference Learning

no code implementations6 Jul 2015 A. Rupam Mahmood, Huizhen Yu, Martha White, Richard S. Sutton

Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps.

An Empirical Evaluation of True Online TD(λ)

no code implementations1 Jul 2015 Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Richard S. Sutton

Our results confirm the strength of true online TD($\lambda$): 1) for sparse feature vectors, the computational overhead with respect to TD($\lambda$) is minimal; for non-sparse features the computation time is at most twice that of TD($\lambda$), 2) across all domains/representations the learning speed of true online TD($\lambda$) is often better, but never worse, than that of TD($\lambda$), and 3) true online TD($\lambda$) is easier to use, because it does not require choosing between trace types, and it is generally more stable with respect to the step-size.
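
For reference, one linear true online TD($\lambda$) step looks roughly as follows; this is a sketch from my reading of the true-online papers (dutch-trace form, constant $\gamma$ and $\lambda$), not a verified reference implementation:

```python
import numpy as np

def true_online_td_step(w, e, v_old, phi, phi_next, r, alpha=0.01,
                        gamma=0.99, lam=0.9):
    """One linear true online TD(lambda) update (illustrative sketch).

    w: weight vector, e: dutch eligibility trace, v_old: stored value of the
    current state from the previous step (0 at the start of an episode).
    """
    v = w @ phi
    v_next = w @ phi_next
    delta = r + gamma * v_next - v
    e = gamma * lam * e + alpha * (1.0 - gamma * lam * (e @ phi)) * phi
    w = w + delta * e + (v - v_old) * (e - alpha * phi)
    return w, e, v_next      # v_next becomes v_old on the following step
```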

Temporal-Difference Networks

no code implementations NeurIPS 2004 Richard S. Sutton, Brian Tanner

We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions.

World Knowledge

An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

no code implementations14 Mar 2015 Richard S. Sutton, A. Rupam Mahmood, Martha White

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps.
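
A hedged sketch of one linear ETD($\lambda$) update as I understand it from this line of work (constant $\gamma$, $\lambda$, and interest, with the importance-sampling ratio from the previous step passed in explicitly); treat it as an illustration rather than a reference implementation:

```python
import numpy as np

def etd_step(theta, e, F, phi, phi_next, r, rho, rho_prev,
             alpha=0.01, gamma=0.99, lam=0.8, interest=1.0):
    """One linear Emphatic TD(lambda) update (illustrative sketch).

    F: follow-on trace, e: emphatic eligibility trace, rho/rho_prev: current
    and previous importance-sampling ratios.
    """
    F = rho_prev * gamma * F + interest          # follow-on trace
    M = lam * interest + (1.0 - lam) * F         # emphasis for this step
    e = rho * (gamma * lam * e + M * phi)        # emphasis-weighted trace
    delta = r + gamma * theta @ phi_next - theta @ phi
    theta = theta + alpha * delta * e
    return theta, e, F
```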

Weighted importance sampling for off-policy learning with linear function approximation

no code implementations NeurIPS 2014 A. Rupam Mahmood, Hado P. Van Hasselt, Richard S. Sutton

Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD($\lambda$).

Universal Option Models

no code implementations NeurIPS 2014 Hengshuai Yao, Csaba Szepesvari, Richard S. Sutton, Joseph Modayil, Shalabh Bhatnagar

We prove that the UOM of an option can construct a traditional option model given a reward function, and the option-conditional return is computed directly by a single dot-product of the UOM with the reward function.

Temporal-Difference Learning to Assist Human Decision Making during the Control of an Artificial Limb

no code implementations18 Sep 2013 Ann L. Edwards, Alexandra Kearney, Michael Rory Dawson, Richard S. Sutton, Patrick M. Pilarski

In the present work, we explore the use of temporal-difference learning and GVFs to predict when users will switch their control influence between the different motor functions of a robot arm.

Decision Making Reinforcement Learning (RL)

Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping

no code implementations13 Jun 2012 Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, Michael P. Bowling

Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions.

Off-Policy Actor-Critic

1 code implementation22 May 2012 Thomas Degris, Martha White, Richard S. Sutton

Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning.

reinforcement-learning Reinforcement Learning (RL)

Multi-timescale Nexting in a Reinforcement Learning Robot

no code implementations6 Dec 2011 Joseph Modayil, Adam White, Richard S. Sutton

The term "nexting" has been used by psychologists to refer to the propensity of people and many other animals to continually predict what will happen next in an immediate, local, and personal sense.

reinforcement-learning Reinforcement Learning (RL)

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

no code implementations NeurIPS 2009 Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks.

Q-Learning

A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

no code implementations NeurIPS 2008 Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, target policy, and exciting behavior policy, and whose complexity scales linearly in the number of parameters.

A computational model of hippocampal function in trace conditioning

no code implementations NeurIPS 2008 Elliot A. Ludvig, Richard S. Sutton, Eric Verbeek, E. J. Kehoe

For trace conditioning, with no contiguity between stimulus and reward, these long-latency temporal elements are vital to learning adaptively timed responses.

Hippocampus

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

1 code implementation Artificial Intelligence 1999 Richard S. Sutton, Doina Precup, Satinder Singh

In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning.

Q-Learning reinforcement-learning
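
The interchangeability with primitive actions shows up most directly in the SMDP Q-learning update, sketched below under the usual assumptions (option o runs for k steps from s and accumulates a discounted reward along the way):

```python
import numpy as np

def smdp_q_update(Q, s, o, discounted_reward, k, s_next, alpha=0.1, gamma=0.99):
    """SMDP Q-learning update over options (illustrative sketch).

    discounted_reward: sum of gamma**(i-1) * r_i for the k rewards received
    while option o executed; s_next: the state where the option terminated.
    """
    target = discounted_reward + gamma**k * np.max(Q[s_next])
    Q[s, o] += alpha * (target - Q[s, o])
    return Q
```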

Learning to Predict by the Methods of Temporal Differences

1 code implementation Machine Learning 1988 Richard S. Sutton

This article introduces a class of incremental learning procedures specialized for prediction, that is, for using past experience with an incompletely known system to predict its future behavior.

Incremental Learning
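
The simplest member of this class, tabular TD(0), updates each prediction toward a one-step bootstrapped target instead of waiting for the final outcome; a minimal sketch:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """One tabular TD(0) prediction update: move V[s] toward the bootstrapped
    target r + gamma * V[s_next]."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V
```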
