Published on Thu Jan 25 2018

Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods

Craig Sherstan, Brendan Bennett, Kenny Young, Dylan R. Ashley, Adam White, Martha White, Richard S. Sutton

This paper investigates estimating the variance of a temporal-difference learning agent's update target. Most reinforcement learning methods use an estimate of the value function, which captures how good it is for the agent to be in a particular state. Our approach is significantly simpler than prior methods that independently estimate the second moment of the λ-return.

Abstract

This paper investigates estimating the variance of a temporal-difference learning agent's update target. Most reinforcement learning methods use an estimate of the value function, which captures how good it is for the agent to be in a particular state and is mathematically expressed as the expected sum of discounted future rewards (called the return). These values can be straightforwardly estimated by averaging batches of returns using Monte Carlo methods. However, if we wish to update the agent's value estimates during learning, before terminal outcomes are observed, we must use a different estimation target called the λ-return, which truncates the return with the agent's own estimate of the value function. Temporal-difference learning methods estimate the expected λ-return for each state, allowing these methods to update online and incrementally, and in most cases achieve better generalization error and faster learning than Monte Carlo methods. Naturally, one could attempt to estimate higher-order moments of the λ-return. This paper is about estimating the variance of the λ-return. Prior work has shown that given estimates of the variance of the λ-return, learning systems can be constructed to (1) mitigate risk in action selection, and (2) automatically adapt the parameters of the learning process itself to improve performance. Unfortunately, existing methods for estimating the variance of the λ-return are complex and not well understood empirically. We contribute a method for estimating the variance of the λ-return directly using policy evaluation methods from reinforcement learning. Our approach is significantly simpler than prior methods that independently estimate the second moment of the λ-return. Empirically our new approach behaves at least as well as existing approaches, but is generally more robust.
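To make the idea concrete, here is a minimal tabular sketch of a direct, TD-style variance learner. It is an illustration under stated assumptions rather than the paper's exact algorithm: the `env_step` helper, the one-step value update (instead of full TD(λ) with traces), the squared TD error used as the variance learner's meta-reward, and the (γλ)² discount are all assumptions of this sketch.

```python
import numpy as np

def evaluate_with_direct_variance(env_step, n_states, n_episodes,
                                  alpha=0.1, beta=0.1, gamma=0.99, lam=0.9):
    """Tabular policy evaluation that maintains a value estimate V and a
    direct variance estimate W for each state.

    `env_step(s)` is a hypothetical helper that follows the fixed policy and
    returns (reward, next_state, done). The value estimate uses an ordinary
    one-step TD update; the variance estimate uses a TD-style update whose
    meta-reward is the squared TD error and whose discount is (gamma*lam)**2
    (assumptions made for this sketch).
    """
    V = np.zeros(n_states)            # estimate of the expected lambda-return
    W = np.zeros(n_states)            # direct estimate of its variance
    gamma_bar = (gamma * lam) ** 2    # discount used by the variance learner

    for _ in range(n_episodes):
        s, done = 0, False
        while not done:
            r, s_next, done = env_step(s)
            if done:
                delta = r - V[s]                       # TD error at termination
                delta_bar = delta ** 2 - W[s]          # nothing to bootstrap from
            else:
                delta = r + gamma * V[s_next] - V[s]   # ordinary TD error
                delta_bar = delta ** 2 + gamma_bar * W[s_next] - W[s]
            V[s] += alpha * delta
            W[s] += beta * delta_bar
            s = s_next
    return V, W
```

The variance learner is itself just another policy-evaluation problem, which is why the usual TD machinery (step sizes, traces, function approximation) carries over; `beta` is kept separate from `alpha` so the two learners can be run at different rates.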

Fri Jul 05 2019
Artificial Intelligence
Incrementally Learning Functions of the Return
Temporal difference methods are of broader interest because they correspond to learning as observed in biological systems. We propose a means of estimating functions of the return using its moments, which can be learned online using a modified TD algorithm.
Mon Sep 09 2019
Artificial Intelligence
Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning
We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. These algorithms bootstrap from the value function for a shorter horizon.
Sat Jul 02 2016
Artificial Intelligence
A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning
One of the main obstacles to broad application of reinforcement learning methods is the parameter sensitivity of our core learning algorithms. In many large-scale applications, online computation and function approximation are key strategies for scaling up reinforcement learning algorithms.
Fri Oct 31 2008
Artificial Intelligence
Temporal Difference Updating without a Learning Rate
We derive an equation for temporal difference learning from statistical principles. We test this new learning rule against TD(λ) and find that it offers superior performance in various settings. We then investigate how to extend our new temporal difference algorithm to reinforcement learning.
Sat Aug 15 2020
Artificial Intelligence
Reducing Sampling Error in Batch Temporal Difference Learning
This paper studies the use of TD(0) to estimate the value function of a given policy from a batch of data. The update following an action is weighted according to the number of times that action occurred in the batch, not the true probability of the action under the given policy.
Wed Aug 19 2015
Machine Learning
Learning to Predict Independent of Span
Conventional algorithms wait until an outcome is observed to update their predictions. We show that the exact same predictions can be learned in a much more computationally congenial way. We apply this idea to various settings of increasing generality.