In this paper I recreate the results from Sutton’s 1988 paper, “Learning to Predict by the Methods of Temporal Differences.” Temporal difference (TD) methods are concerned with learning to predict, that is, using past experience with an incompletely known system to predict its future behavior. TD methods use the error between temporally successive predictions for credit assignment and are thus better at producing accurate predictions for rare and novel states than traditional supervised learning techniques. Another advantage is that TD agents learn directly from episodes of experience, so the training data does not require additional processing prior to learning.
For our experiments we consider one of the simplest dynamical systems, a bounded random walk, and seek to estimate the value of each state. Our agent is thus attempting to learn the behavior of a Markov process consisting of seven states, A through G. From each nonterminal state the walk moves to its left or right neighbor with equal probability, and the terminal states A and G yield rewards of 0 and 1 respectively. Both experiments consist of 100 training sets, each containing 10 training sequences.
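As a concrete sketch of this setup (the function name and variables below are our own; the one-hot encoding of the nonterminal states mirrors the unit-basis-vector representation used in Sutton's experiments), a single training sequence might be generated as follows:

```python
import numpy as np

def generate_sequence(rng=None):
    """Generate one bounded random-walk sequence (illustrative sketch).

    Nonterminal states B..F are encoded as one-hot vectors; the walk starts in
    the center state D and steps left or right with equal probability until it
    reaches terminal state A (outcome 0) or G (outcome 1).
    """
    rng = rng if rng is not None else np.random.default_rng()
    state = 3                             # index of D among A=0 .. G=6
    observations = []                     # one-hot vectors for the visited nonterminal states
    while 0 < state < 6:
        x = np.zeros(5)
        x[state - 1] = 1.0                # B..F map to components 0..4
        observations.append(x)
        state += rng.choice([-1, 1])
    outcome = 1.0 if state == 6 else 0.0  # reward observed at the terminal state
    return observations, outcome
```

Each sequence is thus the list of nonterminal-state vectors visited, together with the 0 or 1 outcome observed at termination.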

Weight updates are computed incrementally: the error between temporally successive predictions is multiplied by a step-size parameter (alpha) and by the sum of the gradients, with respect to w, of all past predictions, each weighted by an exponentially decaying power of lambda. Because that weighted sum can be carried forward as a running total, each delta w can be computed from quantities already on hand, which is the main computational advantage of the incremental formulation.
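Written out, this is the standard TD(λ) update; the weighted sum of past gradients is the eligibility trace e_t, which can be maintained recursively:

```latex
\Delta w_t = \alpha \,(P_{t+1} - P_t)\, e_t,
\qquad
e_t = \sum_{k=1}^{t} \lambda^{\,t-k}\, \nabla_w P_k
    = \nabla_w P_t + \lambda\, e_{t-1},
\qquad e_0 = 0 .
```

Only e_{t-1}, the current gradient, and the two most recent predictions need to be kept in memory at any step.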

Repeated presentations
In the first experiment we accumulated the delta w’s over all sequences in a training set and only then used them to update the weight vector. Additionally, each training set was presented repeatedly until it no longer produced significant changes to the weight vector, and we used a sufficiently small value of alpha to guarantee this convergence. With this procedure the initial values of the weight vector have no effect on the results, because the weights always converge to the same values.
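A minimal sketch of this procedure, using the (observations, outcome) sequence format above (the helper names, the value of alpha, and the convergence tolerance below are illustrative, not the exact values we used):

```python
import numpy as np

def sequence_deltas(w, observations, outcome, lam, alpha):
    """Accumulate the TD(lambda) delta w's for one sequence, with linear predictions P_t = w . x_t."""
    delta_w = np.zeros_like(w)
    e = np.zeros_like(w)                       # running eligibility sum of past gradients
    for t, x in enumerate(observations):
        e = lam * e + x                        # gradient of P_t with respect to w is x_t
        p_t = w @ x
        p_next = outcome if t == len(observations) - 1 else w @ observations[t + 1]
        delta_w += alpha * (p_next - p_t) * e
    return delta_w

def train_repeated(training_set, lam, alpha=0.01, tol=1e-4):
    """Experiment 1: accumulate deltas over the whole training set, update once, repeat to convergence."""
    w = np.full(5, 0.5)                        # starting point does not affect the fixed point
    while True:
        delta = sum(sequence_deltas(w, obs, z, lam, alpha) for obs, z in training_set)
        w = w + delta
        if np.max(np.abs(delta)) < tol:        # "no significant change" threshold (illustrative)
            return w
```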
Single presentation
The second experiment performed weight updates after each example in a training sequence, instead of waiting until the end of the sequence, and each sequence was presented only once. At the beginning of each training set the weights were initialized to .5 to avoid bias toward either terminal state. Lastly, we averaged the root mean squared error over all 100 training sets.
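Under the same assumptions, the single-presentation procedure can be sketched as follows; here the weight vector is updated immediately after every step of every sequence:

```python
import numpy as np

def train_single(training_set, lam, alpha):
    """Experiment 2: one pass through the training set, updating w after every example."""
    w = np.full(5, 0.5)                        # initialized to .5 at the start of the training set
    for observations, outcome in training_set:
        e = np.zeros_like(w)                   # eligibility trace restarts with each sequence
        for t, x in enumerate(observations):
            e = lam * e + x
            p_t = w @ x
            p_next = outcome if t == len(observations) - 1 else w @ observations[t + 1]
            w = w + alpha * (p_next - p_t) * e # apply the update immediately
    return w
```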
To obtain our results, we calculated the root mean squared error between our learned weights and the ideal predictions. In our first experiment, we computed weights for lambda values [0, .1, .3, .5, .7, .9, 1] using TD(λ) with repeated presentations to produce Figure 3. For the second experiment we computed weights for every combination of lambda in [0, .1, .3, .5, .7, .9, 1] and alpha in [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1] using TD(λ) with single presentations to produce Figure 4. We used a similar process to create Figure 5, instead taking, for each lambda, the lowest error across all values of alpha.
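For reference, the ideal predictions for the nonterminal states B through F are 1/6, 2/6, 3/6, 4/6, and 5/6 (the probability of terminating at G from each state), so the error measure reduces to a few lines; the sweep over lambda, alpha, and the 100 training sets is omitted here for brevity, and the names below are illustrative:

```python
import numpy as np

# True expected outcomes for the nonterminal states B..F of the seven-state walk.
IDEAL = np.arange(1, 6) / 6.0

def rmse(w):
    """Root mean squared error between the learned predictions and the ideal values."""
    return np.sqrt(np.mean((w - IDEAL) ** 2))

# Illustrative use with the earlier sketches, averaging over a list of training sets:
# np.mean([rmse(train_single(ts, lam, alpha)) for ts in training_sets])
```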

Replication of Figure 3 from Sutton (1988)
Like Sutton, we observe a rapid improvement in performance as lambda is reduced below 1, with the best performance occurring at lambda = 0. Interestingly, our errors were lower than Sutton’s, which may be due to our smaller convergence threshold epsilon.

Replication of Figure 4 from Sutton (1988)
Our results for Figure 4 are also consistent with Sutton’s. For all values of alpha, TD(1) performed the worst, as it minimizes error on the training set and is therefore prone to overfitting. This experiment also demonstrated that the value of alpha has a significant effect on performance.
