## Description

Recall that the TD(λ) estimator for an MDP can be thought of as a weighted combination of the k-step estimators $E_k$ for $k \ge 1$.
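As a reminder of what that weighting looks like, the λ-return form found in standard treatments is shown here. This is an assumption, since this handout does not spell the weights out. $E_\lambda$ denotes the TD(λ) estimate.

$$E_\lambda = (1-\lambda)\sum_{k=1}^{\infty}\lambda^{k-1}E_k, \qquad 0 \le \lambda < 1$$

The weights $(1-\lambda)\lambda^{k-1}$ sum to 1, and as λ approaches 1 the estimate approaches the full-return estimate, TD(1).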

Consider the MDP described by the following state diagram. (Assume the discount factor is γ = 1.)

## Procedure

● Find a value of λ, strictly less than 1, such that the TD(λ) estimate equals the TD(1) estimate. Round your answer for λ to three decimal places.

● This HW is designed to help solidify your understanding of the Temporal Difference algorithms and k-step estimators. You will be given the probability of transitioning to State 1 (probToState), a vector of value estimates (valueEstimates), and a vector of rewards {r0, r1, r2, r3, r4, r5, r6}.

● You will be given 10 test cases, and for each you will return the best λ value. Your answer will be graded to 0.001 precision. You may use any programming language and libraries you wish.
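One way to turn the first bullet into a root-finding problem is sketched here. It assumes that the TD(λ) estimate is $E_\lambda = (1-\lambda)\sum_{k \ge 1}\lambda^{k-1}E_k$ and that every episode ends within $K$ steps, so that $E_k = E_\infty$ (the TD(1) estimate) for all $k > K$. The symbols $K$ and $E_\infty$ are notation introduced for this sketch. Because the weights sum to 1:

$$E_\lambda - E_\infty = (1-\lambda)\sum_{k=1}^{K}\lambda^{k-1}\,(E_k - E_\infty)$$

For λ < 1 the factor $(1-\lambda)$ is nonzero, so the two estimates agree exactly when

$$\sum_{k=1}^{K}\lambda^{k-1}\,(E_k - E_\infty) = 0,$$

a polynomial of degree at most $K-1$ in λ whose roots in $[0, 1)$ are the candidate answers.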


## Examples

The following examples can be used to verify your calculation is correct.

● Input: probToState=0.81, valueEstimates={0.0,4.0,25.7,0.0,20.1,12.2,0.0}, rewards={7.9,-5.1,2.5,-7.2,9.0,0.0,1.6}, Output: 0.6226326309908364

● Input: probToState=0.22, valueEstimates={0.0,-5.2,0.0,25.4,10.6,9.2,12.3}, rewards={-2.4,0.8,4.0,2.5,8.6,-6.4,6.1}, Output: 0.49567093118984556

● Input: probToState=0.64, valueEstimates={0.0,4.9,7.8,-2.3,25.5,-10.2,-6.5}, rewards={-2.4,9.6,-7.8,0.1,3.4,-2.1,7.9}, Output: 0.20550275877409016
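The examples above can be checked with a short script. The following is a minimal sketch, not a reference solution, and it depends on an assumption about the state diagram, which is not reproduced in this text. The assumed layout is: state 0 moves to state 1 with probability probToState (reward r0) or to state 2 otherwise (reward r1); states 1 and 2 both move to state 3 (rewards r2 and r3); the chain then continues 3 → 4 → 5 → 6 with rewards r4, r5, r6, and nothing follows state 6. The function names `k_step_estimates` and `find_lambda` are hypothetical. The sketch also assumes the TD(λ) estimate is the (1-λ)λ^(k-1) weighted sum of k-step estimates, so that equality with TD(1) reduces to a polynomial root in λ.

```python
def k_step_estimates(p, V, r):
    """Expected k-step estimates from state 0 under the ASSUMED chain layout."""
    # Expected reward collected over the first two steps (shared by all k >= 2).
    first_two = p * (r[0] + r[2]) + (1 - p) * (r[1] + r[3])
    e1 = p * (r[0] + V[1]) + (1 - p) * (r[1] + V[2])
    e2 = first_two + V[3]
    e3 = first_two + r[4] + V[4]
    e4 = first_two + r[4] + r[5] + V[5]
    e5 = first_two + r[4] + r[5] + r[6] + V[6]
    # Full return with gamma = 1: the TD(1) estimate (assumed equal to E_k for k > 5).
    e_inf = first_two + r[4] + r[5] + r[6]
    return [e1, e2, e3, e4, e5], e_inf


def find_lambda(p, V, r, grid=10000, tol=1e-12):
    """Return the smallest root in [0, 1] of sum_k lam**(k-1) * (E_k - E_inf), or None."""
    estimates, e_inf = k_step_estimates(p, V, r)
    diffs = [e - e_inf for e in estimates]

    def f(lam):
        return sum(d * lam ** k for k, d in enumerate(diffs))

    lo = 0.0
    for i in range(1, grid + 1):
        hi = i / grid
        if f(lo) == 0.0:
            return lo
        if f(lo) * f(hi) < 0:
            # Sign change found on [lo, hi]: refine by bisection.
            while hi - lo > tol:
                mid = (lo + hi) / 2
                if f(lo) * f(mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            return (lo + hi) / 2
        lo = hi
    return None  # no sign change detected on the grid


if __name__ == "__main__":
    lam = find_lambda(
        0.64,
        [0.0, 4.9, 7.8, -2.3, 25.5, -10.2, -6.5],
        [-2.4, 9.6, -7.8, 0.1, 3.4, -2.1, 7.9],
    )
    print(round(lam, 3))
```

Under these assumptions the sketch agrees with the three listed outputs to within the 0.001 grading precision. The grid scan followed by bisection is a deliberately simple choice that needs no third-party libraries; a polynomial root solver would work equally well.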

## Resources

The concepts explored in this homework are covered by:

● Lectures

○ Lesson 3: TD and Friends

● Readings

○ Sutton (1988)

## Submission Details

To complete the assignment, calculate answers to the specific problems given and submit your results at https://rldm.herokuapp.com
