Description

In this assignment you will implement episodic semi-gradient Double Q-learning with tile
coding and ε-greedy action selection to solve the mountain-car problem. Start with your
existing code for running Double Q-learning that you wrote in P1 and for tile coding that
you wrote in P2 (copy these files to a new directory and edit them into new versions).
Use the starter code in learning.py. The code for the Mountain Car problem is available
in the dropbox folder as mountaincar.py. The three actions (decelerate, coast, and
accelerate) are represented by the integers 0, 1, and 2. The states are represented by
tuples of two doubles corresponding to the position and velocity of the car.
mountaincar.py provides two functions:
• mountaincar.init(), which takes no arguments and returns the initial state. In
this case, the initial position is randomly chosen from [–0.6,–0.4) (near the bottom
of the hill) and the initial velocity is zero.
• mountaincar.sample(S,A) –> (R,S’), which returns a tuple of a reward and a
next state, corresponding to taking action A in state S. Arrival in the terminal state is
indicated by S’ = None. In mountain car the rewards and transitions are
deterministic.
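The interface described above can be exercised with a loop like the one below. Because mountaincar.py itself is not reproduced here, the `init` and `sample` functions in this sketch are a hypothetical stand-in, written on the assumption that the provided file follows the standard textbook dynamics (position in [-1.2, 0.5], velocity in [-0.07, 0.07], reward -1 on every step). The `run_episode` helper and its `max_steps` cap are also assumptions added for illustration.

```python
import math
import random


def init():
    """Stand-in for mountaincar.init(): position near the bottom, zero velocity."""
    return (-0.6 + 0.2 * random.random(), 0.0)


def sample(S, A):
    """Stand-in for mountaincar.sample(S, A) -> (R, S').

    Assumes the textbook dynamics; the provided file may differ in details.
    A is 0 (decelerate), 1 (coast), or 2 (accelerate).
    """
    position, velocity = S
    velocity += 0.001 * (A - 1) - 0.0025 * math.cos(3 * position)
    velocity = min(max(velocity, -0.07), 0.07)
    position += velocity
    if position >= 0.5:
        return -1, None  # terminal state is signalled by S' = None
    if position < -1.2:
        position, velocity = -1.2, 0.0  # hitting the left wall stops the car
    return -1, (position, velocity)


def run_episode(policy, max_steps=10000):
    """Run one episode with policy(S) -> action; return (return, steps)."""
    S = init()
    G, steps = 0, 0
    while S is not None and steps < max_steps:
        R, S = sample(S, policy(S))
        G += R
        steps += 1
    return G, steps
```

With the real module, the same loop would call mountaincar.init() and mountaincar.sample(S, A) instead of the stand-ins.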
You will need to change your tile coder so that it covers the 2D state space for the car
position and velocity as given in the textbook (Section 9.5.4). To start with, use the
following parameters:
• numTilings = 4
• shape/size of tilings = 9 x 9, scaled so that an 8×8 subset exactly fills the allowed
state space
• α = 0.1/numTilings
• ε = 0
• initial weights = random numbers between 0 and –0.001
Note that γ =1 in this formulation of the problem and cannot be changed.
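One way the tiling geometry and the initial weights above might be set up is sketched below. It assumes the textbook state bounds (position in [-1.2, 0.5], velocity in [-0.07, 0.07]) and offsets tiling k by k/numTilings of a tile width in both dimensions. The function name `tilecode`, the offset scheme, and the index layout are illustrative assumptions; your P2 tile coder may order things differently.

```python
import random

NUM_TILINGS = 4
TILES_PER_DIM = 9
TILES_PER_TILING = TILES_PER_DIM * TILES_PER_DIM  # 81

# Assumed state bounds (textbook mountain car).
POS_MIN, POS_MAX = -1.2, 0.5
VEL_MIN, VEL_MAX = -0.07, 0.07


def tilecode(position, velocity):
    """Return one active tile index per tiling, each in [0, 4 * 81)."""
    # Rescale so the allowed state space spans exactly 8 tile widths.
    pos = (position - POS_MIN) / (POS_MAX - POS_MIN) * (TILES_PER_DIM - 1)
    vel = (velocity - VEL_MIN) / (VEL_MAX - VEL_MIN) * (TILES_PER_DIM - 1)
    indices = []
    for tiling in range(NUM_TILINGS):
        offset = tiling / NUM_TILINGS  # fraction of a tile width
        col = int(pos + offset)
        row = int(vel + offset)
        indices.append(tiling * TILES_PER_TILING + row * TILES_PER_DIM + col)
    return indices


def initial_weights(n):
    """Random initial weights between -0.001 and 0."""
    return [random.uniform(-0.001, 0.0) for _ in range(n)]
```

The ninth row and column of each tiling exist only to absorb the offsets, which is why 9 x 9 tiles are needed to cover an 8 x 8 region.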
There is no explicit description of episodic semi-gradient Double Q-learning in the book.
You will have to use what is in the book to imagine the natural way to extend your Q-learning algorithm (page 143) to the semi-gradient, linear case, and then use tile coding
to produce the feature vectors. The number of components of the parameter vector, n, is
the total number of features, which is, with the standard parameters above, 4 x (9 x 9) x
3 (numTilings x tilesPerTiling x numActions). Basically, you call your tile coder on the
state to get a list of four tile indices. These are numbers between 0 and 4 x 9 x 9 – 1. If
the action is 0, then these four are the places where φ is 1 (elsewhere 0). If the action is
1, then you add 4 x 9 x 9 to these numbers to get the places where φ is 1. And if
the action is 2 then you add twice that. Basically, you are shifting the 1 indices into a
unique third of the feature vector depending on which action is specified. This will pick
out a different third of the parameter vector, θ, for learning about each action.
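The index shifting described above, together with one plausible form of the update, can be sketched as follows. Since the book gives no explicit algorithm, this is only one reading of the "natural extension": two weight vectors are kept, a coin flip picks which one learns, the learner chooses the maximizing next action, and the other vector evaluates it. All function names here are hypothetical, `tiles` stands for the four indices returned by your tile coder, and acting greedily on the sum of the two estimates is an assumption.

```python
import random

NUM_TILINGS = 4
TILES_PER_TILING = 9 * 9
NUM_ACTIONS = 3
FEATURES_PER_ACTION = NUM_TILINGS * TILES_PER_TILING  # 324
N = FEATURES_PER_ACTION * NUM_ACTIONS                 # 972 weights per vector


def active_features(tiles, action):
    """Shift the tile indices into the third of the vector owned by `action`."""
    return [t + action * FEATURES_PER_ACTION for t in tiles]


def q_hat(theta, tiles, action):
    """Linear action value: the sum of the weights at the active features."""
    return sum(theta[i] for i in active_features(tiles, action))


def greedy_action(theta, tiles):
    values = [q_hat(theta, tiles, a) for a in range(NUM_ACTIONS)]
    return values.index(max(values))


def select_action(theta1, theta2, tiles, epsilon=0.0):
    """Epsilon-greedy on the sum of both estimates (an assumed choice)."""
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    values = [q_hat(theta1, tiles, a) + q_hat(theta2, tiles, a)
              for a in range(NUM_ACTIONS)]
    return values.index(max(values))


def double_q_update(theta1, theta2, tiles, action, reward, next_tiles, alpha):
    """One semi-gradient Double Q-learning step with gamma = 1."""
    if random.random() < 0.5:
        learner, evaluator = theta1, theta2
    else:
        learner, evaluator = theta2, theta1
    if next_tiles is None:  # next state is terminal
        target = reward
    else:
        best = greedy_action(learner, next_tiles)
        target = reward + q_hat(evaluator, next_tiles, best)
    delta = target - q_hat(learner, tiles, action)
    # With binary features the gradient is 1 at each active feature.
    for i in active_features(tiles, action):
        learner[i] += alpha * delta
```

Because the features are binary, the gradient step touches only four weights per update, which keeps each step cheap.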
Once your code is working, try a run of 1000 episodes. The initial episodes will be quite
long, but eventually a good solution should be found wherein episodes are 200 steps
long or less. After good performance is reached, make a 3D plot of minus the learned
state values. That is, plot
$$-\max_a \hat{q}(s, a, \theta) = -\max_a \theta^\top \phi(s, a)$$
as a function of the state, over the range of allowed positions and velocities as given in the
book. You may use the provided function writeF and the file plot.py to make the 3D plot.
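The quantity to plot, minus the largest action value at each state, could be tabulated over a grid as in the sketch below. The helper `height_grid` is hypothetical and takes any function `q_hat(state, action)`; the state bounds are the assumed textbook ranges. How the grid is then handed to the provided writeF is not shown, because its signature is not given here.

```python
# Assumed state bounds (textbook mountain car).
POS_MIN, POS_MAX = -1.2, 0.5
VEL_MIN, VEL_MAX = -0.07, 0.07


def height_grid(q_hat, steps=50, num_actions=3):
    """Return rows of -max_a q_hat((position, velocity), a) over a grid.

    Rows run over position and columns over velocity, each with
    steps + 1 evenly spaced points including both ends of the range.
    """
    grid = []
    for i in range(steps + 1):
        position = POS_MIN + (POS_MAX - POS_MIN) * i / steps
        row = []
        for j in range(steps + 1):
            velocity = VEL_MIN + (VEL_MAX - VEL_MIN) * j / steps
            state = (position, velocity)
            best = max(q_hat(state, a) for a in range(num_actions))
            row.append(-best)
        grid.append(row)
    return grid
```

Since every reward is -1, the learned values are negative, so the plotted heights are positive and roughly estimate the number of steps to the goal.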
Now add an outer loop and run 50 independent runs of 200 episodes each, with the weights
reset at the beginning of each run. Use Excel or some other plotting package to produce a
graph of the average (over runs) of the return and of number of steps, versus episode number.
You may use the provided plotreturns.gnuplot to make this plot.
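The averaging for the learning curve can be sketched as below, assuming each run has been recorded as a list of (return, steps) pairs, one per episode. The function name and the data layout are illustrative assumptions, not part of the starter code.

```python
def average_curves(runs):
    """Average return and step count per episode across independent runs.

    runs[r][e] is the (return, steps) pair for episode e of run r.
    Returns two lists indexed by episode number.
    """
    num_runs = len(runs)
    num_episodes = len(runs[0])
    avg_return = [sum(run[e][0] for run in runs) / num_runs
                  for e in range(num_episodes)]
    avg_steps = [sum(run[e][1] for run in runs) / num_runs
                 for e in range(num_episodes)]
    return avg_return, avg_steps
```

For 50 runs of 200 episodes, `runs` would hold 50 lists of 200 pairs, and each returned list would have 200 entries to plot against episode number.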
What to turn in. Turn in your modified versions of learning.py and Tilecoder.py, and your
3D plot and your learning curve in a file named P3.pdf.
Extra Credit
Experiment with changing the parameters from the values listed above to see if you can
get faster learning or better final performance than is obtained with the original
parameter settings. You can also change the kind of traces used and the tile-coding
strategy (number of tilings, size and shape of the tiles). As an overall measure of
performance on a run, use the sum of all the rewards received in the first 200 episodes.
If you can find a set of parameters that improves this performance measure by two
standard errors, then you will earn an extra 8 points (out of 72 total on the project). To
show the improvement, you must do many runs with the standard parameters and then
many runs with your parameters, and measure the mean performance and standard
error in each case (a standard error is the standard deviation of the performance
numbers divided by the square root of the number of runs). If the difference between the
means is greater than 2.5 times the larger of the two standard errors, then you have
shown that your parameters are significantly better. It is permissible to use any number
of runs greater than or equal to 50 (note that larger numbers of runs will tend to make the
standard errors smaller).
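The comparison described above can be sketched as follows, where each list holds one performance number per run (the sum of all rewards over the first 200 episodes, so larger is better). The function names are hypothetical, and using the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which form is intended.

```python
import math
import statistics


def mean_and_standard_error(performances):
    """Mean and standard error of a list of per-run performance numbers."""
    mean = statistics.mean(performances)
    # Standard error = standard deviation / sqrt(number of runs).
    se = statistics.stdev(performances) / math.sqrt(len(performances))
    return mean, se


def significantly_better(baseline, candidate):
    """True if the candidate mean beats the baseline mean by more than
    2.5 times the larger of the two standard errors."""
    base_mean, base_se = mean_and_standard_error(baseline)
    cand_mean, cand_se = mean_and_standard_error(candidate)
    return cand_mean - base_mean > 2.5 * max(base_se, cand_se)
```

For example, with baseline = [-10, -12, -14, -12] and candidate = [-5, -6, -7, -6], the means differ by 6 while the larger standard error is about 0.82, so the test passes.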
To collect your extra credit, report the alternate parameter settings that you found, the
means and standard errors you obtained for the two sets of parameters, and the
number of runs you used in each case. A file extraCredit.txt is provided and you have to
fill that file in. You can also hand in a file named ExtraCreditTilecoder.py if you like
(optional even if you are going for extra credit).
Extra Extra Credit (for 366 students only)
Finally, once you have found your favourite set of parameters, make a learning curve
based on 500 runs of 200 episodes with them. Provide a printout of this curve, along
with its single average performance number and its standard error on these 500 runs.
This will be used to compare your team’s performance with that of the other teams. The
top three teams will receive additional extra credit:
1st place: +12 points on the project
2nd place: +8 points
3rd place: +4 points
How and what to hand in:
Submit your assignment on Gradescope. Make sure your code runs in a Python 3
interpreter. Comment out all the lines where you print something in the submitted files
(your scripts should not print anything to the screen). You should submit your
Tilecoder.py, learning.py, and extraCredit.txt in “Programming Assignment 3
Code” in Gradescope. Fill in extraCredit.txt if you are
going for the extra credit. Use the updated templates. You may also need to submit
collaborator.txt if you have a collaborator. You may also submit an
ExtraCreditTilecoder.py if you want. Finally, submit your P3.pdf in “Programming
Assignment 3 PDFs (Marked)” in Gradescope. Gradescope submission will be closed
after the deadline, so if you are using slip days you need to email your assignment
to Mohammad Ajallooeian (TA).

COMP366 P3 - Mountain Car Programming Project
