COMP 767 Reinforcement Learning: Assignment 2
February 9, 2020 (due: Feb 28, 2020)

Description

1 Theory
Question 1. [25 points] Bias-variance trade-off (Choose one of the following two
tracks)
A. Monte-Carlo (MC) methods are defined by forming estimates of returns ($G_t$) by
performing rollouts in the environment (to collect trajectories), usually without using
concepts from Bellman equations. Temporal Difference (TD) methods are similar to
MC methods, except they use a “bootstrapped” target, i.e. the target value depends
on the current estimate of the value function.
i Explain how these differences affect the learning process, specifically the bias-variance
trade-off. For each method, comment on how the estimated value function depends on the
amount of data available to the agent, and briefly explain how these approaches scale with
experience. Relate your answer to TD(λ) methods. (A small tabular sketch contrasting the
two targets appears after part iii below.)
ii MC methods are known to have bounds that are independent of the size of the
state space:
\[
|\hat{v}(s) - v^{\pi}(s)| \;\le\; \frac{R_{\max}}{1-\gamma}\sqrt{\frac{1}{2n}\ln\frac{2}{\delta}}, \qquad \forall s \in S, \tag{1}
\]
where n is the sample size, rewards lie in $[0, R_{\max}]$, $\hat{v}$ is the MC estimate
of the value function for policy π, $v^{\pi}$ is the true value function of this policy,
and the bound holds with probability $1 - \delta$. Explain why this is the case, along
with its advantages and disadvantages. In your answer, focus on how one would learn the
optimal policy using MC methods and on the sample complexity of such an approach.
iii Explain why/how TD methods can be viewed as learning an implicit model of
the world and solving for it at the same time. Describe why this leads to different
solutions for MC and TD.
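For parts (i) and (iii), it can help to have the two targets written out concretely. Below is a minimal tabular sketch (the tiny deterministic chain, the function names, and the step size are illustrative assumptions, not part of the assignment) contrasting a Monte-Carlo update, whose target is the sampled return, with a TD(0) update, whose target bootstraps on the current estimate.

```python
def rollout():
    """Hypothetical 3-state chain 0 -> 1 -> terminal; reward 1 on the last transition."""
    return [(0, 0.0), (1, 1.0)]           # list of (state, reward received on leaving it)

def mc_update(V, traj, alpha=0.1, gamma=0.9):
    """Monte-Carlo: the target is the full sampled return G_t (unbiased, higher variance)."""
    G = 0.0
    for s, r in reversed(traj):
        G = r + gamma * G                 # return observed from state s onward
        V[s] += alpha * (G - V[s])
    return V

def td0_update(V, traj, alpha=0.1, gamma=0.9):
    """TD(0): the target r + gamma * V(s') bootstraps on the current estimate (biased, lower variance)."""
    for (s, r), nxt in zip(traj, traj[1:] + [(None, 0.0)]):
        v_next = 0.0 if nxt[0] is None else V[nxt[0]]
        V[s] += alpha * (r + gamma * v_next - V[s])
    return V

V_mc, V_td = [0.0, 0.0], [0.0, 0.0]
for _ in range(200):
    traj = rollout()
    mc_update(V_mc, traj)
    td0_update(V_td, traj)
print(V_mc, V_td)                          # both converge to [0.9, 1.0] on this chain
```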
B. In this question we will bound the bias and variance of phased TD updates
(Kearns and Singh, 2000). The phased TD algorithm removes the complications introduced
by the learning rate α: at each phase we are provided with n trajectories generated by
following a policy π from every state, and each phased update is carried out using the
data from these n trajectories. The phased value function is then obtained by averaging
the updates over the trajectories. Specifically, we define phased TD(k) and phased TD(λ),
respectively, as below (a minimal sketch of the TD(k) phase update follows the definitions):
\[
\bar{v}^{\,k}_{\pi,t+1}(s) \;=\; \frac{1}{n}\sum_{i=1}^{n}\left[\,\sum_{j=1}^{k}\gamma^{\,j-1}\,r^{(i)}_{j+1} \;+\; \gamma^{k}\,\bar{v}^{\,k}_{\pi,t}\!\left(s^{(i)}_{k+1}\right)\right],
\]
\[
\bar{v}^{\,\lambda}_{\pi,t+1}(s) \;=\; \frac{1}{n}\sum_{i=1}^{n}(1-\lambda)\sum_{k=1}^{\infty}\lambda^{k-1}\left[\,\sum_{j=1}^{k}\gamma^{\,j-1}\,r^{(i)}_{j+1} \;+\; \gamma^{k}\,\bar{v}^{\,\lambda}_{\pi,t}\!\left(s^{(i)}_{k+1}\right)\right].
\]
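For concreteness, here is a minimal tabular sketch of one phase of the TD(k) update defined above. The trajectory sampler `sample_trajectory(s, k)` is a hypothetical stand-in for the n fresh trajectories provided in phase t; the sketch only illustrates the averaging in the definition, under those assumptions.

```python
import numpy as np

def phased_td_k_update(v_prev, sample_trajectory, states, n, k, gamma):
    """One phase of phased TD(k): for every state, average the k-step
    bootstrapped return over n fresh trajectories starting from that state.

    sample_trajectory(s, k) is assumed to return the first k rewards
    [r_2, ..., r_{k+1}] of a trajectory from s and the state s_{k+1} it reaches.
    """
    v_new = np.zeros_like(v_prev)
    for s in states:
        total = 0.0
        for _ in range(n):
            rewards, s_after_k = sample_trajectory(s, k)
            k_step_return = sum(gamma ** (j - 1) * r for j, r in enumerate(rewards, start=1))
            total += k_step_return + gamma ** k * v_prev[s_after_k]
        v_new[s] = total / n              # average over the n trajectories of this phase
    return v_new
```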
i Let S(t) be the set of trajectories generated by π in phase t (n trajectories from each
state), let $\bar{v}^{\,k}_{\pi,t}(s)$ be the value function estimate of phased TD(k) after
phase t, and let $\Delta_t = \max_s \bigl|\bar{v}^{\,k}_{\pi,t}(s) - v_{\pi}(s)\bigr|$. Then,
for any γ ∈ [0, 1), rewards r ∈ [−1, +1], 0 < δ < 1 and
$\varepsilon = \sqrt{\frac{2\log(2k/\delta)}{n}}$, prove that, with probability 1 − δ,
\[
\Delta_t \;\le\; \varepsilon\,\frac{1-\gamma^{k}}{1-\gamma} \;+\; \gamma^{k}\,\Delta_{t-1}.
\]
ii Let S(t) be the set of trajectories generated by π in phase t (n trajectories from each
state), let $\bar{v}^{\,\lambda}_{\pi,t}(s)$ be the value function estimate of phased TD(λ)
after phase t, and let $\Delta_t = \max_s \bigl|\bar{v}^{\,\lambda}_{\pi,t}(s) - v_{\pi}(s)\bigr|$.
Then, for any γ ∈ [0, 1), λ ∈ [0, 1), rewards r ∈ [−1, +1], 0 < δ < 1 and
$\varepsilon = \sqrt{\frac{2\log(2k/\delta)}{n}}$, prove that, with probability 1 − δ,
\[
\Delta_t \;\le\; \min_{k}\left\{\varepsilon\,\frac{1-(\gamma\lambda)^{k}}{1-\gamma\lambda} \;+\; \frac{(\gamma\lambda)^{k}}{1-\gamma\lambda}\right\} \;+\; \frac{(1-\lambda)\gamma}{1-\gamma\lambda}\,\Delta_{t-1}.
\]
Tip: use Hoeffding's inequality and the union bound.
Question 2. [25 points] (Choose one of the following two tracks)
A. SARSA/Expected SARSA/Q-learning with Function Approximation
i Outline a SARSA algorithm with function approximation that uses Boltzmann
exploration instead of the standard ε-greedy policy (a possible starting point is
sketched after part iii below).
ii Explain why SARSA with an ε-greedy policy can fail to converge and why Boltzmann
exploration can help in this respect (see Gordon, 1996).
iii Explain the connection between SARSA and Q-learning. Show how Q-learning
can be viewed as a special case of Expected SARSA.
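One possible starting point for part A.i is sketched below: linear SARSA with Boltzmann (softmax) action selection. The feature map `phi(s, a)`, the action set, the temperature, and the classic Gym-style `reset()`/`step()` interface (returning a 4-tuple) are all assumptions to be replaced by your own choices.

```python
import numpy as np

def boltzmann_action(w, phi, s, actions, temperature):
    """Boltzmann exploration: P(a|s) is proportional to exp(q(s, a) / temperature)."""
    q = np.array([w @ phi(s, a) for a in actions])
    prefs = (q - q.max()) / temperature          # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(actions), p=probs)

def sarsa_boltzmann(env, phi, num_features, actions,
                    episodes=500, alpha=0.1, gamma=1.0, temperature=1.0):
    """Linear SARSA with Boltzmann exploration (sketch; classic 4-tuple step API assumed)."""
    w = np.zeros(num_features)
    for _ in range(episodes):
        s = env.reset()
        a = boltzmann_action(w, phi, s, actions, temperature)
        done = False
        while not done:
            s_next, r, done, _ = env.step(actions[a])
            target = r
            if not done:
                a_next = boltzmann_action(w, phi, s_next, actions, temperature)
                target += gamma * (w @ phi(s_next, actions[a_next]))
            td_error = target - w @ phi(s, actions[a])
            w += alpha * td_error * phi(s, actions[a])
            if not done:
                s, a = s_next, a_next
    return w
```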
B. Certainty-equivalence RL
Certainty-equivalence is a model-based RL algorithm, meaning the algorithm first
estimates a reward model R(s, a) and a transition model P(s, a, ·) from the available
data. In the tabular setting, given a dataset of trajectories,
$D = \{(s_1, a_1, r_1, s_2, \ldots, s_{H+1})\}$, for a horizon H, we first convert each
trajectory into transition tuples $\{(s, a, r, s')\}$. So, for a trajectory of horizon H,
we have H tuples:
\[
(s_1, a_1, r_1, s_2),\,(s_2, a_2, r_2, s_3),\,\ldots,\,(s_H, a_H, r_H, s_{H+1}).
\]
Using these tuples, the tabular certainty-equivalence model estimates the transition
model and the reward model by maximum likelihood, which corresponds to the empirical
frequencies observed in the dataset, as shown below:
\[
\hat{P}(s, a) \;=\; \frac{1}{|D_{s,a}|}\sum_{(r, s') \in D_{s,a}} e_{s'}, \tag{2}
\]
\[
\hat{R}(s, a) \;=\; \frac{1}{|D_{s,a}|}\sum_{(r, s') \in D_{s,a}} r,
\]
where $e_{s'}$ is the unit vector whose entry for state s' is 1 and all other entries are 0,
and $D_{s,a}$ is the subset of tuples whose first element is s and whose second element is a;
for such a tuple we write $(r, s') \in D_{s,a}$. Let $n = |D_{s,a}|$ and note that the count
of each state-action pair satisfies n(s, a) > 0, ∀s ∈ S and ∀a ∈ A. The algorithm then performs policy
evaluation or control using these models.
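As a concrete illustration of Equation (2), here is a minimal tabular sketch of the maximum-likelihood model estimate from a list of (s, a, r, s') tuples; the function name and interface are placeholders.

```python
import numpy as np

def certainty_equivalence_model(transitions, num_states, num_actions):
    """Empirical MLE model: P_hat(s'|s, a) = count(s, a, s') / count(s, a) and
    R_hat(s, a) = mean reward observed from (s, a).  As in the problem statement,
    every (s, a) pair is assumed to appear at least once in the data."""
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sums = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1.0
        reward_sums[s, a] += r
        visits[s, a] += 1.0
    P_hat = counts / visits[:, :, None]          # each P_hat[s, a] is a distribution over s'
    R_hat = reward_sums / visits
    return P_hat, R_hat
```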
The other popular family of RL algorithms, for example Q-learning and SARSA, estimates
the q-value functions directly, without estimating a model.
i Compare the two approaches by specifying the advantages and disadvantages in
terms of performance, memory, and time complexity.
ii We are interested in deriving high-probability guarantees for the optimal policy
in $\hat{M} = (S, A, \hat{P}, \hat{R}, \gamma)$ as a function of n. When n is large enough,
$\hat{R} \approx R$ and $\hat{P} \approx P$ with probability at least 1 − δ (δ ∈ (0, 1));
one can prove the following
result using the union bound and Hoeffding's inequality:
\[
\max_{s,a}\,\bigl|\hat{R}(s, a) - R(s, a)\bigr| \;\le\; R_{\max}\sqrt{\frac{1}{2n}\ln\frac{4|S \times A|}{\delta}}, \tag{3}
\]
\[
\max_{s,a,s'}\,\bigl|\hat{P}(s' \mid s, a) - P(s' \mid s, a)\bigr| \;\le\; \sqrt{\frac{1}{2n}\ln\frac{4|S \times A \times S|}{\delta}}.
\]
To prove the suboptimality of $\pi^{\star}_{\hat{M}}$, you should first try to prove the
simulation lemma, which bounds the policy evaluation error in the learned MDP
(a small numerical sanity check of this lemma is sketched after part iv): for any policy π,
\[
\bigl\| V^{\pi}_{\hat{M}} - V^{\pi}_{M} \bigr\|_{\infty} \;\le\; \frac{\varepsilon_R}{1-\gamma} \;+\; \frac{\gamma\,\varepsilon_P\,R_{\max}}{2(1-\gamma)^{2}}, \tag{4}
\]
where $\max_{s,a}|\hat{R}(s,a) - R(s,a)| \le \varepsilon_R$ and
$\max_{s,a}\|\hat{P}(s,a) - P(s,a)\|_1 \le \varepsilon_P$.
iii We are now interested in going from policy evaluation to the control setting. We
want to bound the error of the optimal policy derived in the approximate MDP $\hat{M}$,
namely $\pi^{\star}_{\hat{M}}$. Prove that
\[
\forall s \in S, \qquad V^{\star}_{M}(s) - V^{\pi^{\star}_{\hat{M}}}_{M}(s) \;\le\; 2 \sup_{\pi\,:\,S \to A}\bigl\| V^{\pi}_{\hat{M}} - V^{\pi}_{M} \bigr\|_{\infty}. \tag{5}
\]
iv Putting (ii) and (iii) together, along with the concentration inequalities, prove that
\[
V^{\star}_{M}(s) - V^{\pi^{\star}_{\hat{M}}}_{M}(s) \;=\; O\!\left(\frac{|S|}{\sqrt{n}\,(1-\gamma)^{2}}\right), \qquad \forall s \in S. \tag{6}
\]
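Before proving the simulation lemma, it can be useful to check it numerically on a small random MDP. The sketch below (all names and the particular perturbation are illustrative assumptions) computes V^π exactly in M and in a perturbed M̂ by solving the linear system V = R_π + γ P_π V, then compares the sup-norm gap against the right-hand side of (4).

```python
import numpy as np

def policy_values(P, R, pi, gamma):
    """Exact V^pi from the Bellman equation V = R_pi + gamma * P_pi V."""
    S = R.shape[0]
    P_pi = P[np.arange(S), pi]        # (S, S) transition matrix under pi
    R_pi = R[np.arange(S), pi]        # (S,) expected rewards under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

rng = np.random.default_rng(0)
S, A, gamma, R_max = 5, 3, 0.9, 1.0

# A random "true" MDP M and a slightly perturbed model M_hat.
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, R_max, size=(S, A))
P_hat = 0.95 * P + 0.05 * rng.dirichlet(np.ones(S), size=(S, A))
R_hat = np.clip(R + rng.normal(0, 0.01, size=(S, A)), 0, R_max)

pi = rng.integers(A, size=S)          # an arbitrary deterministic policy
eps_R = np.abs(R_hat - R).max()
eps_P = np.abs(P_hat - P).sum(axis=2).max()      # max over (s, a) of the L1 distance

gap = np.abs(policy_values(P_hat, R_hat, pi, gamma) - policy_values(P, R, pi, gamma)).max()
bound = eps_R / (1 - gamma) + gamma * eps_P * R_max / (2 * (1 - gamma) ** 2)
print("sup-norm evaluation gap:", gap, " simulation-lemma bound:", bound)
```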
2 Coding
Question 1. [25 points] Continuous Random Walk (Prediction task)
In this task, a state $s_t$ is a point in the interval [0, 1]. The task is episodic:
the episode ends when the agent's position goes below 0 or above 1. The episode starts
at state 0.5. At each step, the agent moves up or down by a random step drawn uniformly
from the interval [-0.2, 0.2]. The reward $r_t$ is 0 everywhere except on termination,
when the reward equals the termination state; for example, if the agent terminates in
state -0.05, it receives a reward of -0.05. Set the discount rate γ to 1.
• Implement sparse coarse coding (refer to Section 9.5.4 of Sutton and Barto, 2018)
to construct binary features from the real-valued state. Follow this protocol: divide
the state space [0, 1] into 10 equal-sized intervals, and add one additional interval
to allow the whole tiling to be offset. Repeat this 10 times to obtain 10 different
tilings, each offset by a different randomly selected fraction of an interval. For each
interval, include a binary feature that is 1 when the state falls within that tile and
0 otherwise. (A minimal sketch of one possible implementation appears after this list.)
• Using the features constructed, implement TD(λ) with accumulating traces (see
Chapter 12 Sutton and Barto, 2018) for various values of λ and learning rate α.
Use a linear function approximator that is initialized to predict 0. You must
run the experiment on 50 different seeds for each combination of λ and α values
and average the results.
• Present your results in a plot where the x-axis is learning rate and the y-axis is
Mean Squared Value Error (MSVE) after the agent is trained for 25 episodes.
Since the state space is continuous, sample 21 different points in the interval
[0,1], evenly spaced at 0.05. Also, note that the correct predictions are equal to
the position. Note that for each λ value you will get a curve that is averaged
over 50 different seeds. Plot the averaged result, and also show the confidence
interval (using the standard deviation across all the runs).
• What do you observe? Write a brief report on your observations.
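As referenced in the first bullet, here is a minimal sketch of one possible implementation of the features and the TD(λ) update for this task. The offset scheme, the seed handling, and the helper names are assumptions; treat it as a starting point rather than the required protocol.

```python
import numpy as np

def make_offsets(rng, num_tilings=10):
    """One random offset per tiling, as a fraction of the 0.1 tile width."""
    return rng.uniform(0.0, 0.1, size=num_tilings)

def features(state, offsets, tiles_per_tiling=11):
    """Binary coarse-coding features: 10 intervals of width 0.1 plus one extra
    interval per tiling to allow the offset; exactly one active tile per tiling."""
    x = np.zeros(len(offsets) * tiles_per_tiling)
    for i, off in enumerate(offsets):
        idx = min(int((state + off) / 0.1), tiles_per_tiling - 1)
        x[i * tiles_per_tiling + idx] = 1.0
    return x

def run_td_lambda(lam, alpha, episodes=25, seed=0, gamma=1.0):
    rng = np.random.default_rng(seed)
    offsets = make_offsets(rng)
    w = np.zeros(len(offsets) * 11)               # linear weights; initial prediction is 0
    for _ in range(episodes):
        s, z = 0.5, np.zeros_like(w)              # start state and accumulating trace
        while 0.0 <= s <= 1.0:
            s_next = s + rng.uniform(-0.2, 0.2)
            done = not (0.0 <= s_next <= 1.0)
            r = s_next if done else 0.0           # terminal reward equals the exit position
            x = features(s, offsets)
            v_next = 0.0 if done else w @ features(s_next, offsets)
            delta = r + gamma * v_next - w @ x
            z = gamma * lam * z + x               # accumulating eligibility trace
            w += alpha * delta * z
            s = s_next
    # MSVE over 21 evenly spaced points; the true value of a state is the position itself
    points = np.linspace(0.0, 1.0, 21)
    preds = np.array([w @ features(p, offsets) for p in points])
    return np.mean((preds - points) ** 2)

print(run_td_lambda(lam=0.9, alpha=0.05))
```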
Question 2. [25 points] Control task (Choose one of the following two tracks)
A. Use the mountain car environment from OpenAI Gym (Brockman et al., 2016).
Set the discount factor γ to 1.
• Construct features from the 2D continuous state space using grid tilings, in exactly
the same way as Section 10.1 of Sutton and Barto (2018). To summarize the method: use 8
tilings, with each tile spanning 1/8th of the bounded distance in each dimension (refer
to Section 10.1 for details) and asymmetric offsets. As a result, you obtain features
ϕ(s, a).
• Using the features constructed, implement SARSA(λ) with replacing traces (see
Chapter 12 of Sutton and Barto, 2018) for various values of λ and learning rate α;
a sketch of the per-step update appears after this list. Use a linear function
approximator that is initialized to predict 0. Run the
experiment on 20 different seeds for each combination of λ and α values. For
exploration use an ε-greedy policy with ε = 0.1.
• Present your results in a plot where the x-axis is the learning rate and the y-axis
is the number of steps per episode needed to reach the goal after the algorithm has been
trained for 1000 episodes (capped at the maximum of 2000 steps when the goal is not
reached). Note that for each λ value you will get a curve that is averaged over 20
different seeds. Also show the confidence interval (using the standard deviation
obtained from the runs).
• What do you observe? Write a brief report on your observations.
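As referenced above, here is a sketch of the per-step SARSA(λ) update with replacing traces for binary tile features. The `active_tiles`/`next_active_tiles` interface is a hypothetical stand-in for whatever tile coder you implement, and common variants also divide α by the number of tilings.

```python
import numpy as np

def sarsa_lambda_replacing_step(w, z, active_tiles, a, r,
                                next_active_tiles, a_next, done,
                                alpha, gamma, lam):
    """One SARSA(lambda) step with binary tile features and replacing traces.

    w and z have shape (num_actions, num_tiles); active_tiles are the indices of
    the tiles that are on for the current state, next_active_tiles for the next
    state, and (a, a_next) are the actions taken there.
    """
    q_sa = w[a, active_tiles].sum()                       # linear value: sum of active weights
    q_next = 0.0 if done else w[a_next, next_active_tiles].sum()
    delta = r + gamma * q_next - q_sa                     # TD error
    z *= gamma * lam                                      # decay every trace
    z[a, active_tiles] = 1.0                              # replacing traces: set, do not accumulate
    w += alpha * delta * z
    return w, z
```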
B. Use the cartpole environment from OpenAI Gym (Brockman et al., 2016). Set
the discount factor γ to 0.9.
• Implement 4-step SARSA, 4-step Expected SARSA, and 4-step Q-learning (see Chapter 10
of Sutton and Barto, 2018) with a two-layer neural network (64 hidden units in each
layer) to learn the task; a sketch of the three 4-step targets appears after this list.
Use experience replay to stabilize the learning process. For exploration use an
ε-greedy policy with ε = 0.1.
• Run all 3 algorithms with the following sizes of experience replay buffers: 50,
100, 250, 500.
• Present your results in two different plots. (1) In the first plot, the x-axis should
show the learning rate and the y-axis the average return over the last 10 episodes of
training (you must train until convergence); this plot will have 4 different lines, one
for each replay buffer size. (2) For the second plot, choose the best combination of
replay buffer size and learning rate, and plot the episode number on the x-axis against
the return obtained in that episode on the y-axis. Run all the experiments on 20
different seeds. Also show the confidence interval (using the standard deviation
obtained from the runs).
• What do you observe? Write a brief report on your observations.
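As referenced in the first bullet, the sketch below spells out how the three 4-step targets differ, given the rewards of a sampled 4-step segment and the network's Q-values at the state four steps ahead. The interface is hypothetical, and it ignores the truncation needed when the episode terminates inside the segment as well as off-policy corrections for n-step Q-learning.

```python
import numpy as np

def four_step_targets(rewards, q_next, next_action, epsilon, gamma=0.9):
    """4-step targets for SARSA, Expected SARSA, and Q-learning.

    rewards      -- the four rewards r_{t+1}, ..., r_{t+4} of the segment
    q_next       -- Q-values predicted by the network at s_{t+4}
    next_action  -- the action actually taken at s_{t+4} (used by SARSA)
    epsilon      -- the epsilon-greedy parameter of the behaviour policy
    """
    g = sum(gamma ** i * r for i, r in enumerate(rewards))    # discounted 4-step reward sum
    num_actions = len(q_next)
    probs = np.full(num_actions, epsilon / num_actions)       # epsilon-greedy probabilities
    probs[int(np.argmax(q_next))] += 1.0 - epsilon
    boot = gamma ** len(rewards)
    return {
        "sarsa": g + boot * q_next[next_action],              # bootstrap on the sampled action
        "expected_sarsa": g + boot * float(probs @ q_next),   # expectation under the policy
        "q_learning": g + boot * float(q_next.max()),         # greedy bootstrap
    }

print(four_step_targets(rewards=[1.0, 1.0, 1.0, 1.0],
                        q_next=np.array([2.0, 3.0]), next_action=0, epsilon=0.1))
```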
References
Brockman, Greg et al. (2016). OpenAI Gym. eprint: arXiv:1606.01540.
Gordon, Geoffrey J. (1996). Chattering in SARSA(λ): A CMU Learning Lab Internal
Report. Tech. rep.
Kearns, Michael J and Satinder P Singh (2000). “Bias-Variance Error Bounds for
Temporal Difference Updates.” In: COLT. Citeseer, pp. 142–147.
Sutton, Richard S and Andrew G Barto (2018). Reinforcement Learning: An Introduction. MIT Press.