CS 234 Assignment 1

For submission instructions, please refer to the website. For all problems, if you use an existing result
from either the literature or a textbook to solve an exercise, you need to cite the source.
1 Gridworld [15 pts]
Consider the following grid environment. Starting from any unshaded square, you can move up,
down, left, or right. Actions are deterministic and always succeed (e.g. going left from state 16 goes
to state 15) unless they will cause the agent to run into a wall. The thicker edges indicate walls, and
attempting to move in the direction of a wall results in staying in the same square (e.g. going in
any direction other than left from state 16 stays in 16). Taking any action from the green target
square (no. 12) earns a reward of rg (so r(12, a) = rg ∀a) and ends the episode. Taking any action
from the red square of death (no. 5) earns a reward of rr (so r(5, a) = rr ∀a) and ends the episode.
Otherwise, from every other square, taking any action is associated with a reward rs ∈ {−1, 0, +1}
(even if the action results in the agent staying in the same square). Assume the discount factor
γ = 1, rg = +5, and rr = −5 unless otherwise specified.
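One way to see how rs trades off against path length: since γ = 1, a trajectory that takes k actions from unshaded squares (each earning rs) and then takes one action from the green square (earning rg and ending the episode) has total return k·rs + rg. Here k is just our shorthand for the number of non-terminal steps, not a quantity defined in the problem.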
(a) (3pts) Define the value of rs that would cause the optimal policy to return the shortest path
to the green target square (no. 12). Using this rs, find the optimal value for each square.
(b) (3pts) Let's refer to the value function derived in (a) as V^{πg}_old and the policy as πg. Suppose
we are now in a new gridworld where all the rewards (rs, rg, and rr) have +2 added to them.
If we still follow πg from the original gridworld, what will the new values V^{πg}_new be in this
second gridworld?
(c) (3pts) Consider a general MDP with rewards and transitions, and a discount factor γ.
For this part, assume that the horizon is infinite (so there is no termination). A policy π in
this MDP induces a value function V^π (let's refer to this as V^π_old). Now suppose we have a new
MDP whose only difference is that all rewards have a constant c added to them. Can you
come up with an expression for the new value function V^π_new induced by π in this second MDP
in terms of V^π_old, c, and γ? (A small numerical sketch for sanity-checking your expression
appears after part (f) below.)
(d) (2pts) Let's go back to our gridworld from (a), with the default values for rg, rr, and γ and with the
value you specified for rs. Suppose we now derive a second gridworld by adding a constant c
to all rewards (rs, rg, and rr) such that rs = +2. How does the optimal policy change? (Just
give a one- or two-sentence description.) What do the values of the unshaded squares become?
(e) (2pts) Now take the second gridworld from part (d) and change γ such that 0 < γ < 1. Can
the optimal policy change, and does it depend on your choice of γ? (A brief description
is sufficient; no formal proof or mathematical analysis is required.)
(f) (2pts) Let's go back to our gridworld from (a), with the default values for rg, rr, and γ and with
the value you specified for rs. In this gridworld, the optimal policy from any unshaded square
never terminates in the red square. Now suppose rs can take on any finite real value
and is no longer restricted to {−1, 0, +1}. Give a value of rs such that there are unshaded
squares starting from which following the optimal policy results in termination in the red
square.
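The following is a minimal sketch (not part of the assignment) for sanity-checking whatever expression you derive in part (c): it evaluates a fixed stationary policy exactly on a small, randomly generated discounted MDP, once with the original rewards and once with a constant c added to every reward. The MDP sizes, the seed, and the helper name policy_evaluation_exact are all made up for illustration.

import numpy as np

def policy_evaluation_exact(P, R, policy, gamma):
    """Exact policy evaluation by solving the linear system V = R_pi + gamma * P_pi V.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    policy: length-S array of actions, gamma: discount factor.
    """
    S = R.shape[0]
    P_pi = P[np.arange(S), policy]      # (S, S) transitions under the policy
    R_pi = R[np.arange(S), policy]      # (S,) rewards under the policy
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma, c = 5, 3, 0.9, 2.0
    P = rng.random((S, A, S))
    P /= P.sum(axis=-1, keepdims=True)  # normalize rows into probability distributions
    R = rng.normal(size=(S, A))
    policy = rng.integers(A, size=S)    # an arbitrary fixed (stationary) policy

    V_old = policy_evaluation_exact(P, R, policy, gamma)
    V_new = policy_evaluation_exact(P, R + c, policy, gamma)
    print("V_new - V_old =", V_new - V_old)  # compare against the expression you derive in (c)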
2 Value of Different Policies [35 pts]
In many situations such as healthcare or education, we cannot run any arbitrary policy and collect
data from running those policies for evaluation. In these cases, we may need to take data collected
from following one policy and use it to evaluate the value of a different policy. The equality proved
in the following exercise can be an important tool for achieving this. The purpose of this exercise
is to get familiar with how to compare the values of two policies, π1 and π2, on a fixed-horizon
MDP. A fixed-horizon MDP is an MDP where the agent's state is reset after H timesteps; H is
called the horizon of the MDP. There is no discount (i.e., γ = 1), and policies are allowed to be
non-stationary, i.e., the action chosen by a policy depends on the timestep in addition to the state.
Let xt ∼ π denote the distribution over states at timestep t (for 1 ≤ t ≤ H) upon following policy π,
let V^π_t(xt) denote the value function of policy π in state xt at timestep t, and let Q^π_t(xt, a) denote
the corresponding Q-value associated with action a. As a clarifying example, we write E_{xt∼π1}[V(xt)]
for the average of the value function V(·) over the states encountered at timestep t upon
following policy π1. Please show the following:
V^{π1}_1(x1) − V^{π2}_1(x1) = Σ_{t=1}^{H} E_{xt∼π2}[ Q^{π1}_t(xt, π1(xt, t)) − Q^{π1}_t(xt, π2(xt, t)) ]
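For reference, one common convention for the finite-horizon quantities above (this is an assumption; your course's definitions may differ slightly, e.g. if rewards depend on the next state) is the backward recursion

Q^π_t(x, a) = r(x, a) + E[V^π_{t+1}(x′) | x, a],    V^π_t(x) = Q^π_t(x, π(x, t)),    V^π_{H+1}(·) = 0,

where x′ is the next state; this recursion is typically what one uses to set up the telescoping argument.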
Intuition: The above expression can be interpreted in the following way. For concreteness,
assume that π1 is the better policy, i.e., V^{π1}_1(x1) ≥ V^{π2}_1(x1). Suppose you are following
policy π2 and you are at timestep t in state xt. You have the option to follow π1 (the better policy)
until the end of the episode, collecting a return of Q^{π1}_t(xt, π1(xt, t)) from the current state and timestep;
or you have the option to follow π2 for one timestep and then follow π1 until the end of
the episode (you could of course follow many other policies). This would give you a "loss" of
Q^{π1}_t(xt, π1(xt, t)) − Q^{π1}_t(xt, π2(xt, t)) that originates from following the worse policy π2 instead of π1
at that timestep. The equation above then says that the value difference between the two policies is the
sum over all timesteps of the losses induced by following the suboptimal policy at that timestep, weighted by the
distribution over states visited by the policy you are following.
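If you want to convince yourself of the identity before proving it, here is a minimal sketch (not part of the assignment) that checks both sides numerically on a tiny, randomly generated fixed-horizon MDP with two random non-stationary deterministic policies; all sizes, the seed, and the helper name q_values are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
S, A, H = 4, 3, 5                      # small state/action spaces and horizon
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)     # transition probabilities p(x' | x, a)
r = rng.normal(size=(S, A))            # rewards r(x, a)
pi1 = rng.integers(A, size=(H, S))     # two non-stationary deterministic policies:
pi2 = rng.integers(A, size=(H, S))     # pi[t, x] is the action taken in state x at timestep t+1

def q_values(pi):
    """Backward induction: Q_t^pi(x, a) for t = 1..H (stored at index t-1), with V_{H+1} = 0."""
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for t in reversed(range(H)):
        Q[t] = r + P @ V_next                      # Q_t(x, a) = r(x, a) + E[V_{t+1}(x')]
        V_next = Q[t][np.arange(S), pi[t]]         # V_t(x) = Q_t(x, pi(x, t))
    return Q

Q1 = q_values(pi1)                                 # Q^{pi_1}
x1 = 0                                             # fixed start state
lhs = Q1[0][x1, pi1[0, x1]] - q_values(pi2)[0][x1, pi2[0, x1]]   # V_1^{pi_1}(x1) - V_1^{pi_2}(x1)

# Right-hand side: expectation over x_t ~ pi_2 of the per-step "loss" measured with Q^{pi_1}.
d = np.zeros(S); d[x1] = 1.0                       # state distribution at t = 1
rhs = 0.0
for t in range(H):
    loss = Q1[t][np.arange(S), pi1[t]] - Q1[t][np.arange(S), pi2[t]]
    rhs += d @ loss
    d = d @ P[np.arange(S), pi2[t]]                # propagate the distribution one step under pi_2

print(lhs, rhs)                                    # these should agree up to floating-point error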
3 Fixed Point [25 pts]
In this exercise we will use Cauchy sequences to prove that value iteration will converge to a unique
fixed point (in this case, a value function V) regardless of the starting point. An element V is a
fixed point of an operator B (in this case the Bellman operator) if applying B to V returns
V, i.e., BV = V. Recall from lecture 2 that the Bellman backup operator B is defined as
Vk+1 := BVk = max_a [ R(s, a) + γ Σ_{s′∈S} p(s′|s, a) Vk(s′) ].
Additionally, in lecture 2, we proved that this Bellman backup is a contraction for γ < 1 in the
infinity norm:
‖BV′ − BV′′‖∞ ≤ γ‖V′ − V′′‖∞
for any two value functions V′ and V′′, meaning that if we apply it to two different value functions, the
distance between them (in the ∞-norm) shrinks after applying the operator to each.
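As a purely numerical illustration (not a proof, and not required for this problem), the sketch below applies the Bellman backup to two arbitrary value functions on a small random MDP and checks that their ∞-norm distance shrinks by at least a factor of γ, and that repeated application drives both to the same fixed point. All sizes and the seed are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 6, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)   # transition probabilities p(s' | s, a)
R = rng.normal(size=(S, A))          # rewards R(s, a)

def bellman_backup(V):
    """Apply the Bellman optimality backup B to a value function V."""
    return np.max(R + gamma * P @ V, axis=1)

# Two arbitrary value functions: one backup brings them closer by a factor <= gamma.
V1, V2 = rng.normal(size=S), rng.normal(size=S)
print(np.max(np.abs(bellman_backup(V1) - bellman_backup(V2))),
      gamma * np.max(np.abs(V1 - V2)))

# Repeated application converges to the same fixed point from either starting point.
for _ in range(1000):
    V1, V2 = bellman_backup(V1), bellman_backup(V2)
print(np.max(np.abs(V1 - V2)))       # essentially zero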
(a) (5pts) Prove by induction that ‖Vn+1 − Vn‖∞ ≤ γ^n ‖V1 − V0‖∞.
(b) (10pts) Prove that for any c > 0, ‖Vn+c − Vn‖∞ ≤ (γ^n / (1 − γ)) ‖V1 − V0‖∞.
A Cauchy sequence is a sequence whose elements become arbitrarily close to each other as the
sequence progresses. Formally, a sequence {an} in a metric space X with distance metric d is a Cauchy
sequence if given an ε > 0 there exists k such that if m, n > k then d(am, an) < ε. Real Cauchy
sequences are convergent.
(c) (2pts) Using this information about Cauchy sequences, argue that the sequence V0, V1, … is a
Cauchy sequence, and is therefore convergent and must converge to some element V; argue that this
V is a fixed point.
(d) (8pts) Show that this fixed point is unique.
4 Frozen Lake MDP [25 pts]
Now you will implement value iteration and policy iteration for the Frozen Lake environment from
OpenAI Gym. We have provided custom versions of this environment in the starter code.
(a) (coding) (10 pts) Read through vi_and_pi.py and implement policy_evaluation, policy_improvement,
and policy_iteration. The stopping tolerance (defined as max_s |V_old(s) − V_new(s)|) is tol = 10^{-3}.
Use γ = 0.9. Return the optimal value function and the optimal policy. (A rough sketch of these
routines, and of value_iteration, appears after part (c) below.)
(b) (coding) (10 pts) Implement value_iteration in vi_and_pi.py. The stopping tolerance is
tol = 10^{-3}. Use γ = 0.9. Return the optimal value function and the optimal policy.
(c) (written) (5 pts) Run both methods on the Deterministic-4x4-FrozenLake-v0 and
Stochastic-4x4-FrozenLake-v0 environments. In the second environment, the dynamics of the world are
stochastic. How does stochasticity affect the number of iterations required and the resulting
policy?
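Following up on parts (a) and (b), here is one possible sketch of the four routines, under the assumption that the environment exposes nS, nA, and a transition model P where P[s][a] is a list of (probability, next_state, reward, terminal) tuples in the style of Gym's FrozenLake. The starter code's exact signatures and conventions may differ, so treat this as an outline rather than the reference implementation.

import numpy as np

def policy_evaluation(P, nS, nA, policy, gamma=0.9, tol=1e-3):
    """Iteratively evaluate a deterministic policy until max_s |V_old(s) - V_new(s)| < tol."""
    V = np.zeros(nS)
    while True:
        V_new = np.zeros(nS)
        for s in range(nS):
            for prob, next_s, reward, done in P[s][policy[s]]:
                # do not bootstrap from terminal next states
                V_new[s] += prob * (reward + gamma * V[next_s] * (not done))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(P, nS, nA, V, gamma=0.9):
    """Return the greedy policy with respect to the value function V."""
    Q = np.zeros((nS, nA))
    for s in range(nS):
        for a in range(nA):
            for prob, next_s, reward, done in P[s][a]:
                Q[s, a] += prob * (reward + gamma * V[next_s] * (not done))
    return np.argmax(Q, axis=1)

def policy_iteration(P, nS, nA, gamma=0.9, tol=1e-3):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    policy = np.zeros(nS, dtype=int)
    while True:
        V = policy_evaluation(P, nS, nA, policy, gamma, tol)
        new_policy = policy_improvement(P, nS, nA, V, gamma)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy

def value_iteration(P, nS, nA, gamma=0.9, tol=1e-3):
    """Repeatedly apply the Bellman optimality backup until the value function stops changing."""
    V = np.zeros(nS)
    while True:
        Q = np.zeros((nS, nA))
        for s in range(nS):
            for a in range(nA):
                for prob, next_s, reward, done in P[s][a]:
                    Q[s, a] += prob * (reward + gamma * V[next_s] * (not done))
        V_new = np.max(Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, np.argmax(Q, axis=1)
        V = V_new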