# CSE257 : Assignment 2

\$30.00

## Description

1 Stochastic Search (Implementation Involved)
8 Question 1 (16 Points). Consider the following functions:
• f1(x) = x
2
1 + x
2
2
9
• f2(x) = −
1 + cos(12p
x
2
1 + x
2
2
)
0.5(x
2
1 + x
2
2
) + 2 10 (this is called the drop-wave function)
• f3(x) = P50
i=1 x
2
i
11 (yes it’s a 50-dimensional function, don’t freak out)
12 Implement the following algorithms (parameters will be specified below):
13 1. (GD) Gradient Descent with fixed step size α.
14 2. (SA) Simulated Annealing with initial temperature T and the annealing schedule Tk = T /k for each k-th
15 iteration (in the first iteration k = 1).
16 3. (CE) Cross-Entropy Methods with k samples in each iteration, and the elite set is formed by the top 20% of
17 all samples in each iteration (whether “top” means higher or lower function value depends on whether you are
18 minimizing or maximizing).
19 4. (SG) Search gradient with k samples in each iteration, and a fixed learning rate of η = 0.01. (Also, first be clear
20 about whether you are minimizing or maximizing a function, and then decide whether you should do gradient
21 descent or ascent.)
22 For all algorithms, start at the initial point xi = 2 for each variable xi used in the function. For (SA), where the search
23 distribution for the next sample is needed, and (CE) and (SG), where an initial distribution is needed, use the Gaussian
24 distribution N (µ, Σ) with Σ being the identify matrix I of the appropriate size (i.e., if the function uses n variables, I
25 is of size n × n.). You should know what µ should be in each of these cases.
26 Plot the following graphs. For each function, our goal is to minimize the function so make sure your updates are
27 in the right direction. Also for each run of any algorithm, always run 100 iterations from the initial point.
28 1. (1 Point) Plot in 3D the drop-wave function f2 defined above, over the domain (x1, x2) ∈ [−5, 5] × [−5, 5] (the
29 vertical axis should show the value of f2(x1, x2) over this domain).
30 2. (2 Point) Perform GD with step size α = 0.01 on the three functions. Plot how the function value changes
31 with respect to the number of iterations (x-axis: number of iterations, y-axis: function value; same for all the
32 following plots as well).
1
33 3. (3 Points) Perform SA with two different initial temperatures T = 1000 and T = 10, for each function. Plot
34 how the function values changes over iterations. Overall 3 functions and 2 different temperatures, so 6 graphs
35 in total. Because the algorithm is stochastic, for each function and each temperature, plot 5 different runs (i.e.
36 5 different random seeds of your choice) in the same graph. That is, start at the same initial point, and since
37 each step will be stochastic, you should plot 5 different trajectories for each function (i.e., 5 sequences of points
38 in each of the 6 plots requested, use a different color for each sequence). This requirement is the same for the
39 following two algorithms as well.
40 4. (4 Points) Perform CE with two sample sizes, k = 10 and k = 50, for each function. Perform 5 runs for each
41 function and each sample size. Plot the average of the function values of all samples in each iteration for each
42 function. Overall 6 plots (each plot should show 5 random trajectories).
43 5. (5 Points) Perform SG with two sample sizes, k = 10 and k = 50, for each function. Perform 5 runs for each
44 function and each sample size. Plot the average of the function values of all in each iteration for each function.
45 Overall 6 plots (each plot should show 5 random trajectories). Note: If you see the gradient update diverging
46 (leading to very large values), you can use the normalized gradient ∇f(x)/∥∇f(x)∥2 in each update instead
47 of just ∇f(x) (so that after multiplying with the learning rate, each update step is guaranteed to be small in
48 magnitude), but you don’t have to do this and you can just show the divergence behavior as well.
49 6. (1 Point) Based on the performance of these algorithms with different parameters on different types of functions,
50 summarize some intuition about how to choose among these algorithms and the parameters. For instance, when
51 the function dimension is large, is it better to sample or find gradient? What are the different trends of the
52 algorithms for different sample sizes? Do some algorithms give more stable behavior than others? There is no
53 standard answer here, just reflect a bit on how you should choose from the different strategies.
54 2 Classical Search on Graphs
55 Question 2 (3 Points). Prove the Separation Property for the Dijkstra algorithm. That is: after each iteration of the
56 algorithm (i.e., after the frontier has been fully updated in an iteration), any acyclic path from any state in the explored
57 set to any state in the unexplored set has to pass through some state in the frontier set. (The “acyclic” assumption here
58 simplifies the proof a bit, but it is not important, because for any path that contains cycles we can always ignore the
59 cycling part and only consider its acyclic part.) You will likely need to do mathematical induction over the number of
60 iterations to prove this.
61 Question 3 (2 Points). Construct a small undirected graph (say, a start node, a goal node, and just a few intermediate
62 nodes. You design the cost for each edge as well.) and design a heuristic function h such that: h itself is consistent, but
63 multiplying it by 5 (i.e. the heuristic value becomes 5h(s) for each node s) makes it inconsistent and misguides the
64 search (i.e. such that it no longer returns the optimal path in the end). Show the path from start to goal in each case.
65 3 Adversarial Search (Implementation Involved)
66 Warning: The minimax/expectimax algorithm is conceptually simple, but to make it work in an actual game it may take
67 longer than you expected, especially if you are not very familiar with Python (Hint: “deep copy”). Start early.
68 Using the starter code here https://github.com/ucsdcsegaoclasses/expectimax for the 2048
69 game. Implement two versions of Expectimax: one with Expectimax on just a depth-1 tree (we refer to it as “Exp-1”)
70 and one with a depth-3 tree (“Exp-3”). The depth-1 tree simply has the current game state as the root, and its children
71 nodes that correspond to the four actions as the leaf nodes, with payoff defined by the game score (provided by the
72 game engine) immediately annotated for these leaf nodes. The depth-3 tree is the game tree that contains all paths
73 following the sequence of “player-move, computer-move, player-move.” Use the score provide by the game engine
74 (the number displayed on the upper left corner in the game interface) as the payoff value at any leaf node in these trees
75 (i.e. the evaluation function here). If any action becomes infeasible (i.e. it becomes impossible to move the tiles in a
2
76 particular direction), you can annotate the corresponding child state as terminal with some very negative payoff such
77 as -100 (you can feel free to design anything, as you’ll only need to plot the actual game scores later).
78 Question 4 (4 Points). Plot the performance of Exp-1 and Exp-3 in the following way. For each version, plot 5 runs
79 of the algorithm starting from the beginning of the game for 100 moves (i.e., call the algorithm 100 times), and show
80 how the actual game score changes (i.e. what the game engines returns after you make each move. Annotate the y-axis
81 in the plot with this value.) over the number of iterations (x-axis). In any run, if the agent fails the game early (can’t
82 move anywhere) then just stop the sequence there.
83 Question 5 (3 Points). Design a different evaluation function for Exp-3 to perform better than Exp-3 with the previous
84 evaluation function that simply uses the original game score. You can use any information from the game state, such
85 as highest tile, pattern of the existing tiles, etc. Design your plots to show that the new algorithm (i.e. Exp-3 with the
86 evaluation function you designed) is better than the original Exp-3.
87 4 Basics of RL
88 Let’s derive some basic concepts for RL in a simple setting. Suppose Figure 3 is a schematic for routes for driving
89 from, say, UCSD to the SAN airport. Each state represents as a segment of the route. For instance s1 is the part within
90 campus. s2 is the segment that takes Highway 5 and s3 is the other option of taking 52 and 805, etc. The time spent
91 in each segment (i.e., each state), depending on traffic, is a random variable C(si). Our goal is to find good policies,
mapping states to actions (aij is the action that takes you from si
92 to sj ), and minimize the expected total time that it
93 takes to start from each state to arrival, which we write as Tπ(si) under some choice of π. Note that all these definitions
94 are similar to but not the same as the usual MDPs: we are minimizing time, and the actions deterministically take you
95 from states to states (assuming you are driving well enough and can always take exits successfully), but the cost of
each state is a random variable, and there is no discount factor.
Figure 1: Graph for the RL questions.
96
97 Question 6 (1 Point). Consider the following deterministic policy π: π(s1) = a12, π(s2) = a24, π(s4) = a46,
98 π(s3) = a35, π(s5) = a56, and s6 is a terminal state with no action. How do you define the overall expected time
99 Tπ(s1) at state the s1 under this policy π? Hint: The time cost at each state s is a random variable C(s), so use its
100 expectation E[C(s)]. Also, under this deterministic policy there is just one deterministic sequence of states that goes
101 from s1 to s6, where it terminates. You do not need to involve any state that is not part of this sequence to estimate the
102 expected time for starting from s1.
103 Question 7 (2 Points). Suppose you have formed good estimates of the minimal total travel time for starting from s2
104 and s3, which we write as T(s2) and T(s3). How would you define the total travel time starting at s1 by taking action
105 a12 (write that as T(s1, a12)), and that of taking a13 (write that as T(s1, a13))? Next, using T(s1, a12) and T(s1, a13),
106 how do you define the optimal total travel time at s1, written as T(s1)? Write down these as three equations for
107 defining T(s1, a12), T(s1, a13), and T(s1), in which you can use T(s2), T(s3), C(s1), etc. Hint: You need to take
108 into account of the expected time cost of E[C(s1)].
3
109 Question 8 (2 Points). Suppose you have lived here for long enough to form good estimates of T(s1). Today you have
110 just driven past the s1 segment and then took the action a13. You realized that because of new construction, the time
111 you spent at s1 today, written as C(s1), is much longer than usual. You also make the assumption that after that there
112 is no reason for T(s3) to change. Using this information, how would you update your estimated T(s1, a13)? If you
113 need a learning rate defined, write it as α.
114 Reflect a bit on your answers in these questions. They correspond to the state values, state-action values (Q-values),
115 and the temporal-difference update rule for Q-learning.
4