## Description

## 1 Stochastic Search (Implementation Involved)

Question 1 (16 Points). Consider the following functions:

- $f_1(x) = x_1^2 + x_2^2$
- $f_2(x) = -\dfrac{1 + \cos\!\left(12\sqrt{x_1^2 + x_2^2}\right)}{0.5\,(x_1^2 + x_2^2) + 2}$ (this is called the drop-wave function)
- $f_3(x) = \sum_{i=1}^{50} x_i^2$ (yes, it’s a 50-dimensional function, don’t freak out)
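To fix notation, the three objectives can be written down directly. A minimal sketch, assuming NumPy is available (any array library works just as well):

```python
import numpy as np

def f1(x):
    # Simple quadratic bowl: x1^2 + x2^2, minimized at the origin.
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

def f2(x):
    # Drop-wave function:
    #   -(1 + cos(12 * sqrt(x1^2 + x2^2))) / (0.5 * (x1^2 + x2^2) + 2)
    # Global minimum value -1 at the origin, with many ripples around it.
    x = np.asarray(x, dtype=float)
    r2 = float(np.sum(x ** 2))
    return float(-(1 + np.cos(12 * np.sqrt(r2))) / (0.5 * r2 + 2))

def f3(x):
    # Sum of squares in 50 dimensions; the same bowl as f1, just higher-dimensional.
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))
```

At the prescribed initial point (every coordinate equal to 2), these evaluate to 8, a value near the first ripple, and 200 respectively, which gives a quick sanity check for your implementations.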

Implement the following algorithms (parameters will be specified below):

1. (GD) Gradient Descent with fixed step size α.
2. (SA) Simulated Annealing with initial temperature T and the annealing schedule T_k = T/k for the k-th iteration (in the first iteration, k = 1).
3. (CE) Cross-Entropy Method with k samples in each iteration, where the elite set is formed by the top 20% of all samples in each iteration (whether “top” means higher or lower function value depends on whether you are minimizing or maximizing).
4. (SG) Search Gradient with k samples in each iteration and a fixed learning rate of η = 0.01. (Also, first be clear about whether you are minimizing or maximizing the function, and then decide whether you should do gradient descent or ascent.)

For all algorithms, start at the initial point x_i = 2 for each variable x_i used in the function. For (SA), where a search distribution for the next sample is needed, and for (CE) and (SG), where an initial distribution is needed, use the Gaussian distribution N(µ, Σ) with Σ being the identity matrix I of the appropriate size (i.e., if the function uses n variables, I is of size n × n). You should know what µ should be in each of these cases.
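To make the loop structure concrete, here is one possible reading of (GD) and (SA) on $f_1$. This is a sketch, not a reference solution: it assumes NumPy, and the SA acceptance rule shown is the standard Metropolis criterion, which the handout leaves for you to choose.

```python
import numpy as np

def f1(x):
    return float(np.sum(np.asarray(x, dtype=float) ** 2))

def gradient_descent(f, grad, x0, alpha=0.01, iters=100):
    # Plain fixed-step gradient descent; records the function value each iteration.
    x = np.asarray(x0, dtype=float)
    history = []
    for _ in range(iters):
        x = x - alpha * grad(x)
        history.append(f(x))
    return x, history

def simulated_annealing(f, x0, T=10.0, iters=100, rng=None):
    # SA with schedule T_k = T / k and a Gaussian proposal N(x, I).
    # The acceptance rule below is the usual Metropolis criterion (an
    # assumption; the handout only fixes the schedule and the proposal).
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    history = []
    for k in range(1, iters + 1):
        Tk = T / k
        cand = x + rng.standard_normal(x.shape)   # sample from N(x, I)
        fc = f(cand)
        # Accept downhill moves always; uphill moves with prob exp(-(fc - fx) / Tk).
        if fc < fx or rng.random() < np.exp(-(fc - fx) / Tk):
            x, fx = cand, fc
        history.append(fx)
    return x, history

# Example: GD on f1 with the prescribed step size and initial point.
x_final, hist = gradient_descent(f1, lambda x: 2 * x, np.array([2.0, 2.0]))
```

The `history` lists are exactly what the plots below ask for: function value against iteration number.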

Plot the following graphs. For each function, our goal is to minimize the function, so make sure your updates are in the right direction. Also, for each run of any algorithm, always run 100 iterations from the initial point.

1. (1 Point) Plot in 3D the drop-wave function f2 defined above, over the domain (x1, x2) ∈ [−5, 5] × [−5, 5] (the vertical axis should show the value of f2(x1, x2) over this domain).

2. (2 Points) Perform GD with step size α = 0.01 on the three functions. Plot how the function value changes with respect to the number of iterations (x-axis: number of iterations; y-axis: function value; the same axes apply to all the following plots as well).


3. (3 Points) Perform SA with two different initial temperatures, T = 1000 and T = 10, for each function. Plot how the function value changes over iterations. Overall, 3 functions and 2 different temperatures, so 6 graphs in total. Because the algorithm is stochastic, for each function and each temperature, plot 5 different runs (i.e., 5 different random seeds of your choice) in the same graph. That is, start at the same initial point, and since each step is stochastic, you should plot 5 different trajectories for each function (i.e., 5 sequences of points in each of the 6 plots requested; use a different color for each sequence). This requirement applies to the following two algorithms as well.

4. (4 Points) Perform CE with two sample sizes, k = 10 and k = 50, for each function. Perform 5 runs for each function and each sample size. Plot the average of the function values of all samples in each iteration for each function. Overall, 6 plots (each plot should show 5 random trajectories).

5. (5 Points) Perform SG with two sample sizes, k = 10 and k = 50, for each function. Perform 5 runs for each function and each sample size. Plot the average of the function values of all samples in each iteration for each function. Overall, 6 plots (each plot should show 5 random trajectories). Note: If you see the gradient update diverging (leading to very large values), you can use the normalized gradient ∇f(x)/∥∇f(x)∥2 in each update instead of just ∇f(x) (so that after multiplying by the learning rate, each update step is guaranteed to be small in magnitude), but you don’t have to do this and you can just show the divergence behavior as well.

6. (1 Point) Based on the performance of these algorithms with different parameters on different types of functions, summarize some intuition about how to choose among these algorithms and their parameters. For instance, when the function dimension is large, is it better to sample or to compute gradients? What are the trends of the algorithms for different sample sizes? Do some algorithms give more stable behavior than others? There is no standard answer here; just reflect a bit on how you should choose among the different strategies.
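For orientation on the CE runs requested above, one iteration of the cross-entropy method can be structured as below. This is a sketch under stated assumptions: NumPy, minimization, and refitting a diagonal covariance to the elites (the handout only fixes the initial Σ = I; how you refit is up to you).

```python
import numpy as np

def cross_entropy_minimize(f, mu0, k=50, elite_frac=0.2, iters=100, rng=None):
    # Cross-entropy method: sample k points from N(mu, diag(sigma^2)),
    # keep the top 20% with the LOWEST f (we are minimizing), and refit
    # the Gaussian to those elites. Diagonal-covariance refitting is an
    # assumption of this sketch, not a requirement of the handout.
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu0, dtype=float)
    sigma = np.ones_like(mu)                      # initial Sigma = I (diagonal)
    n_elite = max(1, int(round(elite_frac * k)))
    avg_values = []
    for _ in range(iters):
        samples = mu + sigma * rng.standard_normal((k, mu.size))
        values = np.array([f(s) for s in samples])
        avg_values.append(float(values.mean()))   # the quantity the plots ask for
        elites = samples[np.argsort(values)[:n_elite]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-8         # small floor avoids collapse to zero
    return mu, avg_values

# Example on f1 (sum of squares), starting from the prescribed point (2, 2).
f1 = lambda x: float(np.sum(np.asarray(x) ** 2))
mu, avg = cross_entropy_minimize(f1, np.array([2.0, 2.0]), k=50,
                                 rng=np.random.default_rng(0))
```

The search-gradient (SG) loop has the same sample-then-update shape, but it updates µ along an estimated gradient of the expected function value instead of refitting to an elite set.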

## 2 Classical Search on Graphs

Question 2 (3 Points). Prove the Separation Property for Dijkstra’s algorithm. That is: after each iteration of the algorithm (i.e., after the frontier has been fully updated in an iteration), any acyclic path from any state in the explored set to any state in the unexplored set has to pass through some state in the frontier set. (The “acyclic” assumption here simplifies the proof a bit, but it is not important, because for any path that contains cycles we can always ignore the cycling part and only consider its acyclic part.) You will likely need to do mathematical induction over the number of iterations to prove this.

Question 3 (2 Points). Construct a small undirected graph (say, a start node, a goal node, and just a few intermediate nodes; you design the cost of each edge as well) and design a heuristic function h such that: h itself is consistent, but multiplying it by 5 (i.e., the heuristic value becomes 5h(s) for each node s) makes it inconsistent and misguides the search (i.e., such that it no longer returns the optimal path in the end). Show the path from start to goal in each case.

## 3 Adversarial Search (Implementation Involved)

Warning: The minimax/expectimax algorithm is conceptually simple, but making it work in an actual game may take longer than you expect, especially if you are not very familiar with Python (Hint: “deep copy”). Start early.

Use the starter code at https://github.com/ucsdcsegaoclasses/expectimax for the 2048 game. Implement two versions of Expectimax: one with Expectimax on just a depth-1 tree (we refer to it as “Exp-1”) and one with a depth-3 tree (“Exp-3”). The depth-1 tree simply has the current game state as the root and, as leaf nodes, the children that correspond to the four actions, with the payoff defined by the game score (provided by the game engine) immediately annotated for these leaf nodes. The depth-3 tree is the game tree that contains all paths following the sequence “player-move, computer-move, player-move.” Use the score provided by the game engine (the number displayed in the upper-left corner of the game interface) as the payoff value at any leaf node in these trees (i.e., the evaluation function here). If any action becomes infeasible (i.e., it becomes impossible to move the tiles in a particular direction), you can annotate the corresponding child state as terminal with some very negative payoff such as −100 (feel free to design anything, as you’ll only need to plot the actual game scores later).

Question 4 (4 Points). Plot the performance of Exp-1 and Exp-3 in the following way. For each version, plot 5 runs of the algorithm, starting from the beginning of the game, for 100 moves (i.e., call the algorithm 100 times), and show how the actual game score changes (i.e., what the game engine returns after you make each move; annotate the y-axis in the plot with this value) over the number of iterations (x-axis). In any run, if the agent fails the game early (can’t move anywhere), just stop the sequence there.

Question 5 (3 Points). Design a different evaluation function for Exp-3 so that it performs better than Exp-3 with the previous evaluation function, which simply uses the original game score. You can use any information from the game state, such as the highest tile, the pattern of the existing tiles, etc. Design your plots to show that the new algorithm (i.e., Exp-3 with the evaluation function you designed) is better than the original Exp-3.

## 4 Basics of RL

Let’s derive some basic concepts for RL in a simple setting. Suppose Figure 1 is a schematic of the routes for driving from, say, UCSD to the SAN airport. Each state represents a segment of the route. For instance, s1 is the part within campus, s2 is the segment that takes Highway 5, and s3 is the other option of taking 52 and 805, etc. The time spent in each segment (i.e., each state), depending on traffic, is a random variable C(si). Our goal is to find good policies mapping states to actions (aij is the action that takes you from si to sj) and to minimize the expected total time it takes to get from each state to arrival, which we write as Tπ(si) under some choice of π. Note that all these definitions are similar to, but not the same as, the usual MDPs: we are minimizing time, and the actions deterministically take you from state to state (assuming you are driving well enough and can always take exits successfully), but the cost of each state is a random variable, and there is no discount factor.

Figure 1: Graph for the RL questions.

Question 6 (1 Point). Consider the following deterministic policy π: π(s1) = a12, π(s2) = a24, π(s4) = a46, π(s3) = a35, π(s5) = a56, and s6 is a terminal state with no action. How do you define the overall expected time Tπ(s1) at the state s1 under this policy π? Hint: The time cost at each state s is a random variable C(s), so use its expectation E[C(s)]. Also, under this deterministic policy there is just one deterministic sequence of states that goes from s1 to s6, where it terminates. You do not need to involve any state that is not part of this sequence to estimate the expected time for starting from s1.

Question 7 (2 Points). Suppose you have formed good estimates of the minimal total travel time for starting from s2 and from s3, which we write as T(s2) and T(s3). How would you define the total travel time starting at s1 by taking action a12 (write that as T(s1, a12)), and that of taking a13 (write that as T(s1, a13))? Next, using T(s1, a12) and T(s1, a13), how do you define the optimal total travel time at s1, written as T(s1)? Write these down as three equations defining T(s1, a12), T(s1, a13), and T(s1), in which you can use T(s2), T(s3), C(s1), etc. Hint: You need to take into account the expected time cost E[C(s1)].


Question 8 (2 Points). Suppose you have lived here long enough to form a good estimate of T(s1). Today you have just driven past the s1 segment and then taken the action a13. You realize that, because of new construction, the time you spent at s1 today, written as C(s1), is much longer than usual. You also make the assumption that after that there is no reason for T(s3) to change. Using this information, how would you update your estimate of T(s1, a13)? If you need a learning rate, write it as α.

Reflect a bit on your answers to these questions. They correspond to the state values, the state-action values (Q-values), and the temporal-difference update rule of Q-learning.
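For reference, the update rule these questions build toward is worth stating once in its textbook form. In the standard reward-maximizing, discounted MDP setting, the Q-learning temporal-difference update is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \Big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big)$$

In the cost-minimizing, undiscounted setting above, you should derive the analogue: the observed cost C(s) plays the role of the reward, the max over next actions becomes a min, and there is no discount factor γ.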
