## Description

## 1 Tic-Tac-Toe (Implementation Required)

Consider the Tic-Tac-Toe game (if you type this into Google, it will let you play it) against a purely random opponent. That is, when the opponent needs to make a move, it chooses uniformly at random from the empty spots and puts its mark there. The opponent plays first, i.e., from the 9 empty spots it first picks an arbitrary one and marks it as the beginning of the game, and then it is your turn. Naturally, the game can be modeled as an MDP, with you taking actions from the relevant states in the game and the opponent acting as the environment.

Use the following reward function: for any game state s that is not terminal (i.e., no winner and at least one empty spot), you receive R(s) = 1. For any game state s in which you are the winner, R(s) = 10, and if you are the loser, then R(s) = −10. If it is a draw state (all spots are occupied and there is no winner), then R(s) = 0. We will always use the discount factor γ = 0.9.
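The reward function above can be sketched directly in code. This is a minimal illustration assuming states are encoded as 9-tuples over `'X'` (opponent), `'O'` (you), and `None` (empty); the helper names `winner` and `reward` are illustrative, not prescribed.

```python
# All eight three-in-a-line index triples of the 3x3 board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(state):
    """Return 'X' or 'O' if that player has three in a line, else None."""
    for i, j, k in LINES:
        if state[i] is not None and state[i] == state[j] == state[k]:
            return state[i]
    return None

def reward(state):
    """R(s) as defined above: 10 win, -10 loss, 0 draw, 1 non-terminal."""
    w = winner(state)
    if w == 'O':
        return 10
    if w == 'X':
        return -10
    if all(c is not None for c in state):
        return 0   # draw: board full, no winner
    return 1       # non-terminal
```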

Figure 1: A game state. You play circle, and it is your turn now.

Question 1 (1 Point). Define mathematically the states and the actions of the MDP (you do not need to specify which state has which actions; just define these two sets, i.e., the data structure you will use, with no need to enumerate the elements). You can choose any representation you like, anything that you find easy to implement later. Using your definition, list 5 terminal states of the MDP where you win, i.e., states s that satisfy R(s) = 10 as mentioned above.
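One possible encoding, shown here only as an illustrative choice: a state is a 9-tuple over {'X', 'O', None} read row by row (so the state set is the subset of the 3^9 assignments reachable under the turn order), and an action is the index 0–8 of the empty cell where you place your 'O'.

```python
def actions(state):
    """Feasible actions from a state: indices of the empty cells."""
    return [i for i, c in enumerate(state) if c is None]

# Opponent opened at cell 0, we replied at the center (cell 4).
example = ('X', None, None, None, 'O', None, None, None, None)
print(actions(example))  # [1, 2, 3, 5, 6, 7, 8]
```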

Question 2 (3 Points). Implement the MDP so that it can generate all feasible trajectories of the game (you don't need to list these, of course). For this you need to specify the probabilistic transitions P(s′|s, a) for an arbitrary state s and an action a that is feasible from s for the player. Show 5 arbitrary full trajectories of the game (from some initial state to some terminal state; you can choose arbitrary actions in each step) and show the reward-to-go for each state in the trajectory (check the slides for the definition).
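Since the opponent replies uniformly at random, P(s′|s, a) = 1/k for each of the k empty cells left after your move, and the transition is deterministic when your move already ends the game. A sketch under the 9-tuple encoding (all helper names here are illustrative):

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def game_over(s):
    """True if some player has three in a line or the board is full."""
    return any(s[i] and s[i] == s[j] == s[k] for i, j, k in LINES) \
        or all(c is not None for c in s)

def transitions(state, a):
    """Return {next_state: P(next_state | state, a)} for playing 'O' at a."""
    after_us = state[:a] + ('O',) + state[a + 1:]
    if game_over(after_us):
        return {after_us: 1.0}          # terminal: no opponent reply
    empty = [i for i, c in enumerate(after_us) if c is None]
    p = 1.0 / len(empty)                # opponent picks uniformly at random
    return {after_us[:i] + ('X',) + after_us[i + 1:]: p for i in empty}
```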


Question 3 (3 Points). Implement value iteration for the MDP you defined, starting from arbitrary initial values, until the value of every state is at most 0.1 away from its optimal value (i.e., ∥V − V∗∥∞ ≤ 0.1). Explain the justification for why the values that you compute are no more than 0.1 different from the optimal values on any state.

For the 5 trajectories you showed above, show here the state value for each of the states visited in these trajectories.
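A generic value-iteration sketch, assuming hypothetical helpers `actions(s)`, `transitions(s, a)` (returning `{s′: P(s′|s, a)}`), and `reward(s)` as in the earlier parts. The stopping rule uses the standard contraction bound: once a sweep changes no value by more than ε(1 − γ)/γ, the current V is within ε of V∗ in sup-norm, which is one way to justify the 0.1 guarantee.

```python
def value_iteration(states, actions, transitions, reward,
                    gamma=0.9, eps=0.1):
    V = {s: 0.0 for s in states}           # arbitrary initialization
    threshold = eps * (1 - gamma) / gamma  # ~0.0111 for gamma=0.9, eps=0.1
    while True:
        delta = 0.0
        for s in states:
            acts = actions(s)
            if not acts:                   # terminal state: V(s) = R(s)
                new = reward(s)
            else:
                new = max(reward(s) + gamma *
                          sum(p * V[sp]
                              for sp, p in transitions(s, a).items())
                          for a in acts)
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta <= threshold:
            return V
```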

Question 4 (3 Points). Recall that the optimal Q-value for any feasible state-action pair (s, a) is defined as

Q(s, a) = R(s) + γ ∑_{s′} P(s′|s, a) · max_{a′} Q(s′, a′) = R(s) + γ ∑_{s′} P(s′|s, a) · V∗(s′),

where V∗(s′) is the value of the state s′ under the optimal policy. Choose two different initial states (i.e., the opponent picks a different spot in each case) and show the Q-values of all actions for your agent for each of these two states.
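Given the converged values from value iteration, the Q-values follow directly from the second identity above. A sketch, reusing the same hypothetical helper names (`actions`, `transitions`, `reward`) as before:

```python
def q_values(s, V, actions, transitions, reward, gamma=0.9):
    """Return {a: Q(s, a)} via Q(s,a) = R(s) + gamma * sum_s' P(s'|s,a) V*(s')."""
    return {a: reward(s) + gamma *
               sum(p * V[sp] for sp, p in transitions(s, a).items())
            for a in actions(s)}
```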

Question 5 (4 Points). Consider the game state shown in Figure 1 (you play circle and it is now your turn). We will redefine the reward function as follows: for any game state s that is not terminal, set R(s) = c for some fixed constant c. The rewards for all other states (win, lose, and draw states) stay the same as defined above.

Now come up with two different numbers, c1 < 0 and c2 > 0, for this constant c used by all non-terminal states, such that the optimal actions at the game state of Figure 1 differ under these two reward functions. Show the Q-values of all actions at this state for each of these two reward functions (i.e., under c1 and under c2).

Question 6 (2 Points). Also draw (by hand, unless you code faster than that) the expectimax tree with the game state in Figure 1 at the root. This means you should ignore the rewards on all non-terminal nodes, since in expectimax only terminal nodes can have rewards. What are the optimal actions calculated by expectimax here? Intuitively, why can some definitions of rewards, such as in Question 5 above, change the optimal actions? (Just explain the reason in a sentence or two; no need to be precise.)

## 2 Function Approximation in RL (No Implementation)

Question 7 (6 Points). Consider the following MDP: S = {s0, s1, s2, s3}, A = {a1, a2, a3}. The transition probabilities are P(s1|s0, a1) = 0.5, P(s2|s0, a1) = 0.5, P(s2|s0, a2) = 1, P(s3|s0, a3) = 1, and all other combinations have probability 0. (So s1, s2, s3 are terminal states.) Let the discount factor be γ = 0.5.

Suppose we want to represent the Q-value function as a linear function

Q(s, a) = αs + βa    (1)

where α, β ∈ R are weight parameters that can be arbitrarily chosen. When evaluating the function, we fix the encoding of the states and actions as si = i + 1 and ai = i (e.g., s0 = 1 and a1 = 1). For instance, if α = 2 and β = 3, then plugging in the encoding for the state and action gives Q(s0, a1) = 2 · 1 + 3 · 1 = 5.
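The encoding convention can be made concrete with a one-line evaluator (the function name is hypothetical; it just restates equation (1) under the fixed encoding):

```python
def linear_q(alpha, beta, state_idx, action_idx):
    """Evaluate Q(s_i, a_j) = alpha * (i + 1) + beta * j under the encoding
    s_i = i + 1 and a_j = j fixed above."""
    return alpha * (state_idx + 1) + beta * action_idx

print(linear_q(2, 3, 0, 1))  # Q(s0, a1) = 2 * 1 + 3 * 1 = 5
```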

1. (2 Points) Design a reward function such that the linear Q function in (1) can correctly represent all Q-values at s0 (yes, you need to calculate the Q-values under the reward function you designed first). Show both the reward function and the corresponding linear Q-value function.

2. (2 Points) Design a reward function such that the linear Q function does not work. Namely, the function in (1) cannot represent all Q-values at s0 for any choice of α and β (i.e., the difference between the function value and the true Q-value is large for some inputs, regardless of what α and β are). Explain the reasoning.

3. (2 Points) Take the reward function you designed for the previous question (which fails to be captured by the linear function) and design a different function approximation scheme (different from (1)) such that all Q-values at s0 can now be correctly represented.

Note that for all these questions we only ask the function representation to capture the Q-values at state s0, i.e., it does not need to work for the other states (in general it should; this is just to simplify things here).

I hope these questions make you think about the complications that using function approximators can introduce, and why expressive nonlinear function approximators such as neural networks are necessary.


## 3 Blackjack (Implementation Required, Extra Credit)

Since you will likely get busy towards the end of the quarter, this implementation exercise is for extra credit only. We encourage you to do it if you have time and are reasonably comfortable with Python, as it should improve your practical understanding of the relevant RL algorithms.

Implement Monte Carlo policy evaluation, temporal-difference policy evaluation, and Q-learning for Blackjack in the simulator here: https://github.com/ucsdcsegaoclasses/blackjack. All details of the tasks and all parameters needed are given in the README file or in the starter code. Read them carefully.
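For reference, the two policy-evaluation updates named above look as follows in generic form. The episode and step interfaces here are illustrative placeholders, not the simulator's actual API: Monte Carlo moves V(s) toward the observed return G, while TD(0) moves V(s) toward the bootstrapped target r + γ V(s′).

```python
from collections import defaultdict

def mc_update(V, counts, episode, gamma=1.0):
    """Every-visit Monte Carlo with sample-average step sizes.
    episode: list of (state, reward) pairs, in time order."""
    G = 0.0
    for state, r in reversed(episode):   # accumulate returns backwards
        G = r + gamma * G
        counts[state] += 1
        V[state] += (G - V[state]) / counts[state]

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) step after observing the transition (s, r, s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```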

Question 8 (3 Extra Points). Select two game states, A and B: in state A the player's sum of cards is 10, and in state B the sum of cards is 20. Plot how the value estimate of each state changes with the number of visits to the state, until the values converge, under Monte Carlo policy evaluation and temporal-difference policy evaluation, respectively; 4 plots in total.

Question 9 (3 Extra Points). Perform Q-learning and plot how the Q-value changes with the number of visits to each action for the same two game states you selected above, until you have run Q-learning long enough that the Q-value converges on at least some action for each state (note: a very bad action may receive a small number of visits, so this requirement only asks you to wait until the better action has been visited enough times for its Q-value to stabilize).

Also plot the cumulative winning rate over the number of plays in the game: for every n plays (x-axis), show the ratio w/n (y-axis), where w is the total number of wins out of the n plays.
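A generic tabular Q-learning step and an ε-greedy policy, as a sketch only (state/action types and function names are placeholders; the simulator's actual interface is in the starter code):

```python
import random
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, next_actions,
                    alpha=0.1, gamma=1.0):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    next_actions may be empty when s' is terminal (max defaults to 0)."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, acts, eps=0.1):
    """Explore with probability eps, otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.choice(acts)
    return max(acts, key=lambda a: Q[(s, a)])
```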
