Assignment 3: Graphical Models / Recurrent Neural Networks / Reinforcement Learning
1 Graphical Models (22 marks)
Consider the problem of determining whether a local high school student will attend SFU or not.
Define a boolean random variable A (true if the person will attend SFU), discrete random variables
L (maximum of parents’ education level: can take values o for non-university or u for university)
and G (current provincial government: l for Liberal Party, d for NDP), and continuous valued
random variables E (current provincial economy size) and T (SFU tuition level).
1. 4 marks. Draw a simple Bayesian network for this domain.
2. 2 marks. Write the factored representation for the joint distribution p(A, L, G, E, T) that is
described by your Bayesian network.
3. 8 marks. Supply all necessary conditional distributions. Provide the type of distribution that
should be used and give rough guidance / example values for parameters (do this by hand,
educated guesses).
4. 8 marks. Suppose we had a training set and wanted to learn the parameters of the distributions using maximum likelihood. Denote each of the N examples with its values for each
random variable by $x_n = (a_n, l_n, g_n, e_n, t_n)$. The training set is $\{x_1, x_2, \ldots, x_N\}$.
Which elements of the training data are needed to learn the parameters for $p(A \mid \mathrm{pa}_A)$? Why?
(Note that $\mathrm{pa}_A$ denotes the parents of A.)
Start by writing down the likelihood and argue from there.
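As a generic starting point for part 4, and not tied to any particular network structure, the likelihood of a Bayesian network factorizes over examples and over nodes. The block below is a sketch under two assumptions that the handout does not state: the N examples are independent and identically distributed, and each node k has its own parameter vector $\theta_k$. The symbols K, $x_{n,k}$, $\mathrm{pa}_{n,k}$ and $\theta_k$ are notation introduced here for illustration.

```latex
p(\mathcal{D} \mid \theta)
  = \prod_{n=1}^{N} \prod_{k=1}^{K}
    p\left(x_{n,k} \mid \mathrm{pa}_{n,k};\, \theta_k\right)
\quad\Longrightarrow\quad
\log p(\mathcal{D} \mid \theta)
  = \sum_{k=1}^{K} \sum_{n=1}^{N}
    \log p\left(x_{n,k} \mid \mathrm{pa}_{n,k};\, \theta_k\right)
```

Here $x_{n,k}$ is the value of node k in example n and $\mathrm{pa}_{n,k}$ are the values of its parents in the same example.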
2 Gated Recurrent Unit (10 marks)
A Gated Recurrent Unit (GRU) is another type of recurrent neural network unit with the ability
to remember and forget components of the state vector (see Cho et al. EMNLP 2014 https://arxiv.org/abs/1406.1078).
Read Sec. 2.3 of the linked paper for the description of the GRU. Note that the GRU’s state consists
of a vector of h values. There are two gates, $r_j$ and $z_j$, which control the update of $h_j$, the $j$-th
component of the GRU state.
• What values of $r_j$ and $z_j$ would cause the new state for $h_j$ to be similar to its old state? Give
a short, qualitative answer.
• If $r_j$ and $z_j$ are both close to 0, how would the state for $h_j$ be updated? Give a short,
qualitative answer.
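For experimenting with the gates numerically, the update from Sec. 2.3 of the linked paper can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `gru_step`, the helper names, and the list-of-lists weight layout are assumptions made here, and bias terms are omitted.

```python
import math


def sigmoid(a):
    """Logistic function used for both gates."""
    return 1.0 / (1.0 + math.exp(-a))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gru_step(x, h_prev, W_r, U_r, W_z, U_z, W, U):
    """One GRU state update (hypothetical helper, biases omitted)."""
    # Reset gate r and update gate z: one value per state component j.
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), matvec(U_r, h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(W_z, x), matvec(U_z, h_prev))]
    # Candidate state: the reset gate scales how much of h_prev is visible.
    reset_h = [r_j * h_j for r_j, h_j in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b)
               for a, b in zip(matvec(W, x), matvec(U, reset_h))]
    # New state: z_j interpolates between the old state and the candidate.
    h_new = [z_j * h_j + (1.0 - z_j) * c_j
             for z_j, h_j, c_j in zip(z, h_prev, h_tilde)]
    return h_new, r, z


# Toy usage: 1-D input, 2-D state, all weights zero (made-up values).
zeros_in = [[0.0], [0.0]]
zeros_state = [[0.0, 0.0], [0.0, 0.0]]
h_new, r, z = gru_step([1.0], [0.8, -0.4],
                       zeros_in, zeros_state,
                       zeros_in, zeros_state,
                       zeros_in, zeros_state)
```

With all weights set to zero, both gates equal 0.5 and the candidate state is 0, so each state component is halved. Changing the weights to push the gates towards 0 or 1 is one way to explore the two questions above.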
CMPT 419/726: Assignment 3
3 Reinforcement Learning (17 marks)
This question guides you through implementing the policy gradient algorithm with average reward
baseline.
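For orientation, the policy gradient estimator with a baseline is commonly written as below. This is the standard form, given here as a sketch: the policy notation $\pi_\theta$, the episode index i, and the number of episodes N are introduced here and do not come from the provided skeleton code.

```latex
\nabla_\theta J(\theta) \approx
  \frac{1}{N} \sum_{i=1}^{N} \sum_{t \ge 0}
  \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) A_t^{(i)},
\qquad
A_t^{(i)} = \sum_{t' \ge t} \gamma^{t'-t}\, r\left(s_{t'}^{(i)}, a_{t'}^{(i)}\right) - b
```

Subtracting the baseline b leaves the expected gradient unchanged and typically reduces its variance.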
Preparation:
• Install gym and TensorFlow for Python. Documentation can be found at https://gym.openai.com/ and https://www.tensorflow.org/install.
• Replace cartpole.py in gym with the version provided. The included file
cartpole_stabilize.py contains the skeleton code for training a cartpole to achieve
its goal of keeping its position centred and pole upright.
The cartpole environment consists of a rotatable pole mounted on top of a cart. The states of the
system are the position and velocity (x, v) of the cart, and the angular position and velocity (θ, ω)
of the pole. The two possible actions are to push the cart left or right with a constant force.
Our goal in this problem is to keep the cart’s position near zero and the pole near upright for as long
as possible. To encourage this, in the custom environment defined in the provided cartpole.py
file, the cartpole system receives a reward of 1 for every time step in which its state satisfies
$|x| \le 0.5$ and $|\theta| \le \frac{4\pi}{180}$. Training episodes terminate when the system state violates
$|x| \le 1.5$ or $|\theta| \le \frac{12\pi}{180}$.
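The reward and termination rules described above can be sketched as two small Python functions. This is an illustration only and is not the code in the provided cartpole.py. The function and constant names are hypothetical, the reward of 0 outside the reward region is an assumption, and the termination rule is read here as "the episode ends as soon as either bound is exceeded", which is also an assumption.

```python
import math

# Thresholds taken from the text; angles are in radians.
X_REWARD, THETA_REWARD = 0.5, 4 * math.pi / 180   # reward region
X_LIMIT, THETA_LIMIT = 1.5, 12 * math.pi / 180    # episode bounds


def step_reward(x, theta):
    """Reward 1 inside the reward region, else 0 (the 0 is an assumption)."""
    inside = abs(x) <= X_REWARD and abs(theta) <= THETA_REWARD
    return 1.0 if inside else 0.0


def episode_terminated(x, theta):
    """ASSUMPTION: terminate once either episode bound is exceeded."""
    return abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT
```

For example, a cart at x = 0.6 with an upright pole earns no reward under this sketch, but the episode continues because 0.6 is still within the episode bound of 1.5.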
1. In the __init__ method of the agent class, define a policy network that takes as input the
state, has two fully connected hidden layers of the desired number of neurons with ReLU activation,
and outputs the probability distribution of applying the two possible actions.
2. In the __init__ method of the agent class, compute the probability of applying the actions
in the input data.
3. In the __init__ method of the agent class, define the loss function such that its gradient
is $\nabla_\theta J(\theta)$.
4. Complete the compute advantage function, which should compute a list of advantage values
$A_t = \sum_{t' \ge t} \gamma^{t'-t} r(s_{t'}, a_{t'}) - b$ for every time step across a batch of episodes, where
$b = \mathbb{E}_{\tau \sim p(\tau;\theta)} \sum_{t \ge 0} \gamma^{t} r(s_t, a_t)$ is the average reward across the batch of episodes. Note that the
batch size is specified by the update_frequency variable.
5. Complete the main part of the script (fill in the unmodified cartpole_stabilize.py
at lines 73-78, 104-107, 122-124).
6. Produce several plots showing the state of the cart-pole system at different snapshots in time for
a well-performing episode.
7. Produce a plot showing the sum of discounted rewards in each episode vs. episode number.
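The advantage computation asked for in step 4 can be sketched in plain Python as follows. This is a minimal illustration, not the signature expected by the skeleton code: the names `discounted_returns` and `compute_advantages`, the list-of-lists input format, and the assumption that every episode is non-empty are all introduced here. The baseline is estimated as the batch average of the discounted return from the first time step.

```python
def discounted_returns(rewards, gamma):
    """Return-to-go sum_{t' >= t} gamma^(t'-t) * r_{t'} for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def compute_advantages(batch_rewards, gamma):
    """Advantages for a batch of episodes (hypothetical helper).

    batch_rewards: list of episodes, each a non-empty list of rewards.
    Returns (advantages, baseline) with advantages in the same layout.
    """
    per_episode = [discounted_returns(r, gamma) for r in batch_rewards]
    # Baseline b: average over episodes of the discounted return from t = 0.
    baseline = sum(ret[0] for ret in per_episode) / len(per_episode)
    advantages = [[g - baseline for g in ret] for ret in per_episode]
    return advantages, baseline


# Toy usage with made-up rewards for two episodes.
adv, b = compute_advantages([[1.0, 1.0, 1.0], [1.0, 1.0]], gamma=0.5)
```

The backward pass makes each episode linear in its length, instead of recomputing the discounted sum from scratch at every time step.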
4 Attention Models (Optional)
As an alternative to recurrent neural network structures, attention models can be used to analyze
an input sequence directly to compute a sequence of output state representations.
If you are interested in learning more, consider reading Vaswani et al. NIPS 2017 https://arxiv.org/abs/1706.03762.
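To give a feel for how attention maps an input sequence directly to a sequence of outputs, here is a minimal sketch of scaled dot-product attention in plain Python. It covers only a single attention operation, with no learned projections, multiple heads, or masking. The function names and the list-of-lists data layout are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q . k / sqrt(d_k)) over values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy usage: a zero query attends equally to both positions.
result = attention([[0.0, 0.0]],
                   [[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 0.0], [0.0, 1.0]])
```

Each output position is computed from all input positions at once, with no recurrence over time steps.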