Description
1 Introduction
In part one of this assignment you will build a recurrent neural network; specifically, you will replicate a
portion of the torch.nn.GRUCell interface. GRUs are used for a number of tasks, such as Optical Character
Recognition and Speech Recognition on spectrograms using transcripts of the dialog. This homework develops
your basic understanding of backpropagation through a GRUCell, which can then be extended to full
GRU networks to grasp the concept of Backpropagation Through Time (BPTT).
2 GRU: Gated Recurrent Unit
In this assignment you will implement the forward pass and backward pass for a GRUCell using Python and NumPy,
analogous to the PyTorch equivalent nn.GRUCell. The equations for a GRU cell are the following:
z_t = \sigma(W_{zh} h_{t-1} + W_{zx} x_t)    (1)
r_t = \sigma(W_{rh} h_{t-1} + W_{rx} x_t)    (2)
\tilde{h}_t = \tanh(W_{h} (r_t \odot h_{t-1}) + W_{x} x_t)    (3)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t    (4)
where x_t is the input vector at time t, h_{t-1} is the previous hidden state, h_t is the output hidden state,
\sigma is the sigmoid function, and \odot denotes element-wise multiplication. There are other possible
implementations, but you need to follow the equations for the forward pass as shown above. If you do not,
you might end up with a working GRU but zero points on Autolab. Do not modify the __init__ method; if you do,
it might result in lost points.
Similar to previous assignments, you will be implementing a Python class, GRU Cell, found in gru.py.
Specifically, you will be implementing the forward and the backward methods.
2.1 GRU Cell Forward (30 Points)
In this section, you will implement the forward method of the GRU Cell. This method takes 2 inputs: the
observation at the current time step, x_t, and the hidden state at the previous time step, h_{t-1}.
Use Equations 1-4 to implement the forward method, and return the value of h_t.
Hint: Store all relevant intermediary values in the forward pass.
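For concreteness, below is a minimal sketch of what the forward method could look like, assuming x and h are
1-D NumPy vectors and that __init__ creates the six weight matrices under the attribute names self.Wzh,
self.Wzx, self.Wrh, self.Wrx, self.Wh, and self.Wx. These names, and the cached attributes, are purely
illustrative; your gru.py may name and shape things differently.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Inside the GRU cell class:
def forward(self, x, h):
    # Cache the inputs; the backward pass needs them.
    self.x = x
    self.hidden = h
    self.z = sigmoid(self.Wzh @ h + self.Wzx @ x)                  # update gate, Eq. (1)
    self.r = sigmoid(self.Wrh @ h + self.Wrx @ x)                  # reset gate, Eq. (2)
    self.h_tilde = np.tanh(self.Wh @ (self.r * h) + self.Wx @ x)   # candidate state, Eq. (3)
    h_t = (1 - self.z) * h + self.z * self.h_tilde                 # new hidden state, Eq. (4)
    return h_t

Caching x, h, r, z, and \tilde{h}_t here is what makes the backward pass straightforward to write later.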
2.2 GRU Cell Backward (70 Points)
The backward method of the GRU Cell is the most time-consuming task of this homework.
This method takes delta as input, must calculate the gradients w.r.t. the parameters, and returns the
derivatives w.r.t. the inputs to the cell, x_t and h_{t-1}.
The partial derivative input you are given, delta, is the sum of the derivative of the loss w.r.t. the input
of the next layer, x_t^{(l+1)}, and the derivative of the loss w.r.t. the input hidden state at the next
time step, h_{t+1}^{(l)}.
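In symbols, with l indexing the layer and t the time step, delta is therefore the total derivative of the loss
w.r.t. this cell's output:

\delta = \frac{\partial L}{\partial x_t^{(l+1)}} + \frac{\partial L}{\partial h_{t+1}^{(l)}} = \frac{\partial L}{\partial h_t^{(l)}}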
Using these partials, you will need to compute the partial derivative of the loss w.r.t. each of the six weight
matrices (see Equations 1-4), and the partial derivatives of the loss w.r.t. the input x_t and the previous hidden state h_{t-1}.
Specifically, there are eight gradients that need to be computed:
1. \partial L / \partial W_{rx}, stored in self.dWrx
2. \partial L / \partial W_{rh}, stored in self.dWrh
3. \partial L / \partial W_{zx}, stored in self.dWzx
4. \partial L / \partial W_{zh}, stored in self.dWzh
5. \partial L / \partial W_{x}, stored in self.dWx
6. \partial L / \partial W_{h}, stored in self.dWh
7. \partial L / \partial x_t, returned by the method
8. \partial L / \partial h_{t-1}, returned by the method
You will need to derive the formulae for the backward pass yourself in order to complete this section of the
assignment.
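As a starting point, here is one possible sketch of the backward method under the same assumptions as the
forward sketch above (1-D vectors, intermediates cached in forward as self.x, self.hidden, self.r, self.z,
self.h_tilde, and gradient accumulators self.dW* zero-initialized in __init__). It is not the reference
implementation; it only illustrates applying the chain rule to Equations 1-4 in reverse order.

# Inside the GRU cell class:
def backward(self, delta):
    # delta: dL/dh_t, the total derivative of the loss w.r.t. this cell's output.
    x, h = self.x, self.hidden
    z, r, h_tilde = self.z, self.r, self.h_tilde

    # Eq. (4): h_t = (1 - z) * h + z * h_tilde
    dz = delta * (h_tilde - h)
    dh_tilde = delta * z
    dh_prev = delta * (1 - z)        # running dL/dh_{t-1}
    dx = np.zeros_like(x)            # running dL/dx_t

    # Eq. (3): h_tilde = tanh(Wh (r * h) + Wx x);  tanh'(a) = 1 - tanh(a)^2
    da = dh_tilde * (1 - h_tilde ** 2)
    self.dWh += np.outer(da, r * h)
    self.dWx += np.outer(da, x)
    drh = self.Wh.T @ da             # gradient w.r.t. (r * h)
    dr = drh * h
    dh_prev += drh * r
    dx += self.Wx.T @ da

    # Eq. (1): z = sigmoid(Wzh h + Wzx x);  sigmoid'(a) = z * (1 - z)
    dz_in = dz * z * (1 - z)
    self.dWzh += np.outer(dz_in, h)
    self.dWzx += np.outer(dz_in, x)
    dh_prev += self.Wzh.T @ dz_in
    dx += self.Wzx.T @ dz_in

    # Eq. (2): r = sigmoid(Wrh h + Wrx x)
    dr_in = dr * r * (1 - r)
    self.dWrh += np.outer(dr_in, h)
    self.dWrx += np.outer(dr_in, x)
    dh_prev += self.Wrh.T @ dr_in
    dx += self.Wrx.T @ dr_in

    return dx, dh_prev

Checking each returned gradient against a numerical (finite-difference) estimate on small random inputs is a
good way to validate your derivation before submitting.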