CSCI-GA 2572 Deep Learning
Homework 2: Convolutional Neural Networks and Recurrent Neural Networks

The goal of homework 2 is to get you to work with convolutional neural networks and recurrent neural networks.
In the theoretical part (Part 1), you will work out how backpropagation operates in these networks. In Part 2, you will implement and train them.
For Part 1, submit all your answers in a PDF file. As before, we recommend using LaTeX.
For Part 2, you will implement some neural networks by adding your code to the provided ipynb files.
As before, please use numerator layout.
The due date of homework 2 is 11:59pm 02/26. Submit the following files in a
zip file your_net_id.zip through NYU Brightspace:
• hw2_theory.pdf
• hw2_cnn.ipynb
• hw2_rnn.ipynb
• 08-seq_classification.ipynb
The following behaviors will result in a penalty to your final score:
1. 10% penalty for submitting your files without the correct naming format (this includes misnaming the zip file, PDF file, or notebook files, or adding extra files, such as testing scripts, to the zip folder).
2. 20% penalty for late submission within the first 24 hours after the deadline. We will not accept any submission more than 24 hours late.
3. 20% penalty for a code submission that cannot be executed by following the steps described in this document.
1 Theory (50 pts)
1.1 Convolutional Neural Networks (15 pts)
(a) (1 pt) Given an input image of dimension 12 × 21, what will be the output dimension after applying a convolution with a 5 × 4 kernel, stride of 4, and no padding?
(b) (2 pts) Given an input of dimension C × H × W, what will be the dimension of the output of a convolutional layer with kernel of size K × K, padding P, stride S, dilation D, and F filters? Assume that H ≥ K, W ≥ K.
(c) (12 pts) In this section, we are going to work with 1-dimensional convolutions. The discrete convolution of a 1-dimensional input x[n] and kernel k[n] is defined as follows:
s[n] = (x ∗ k)[n] = Σ_m x[n − m] k[m]
However, in machine learning, convolution is usually implemented as cross-correlation, which is defined as follows:
s[n] = (x ∗ k)[n] = Σ_m x[n + m] k[m]
Note the difference in signs, which will get the network to learn a “flipped” kernel. In general it doesn’t change much, but it’s important to keep it in mind. In convolutional neural networks, the kernel k[n] is usually 0 everywhere except a few values near 0: k[n] = 0 for all |n| > M. Then the formula becomes:
s[n] = (x ∗ k)[n] = Σ_{m=−M}^{M} x[n + m] k[m]
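For intuition, here is a minimal numpy sketch of this cross-correlation for a single channel, with no stride and no padding, evaluated only where the window fits; the function name and layout are our own, not part of the assignment.

```python
import numpy as np

def xcorr1d(x, k, M):
    # s[n] = sum_{m=-M}^{M} x[n + m] * k[m], evaluated only where the window fits;
    # k is stored so that k[m + M] holds the kernel value at offset m.
    out = []
    for n in range(M, len(x) - M):
        out.append(sum(x[n + m] * k[m + M] for m in range(-M, M + 1)))
    return np.array(out)

x = np.arange(7.0)               # a length-7, single-channel signal
k = np.array([1.0, 0.0, -1.0])   # kernel with M = 1
print(xcorr1d(x, k, M=1))        # 5 outputs: x[n-1] - x[n+1] for n = 1..5
```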
Let’s consider an input x[n] ∈ R^5, with 1 ≤ n ≤ 7, i.e. it is a length-7 sequence with 5 channels. We consider the convolutional layer f_W with one filter, with kernel size 3, stride of 2, no dilation, and no padding. The only parameter of the convolutional layer is the weight W ∈ R^{1×5×3}; there is no bias and no non-linearity.
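As a sanity check on the shapes, this layer can be mirrored with a standard PyTorch Conv1d; the sketch below assumes the default nn.Conv1d conventions match the setup above, and the variable names are ours.

```python
import torch
import torch.nn as nn

# One filter over 5 input channels, kernel size 3, stride 2, no padding, no bias
f_W = nn.Conv1d(in_channels=5, out_channels=1, kernel_size=3, stride=2, bias=False)
print(f_W.weight.shape)      # torch.Size([1, 5, 3]), matching W ∈ R^{1×5×3}

x = torch.randn(1, 5, 7)     # a batch with one length-7 sequence with 5 channels
y = f_W(x)                   # the output f_W(x); inspect y.shape for part (i)
```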
(i) (1 pt) What is the dimension of the output f_W(x)? Provide an expression for the value of the elements of the convolutional layer output f_W(x). Example answer format, here and in the following sub-problems: f_W(x) ∈ R^{42×42×42}, f_W(x)[i, j, k] = 42.
(ii) (3 pts) What is the dimension of ∂f_W(x)/∂W? Provide an expression for the values of the derivative ∂f_W(x)/∂W.
(iii) (3 pts) What is the dimension of ∂f_W(x)/∂x? Provide an expression for the values of the derivative ∂f_W(x)/∂x.
(iv) (5 pts) Now, suppose you are given the gradient of the loss ℓ w.r.t. the output of the convolutional layer f_W(x), i.e. ∂ℓ/∂f_W(x). What is the dimension of ∂ℓ/∂W? Provide an expression for ∂ℓ/∂W. Explain the similarities and differences between this expression and the expression in (i).
1.2 Recurrent Neural Networks (30 pts)
1.2.1 Part 1
In this section we consider a simple recurrent neural network defined as follows:
c[t] = σ(W_c x[t] + W_h h[t−1])    (1)
h[t] = c[t] ⊙ h[t−1] + (1 − c[t]) ⊙ W_x x[t]    (2)
where σ is the element-wise sigmoid, x[t] ∈ R^n, h[t] ∈ R^m, W_c ∈ R^{m×n}, W_h ∈ R^{m×m}, W_x ∈ R^{m×n}, ⊙ is the Hadamard product, and h[0] := 0.
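For reference, a minimal PyTorch sketch of one step of this recurrence; the function and variable names here are our own, not part of the assignment.

```python
import torch

def step(x_t, h_prev, W_c, W_h, W_x):
    # c[t] = sigmoid(W_c x[t] + W_h h[t-1])
    c_t = torch.sigmoid(W_c @ x_t + W_h @ h_prev)
    # h[t] = c[t] ⊙ h[t-1] + (1 - c[t]) ⊙ W_x x[t]
    return c_t * h_prev + (1 - c_t) * (W_x @ x_t)

n, m = 4, 3
W_c, W_h, W_x = torch.randn(m, n), torch.randn(m, m), torch.randn(m, n)
h = torch.zeros(m)                 # h[0] = 0
for x_t in torch.randn(5, n):      # a length-5 input sequence
    h = step(x_t, h, W_c, W_h, W_x)
```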
(a) (4 pts) Draw a diagram for this recurrent neural network, similar to the
diagram of RNN we had in class. We suggest using diagrams.net.
(b) (1 pt) What is the dimension of c[t]?
(c) (5 pts) Suppose that we run the RNN to get a sequence of h[t] for t from 1 to K. Assuming we know the derivative ∂ℓ/∂h[t], provide the dimension of, and an expression for the values of, ∂ℓ/∂W_x. What are the similarities of the backward pass and the forward pass in this RNN?
(d) (2 pts) Can this network be subject to vanishing or exploding gradients? Why?
1.2.2 Part 2
We define an AttentionRNN(2) as
q_0[t], q_1[t], q_2[t] = Q_0 x[t], Q_1 h[t−1], Q_2 h[t−2]    (3)
k_0[t], k_1[t], k_2[t] = K_0 x[t], K_1 h[t−1], K_2 h[t−2]    (4)
v_0[t], v_1[t], v_2[t] = V_0 x[t], V_1 h[t−1], V_2 h[t−2]    (5)
w_i[t] = q_i[t]ᵀ k_i[t]    (6)
a[t] = softargmax([w_0[t], w_1[t], w_2[t]])    (7)
h[t] = Σ_{i=0}^{2} a_i[t] v_i[t]    (8)
where x[t], h[t] ∈ R^n, and Q_i, K_i, V_i ∈ R^{n×n}. We define h[t] = 0 for t < 1. You may safely ignore these base cases in the following questions.
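For intuition, a minimal PyTorch sketch of one AttentionRNN(2) step, following equations (3)–(8); all names below are our own, not part of the assignment.

```python
import torch

def attention_rnn2_step(x_t, h_prev1, h_prev2, Q, K, V):
    # Q, K, V are lists [Q_0, Q_1, Q_2], etc., each an n x n matrix
    inputs = [x_t, h_prev1, h_prev2]
    q = [Q[i] @ inputs[i] for i in range(3)]            # q_i[t], eq. (3)
    k = [K[i] @ inputs[i] for i in range(3)]            # k_i[t], eq. (4)
    v = [V[i] @ inputs[i] for i in range(3)]            # v_i[t], eq. (5)
    w = torch.stack([q[i] @ k[i] for i in range(3)])    # w_i[t] = q_i[t]^T k_i[t], eq. (6)
    a = torch.softmax(w, dim=0)                         # a[t] = softargmax(w[t]), eq. (7)
    return sum(a[i] * v[i] for i in range(3))           # h[t] = sum_i a_i[t] v_i[t], eq. (8)

n = 4
Q = [torch.randn(n, n) for _ in range(3)]
K = [torch.randn(n, n) for _ in range(3)]
V = [torch.randn(n, n) for _ in range(3)]
h1 = h2 = torch.zeros(n)                                # h[t] = 0 for t < 1
h_t = attention_rnn2_step(torch.randn(n), h1, h2, Q, K, V)
```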
(a) (4 pts) Draw a diagram for this recurrent neural network.
(b) (1 pt) What is the dimension of a[t]?
(c) (3 pts) Extend this to AttentionRNN(k), a network that uses the last k state vectors h. Write out the system of equations that defines it. You may use set notation or ellipses (…) in your definition.
(d) (3 pts) Modify the above network to produce AttentionRNN(∞), a network
that uses every past state vector. Write out the system of equations that defines it. You may use set notation or ellipses (…) in your definition. HINT:
We can do this by tying together some set of parameters, e.g. weight sharing.
(e) (5 pts) Suppose the loss ℓ is computed. Please write down the expression for ∂h[t]/∂h[t−1] for AttentionRNN(2).
(f) (2 pts) Suppose we know the derivatives ∂h[t]/∂h[T] and ∂ℓ/∂h[t] for all t > T. Please write down the expression for ∂ℓ/∂h[T] for AttentionRNN(k).
1.3 Debugging loss curves (5 pts)
When working with the notebook 08-seq_classification, we saw RNN training curves. In Section 8, “Visualize LSTM”, we observed some “kinks” in the loss curve.
[Figure: training curves from Section 8. Left panel: train and test loss vs. epoch (0–100). Right panel: train and test accuracy vs. epoch (0–100).]
1. (1 pt) What caused the spikes on the left?
2. (1 pt) How can they be higher than the initial value of the loss?
3. (1 pt) What are some ways to fix them?
4. (2 pts) Explain why the loss and accuracy are at these values before training starts. You may need to check the task definition in the notebook.
2 Implementation (50 pts + 5 pts extra credit)
There are three notebooks in the practical part:
• (25 pts) Convolutional Neural Networks notebook: hw2_cnn.ipynb
• (20 pts) Recurrent Neural Networks notebook: hw2_rnn.ipynb
• (5 pts + 5 pts extra credit) 08-seq_classification.ipynb: this builds on Section 1.3 of the theoretical part.
– (5 pts) Change the model training procedure of Section 8 in 08-seq_classification so that the training curves have no spikes. You should only change the training of the model, not the model itself or the random seed.
– (5 pts extra credit) Visualize the gradients and weights throughout training, before and after you fix the training procedure.
Please use your NYU Google Drive account to access the notebooks. The first two notebooks contain parts marked TODO, where you should put your code. These are Google Colab notebooks: you should copy them to your Drive, add your solutions, and then download and submit them to NYU Brightspace. The notebook from class, if needed, can be uploaded to Colab as well.