CSE 891-001: Deep Learning Homework 3


1 Binary Addition (2.5pts): In this problem, you will design a recurrent neural network to implement
binary addition. The inputs are provided as binary sequences, starting with the least significant bit. The
sequences are padded with at least one zero at the end. Here is an example:
0101101 + 0111010 = 1100111
where the inputs and outputs would be represented as:
• Input 1: 1, 0, 1, 1, 0, 1, 0
• Input 2: 0, 1, 0, 1, 1, 1, 0
• Correct Output: 1, 1, 1, 0, 0, 1, 1
At each time step, the RNN has two inputs and one output. Therefore, the pattern of inputs
and outputs for the above example would be:
• Inputs: (1, 0), (0, 1), (1, 0), (1, 1), (0, 1), (1, 1), (0, 0)
• Outputs: 1, 1, 1, 0, 0, 1, 1
Design, by hand, the weights and biases of an RNN that performs binary addition. The RNN should
consist of two inputs, three hidden nodes, and one output. You may assume that all nodes use the
hard-threshold activation function. In particular, find the weight matrices W_xh, W_hh, and W_hy, the
bias vector b_h, and the scalar bias b_y.
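Before committing to a design, it can help to simulate it on the example above. Below is a minimal NumPy harness, assuming the hard threshold outputs 1 when the pre-activation is positive; the zero-filled weights are placeholders to be replaced with the values you design by hand, and all names here are illustrative rather than part of the assignment.

```python
import numpy as np

def hard_threshold(z):
    # Assumed convention: output 1 when the pre-activation is positive, else 0.
    return (z > 0).astype(float)

def run_rnn(W_xh, W_hh, W_hy, b_h, b_y, x1, x2):
    """Run the 2-input / 3-hidden / 1-output RNN over LSB-first bit sequences."""
    h = np.zeros(3)                      # hidden state starts at zero
    outputs = []
    for bits in zip(x1, x2):
        x = np.array(bits, dtype=float)  # (input 1, input 2) at this time step
        h = hard_threshold(W_xh @ x + W_hh @ h + b_h)
        y = hard_threshold(W_hy @ h + b_y)
        outputs.append(int(y.item()))
    return outputs

# Placeholders only -- substitute the weights and biases you design by hand.
W_xh = np.zeros((3, 2))   # input-to-hidden
W_hh = np.zeros((3, 3))   # hidden-to-hidden
W_hy = np.zeros((1, 3))   # hidden-to-output
b_h  = np.zeros(3)        # hidden bias vector
b_y  = np.zeros(1)        # output bias (scalar)

# The worked example from the problem statement, LSB first.
x1       = [1, 0, 1, 1, 0, 1, 0]
x2       = [0, 1, 0, 1, 1, 1, 0]
expected = [1, 1, 1, 0, 0, 1, 1]
print(run_rnn(W_xh, W_hh, W_hy, b_h, b_y, x1, x2), "expected:", expected)
```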
2 LSTM Gradient (3.5pts): In this problem you will derive the backpropagation-through-time equations for a univariate version of the Long Short-Term Memory (LSTM) architecture we saw in class.
Assume that the bias terms are zero, so the LSTM computations are:
i^(t) = σ(w_xi x^(t) + w_hi h^(t−1))
f^(t) = σ(w_xf x^(t) + w_hf h^(t−1))
o^(t) = σ(w_xo x^(t) + w_ho h^(t−1))
g^(t) = tanh(w_xg x^(t) + w_hg h^(t−1))
c^(t) = f^(t) c^(t−1) + i^(t) g^(t)
h^(t) = o^(t) tanh(c^(t))
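As a reference while deriving the backward pass, the forward equations translate directly into a few lines of NumPy. The sketch below mirrors the notation above (scalar state, zero biases); the weight-dictionary layout and function names are illustrative choices, not prescribed by the assignment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x_seq, w):
    """Univariate LSTM forward pass with zero biases.

    `w` holds the scalar weights named as in the problem statement:
    w_xi, w_hi, w_xf, w_hf, w_xo, w_ho, w_xg, w_hg.
    Returns the hidden-state and cell-state sequences.
    """
    h, c = 0.0, 0.0
    hs, cs = [], []
    for x in x_seq:
        i = sigmoid(w["w_xi"] * x + w["w_hi"] * h)   # input gate
        f = sigmoid(w["w_xf"] * x + w["w_hf"] * h)   # forget gate
        o = sigmoid(w["w_xo"] * x + w["w_ho"] * h)   # output gate
        g = np.tanh(w["w_xg"] * x + w["w_hg"] * h)   # candidate cell value
        c = f * c + i * g                            # cell state
        h = o * np.tanh(c)                           # hidden state
        hs.append(h)
        cs.append(c)
    return hs, cs

# Example: constant weights of 0.1 on a short input sequence.
names = ["w_xi", "w_hi", "w_xf", "w_hf", "w_xo", "w_ho", "w_xg", "w_hg"]
hs, cs = lstm_forward([0.5, -1.0, 0.2], {k: 0.1 for k in names})
```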
1. (3pts) Derive the backpropagation-through-time equations for the error signals of the activations and the gates:
h̄^(t) = . . .
c̄^(t) = . . .
ḡ^(t) = . . .
ō^(t) = . . .
f̄^(t) = . . .
ī^(t) = . . .
2. (0.5pt) Derive the update equation for the weight w_xi:
w_xi = . . .
3. (Extra Credit: 0.5pt) Based on your answers above, explain why the gradient does not explode if the
values of the forget gates are very close to 1 and the values of the input and output gates are very
close to 0. Hint: Your answer should involve both h^(t) and c^(t).
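A finite-difference estimate is one way to sanity-check the analytic update from part 2. The sketch below assumes a squared-error loss on the final hidden state, since the problem leaves the loss unspecified; any differentiable loss works the same way, and all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w_xi, w, x_seq, target):
    """Run the univariate LSTM (zero biases) and return an assumed
    squared-error loss on the final hidden state."""
    h, c = 0.0, 0.0
    for x in x_seq:
        i = sigmoid(w_xi * x + w["w_hi"] * h)
        f = sigmoid(w["w_xf"] * x + w["w_hf"] * h)
        o = sigmoid(w["w_xo"] * x + w["w_ho"] * h)
        g = np.tanh(w["w_xg"] * x + w["w_hg"] * h)
        c = f * c + i * g
        h = o * np.tanh(c)
    return 0.5 * (h - target) ** 2

# Arbitrary small test case: random weights, a short sequence, a scalar target.
rng = np.random.default_rng(0)
w = {k: rng.normal() for k in ["w_hi", "w_xf", "w_hf", "w_xo", "w_ho", "w_xg", "w_hg"]}
w_xi, x_seq, target = rng.normal(), rng.normal(size=5), 0.3

# Central difference: the analytic dL/dw_xi from part 2 should match this value.
eps = 1e-6
num_grad = (loss(w_xi + eps, w, x_seq, target) - loss(w_xi - eps, w, x_seq, target)) / (2 * eps)
print("numerical dL/dw_xi ≈", num_grad)
```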
3 Convolutional Neural Networks (2pts):
1. (1pt) Consider a CNN with 4 conv layers as in the diagram below. All 4 conv layers have a kernel
size of 3 × 3. The number after the hyphen specifies the number of output channels or units of a
layer (e.g. the Conv3-64 layer has 64 output channels and FC-1024 has 1024 output units). All max-pool
layers in the diagram have size 2 × 2. Assume zero padding of 1 for the conv layers and a stride of 2 for
the max-pool layers.
x → Conv3-64 → Max-Pool → Conv3-128 → Max-Pool → Conv3-256 → Conv3-256 → Max-Pool → FC-1024 → FC-1000 → Softmax
The input image has size 224 × 224 with 3 channels. Calculate the total number of parameters in the
network, including the bias units. (A counting sketch appears after this problem.)
2. (1pt) Consider the following two ResNet architectures, which differ in the placement of the batch norm and
the residual connection. A “Weight” layer can be either a matrix multiplication or a convolution. Which
architecture is easier to learn in terms of exploding/vanishing gradients? Provide a brief justification
for your answer.
Architecture A: Weight → BatchNorm → ReLU → Weight → BatchNorm → ReLU → + (add skip connection)
Architecture B: Weight → BatchNorm → ReLU → Weight → + (add skip connection) → BatchNorm → ReLU
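As referenced in part 1, one way to organize the parameter count is to apply the usual per-layer formulas (k·k·C_in·C_out + C_out for a conv layer, n_in·n_out + n_out for a fully connected layer). With 3 × 3 kernels and padding 1 the convolutions preserve the spatial size, and each stride-2 max-pool halves it (224 → 112 → 56 → 28). The sketch below walks through the chain in part 1, assuming FC-1024 takes the flattened final feature map as input and that the softmax layer has no parameters; the helper names are illustrative.

```python
def conv_params(c_in, c_out, k=3):
    # k*k*c_in weights per output channel, plus one bias per output channel
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    # one weight per (input, output) pair, plus one bias per output unit
    return n_in * n_out + n_out

h = w = 224                       # 3x3 convs with padding 1 keep the spatial size
total = conv_params(3, 64)        # Conv3-64
h //= 2; w //= 2                  # 2x2 max-pool, stride 2
total += conv_params(64, 128)     # Conv3-128
h //= 2; w //= 2
total += conv_params(128, 256)    # Conv3-256
total += conv_params(256, 256)    # Conv3-256
h //= 2; w //= 2
total += fc_params(h * w * 256, 1024)   # FC-1024 on the flattened feature map
total += fc_params(1024, 1000)          # FC-1000
print(total)
```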
4 Autoregressive Generative Models (2pts): In this question we will consider autoregressive models of
the distribution of images. For simplicity, assume that each conditional
distribution is a Gaussian with a fixed variance, so the model predicts the mean of the next pixel given all
the previous pixels.
1. (1pt) Here we will consider PixelCNN, which models the distribution of images using a convolutional architecture. PixelCNN masks each convolutional filter so that it only sees the pixels that appear
before the current pixel in raster-scan order (see the figure below for a visualization, and the mask sketch after
this problem). Several such layers are stacked sequentially.
(a) Consider a d-layer PixelCNN model. What is the total number of connections? Give your answer in terms of d, k, H, and W; you only need to give O(·), not an exact count.
(b) Suppose that in each step we compute as many matrix-vector multiplications in parallel as we
can. What is the minimum number of sequential operations needed to compute the output of a
PixelCNN, in terms of d, k, H, and W? You only need to give O(·), not an exact count.
2. (1pt) Here we will consider the Multidimensional RNN (MDRNN). This is like the RNNs we discussed
in lecture, except that instead of a 1-D sequence we have a 2-D grid structure. Analogous to
how ordinary RNNs have an input vector and a hidden vector for every time step, MDRNNs have
an input vector and a hidden vector for every grid square. Each hidden unit receives bottom-up
connections from the corresponding input square, as well as recurrent connections from its north
and west neighbors, as shown in the figure below.
The activations are computed as:
h^(i,j) = φ(W_in^T x^(i,j) + W_W^T h^(i−1,j) + W_N^T h^(i,j−1))   (1)
Denote the number of input channels and the number of recurrent neurons at each layer to be k. The
input image size is H × W. For simplicity, we assume there are no bias parameters.
(a) Consider a d-layer MDRNN model. What is the total number of connections? Give your answer
in terms of d, k, H, and W; you only need to give O(·), not an exact count.
(b) Suppose that in each step we compute as many matrix-vector multiplications in parallel as we
can. What is the minimum number of sequential operations needed to compute the output of an
MDRNN, in terms of d, k, H, and W? You only need to give O(·), not an exact count.
3. (Extra Credit: 1pt) What is the benefit of using a PixelCNN over an MDRNN? Discuss the pros and cons
of the two models in terms of their computational and memory complexity, parallelization potential, and
the size of their context windows.
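For concreteness, here is a minimal NumPy sketch of both ideas side by side: the raster-scan mask a PixelCNN applies to a k × k filter (shown in its first-layer form, which excludes the current pixel) and the MDRNN recurrence of Eq. (1), in which each hidden state waits for its two already-computed neighbours. All names (raster_scan_mask, mdrnn_layer, W_in, W_W, W_N) are illustrative, not prescribed by the assignment.

```python
import numpy as np

def raster_scan_mask(k):
    """Mask for a k x k PixelCNN filter (first-layer form): the filter may only
    see pixels strictly before the centre pixel in raster-scan order."""
    mask = np.zeros((k, k))
    mask[: k // 2, :] = 1        # every row above the centre row
    mask[k // 2, : k // 2] = 1   # pixels to the left of the centre
    return mask

print(raster_scan_mask(3))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]

def mdrnn_layer(x, W_in, W_W, W_N, phi=np.tanh):
    """One MDRNN layer over an H x W grid of k-dimensional inputs, no biases,
    following Eq. (1). h[i, j] depends on the already-computed hidden states at
    (i-1, j) and (i, j-1), so the grid is filled sequentially (raster order here)."""
    H, W, _ = x.shape
    k = W_in.shape[1]
    h = np.zeros((H, W, k))
    for i in range(H):
        for j in range(W):
            h_im1_j = h[i - 1, j] if i > 0 else np.zeros(k)   # neighbour (i-1, j)
            h_i_jm1 = h[i, j - 1] if j > 0 else np.zeros(k)   # neighbour (i, j-1)
            h[i, j] = phi(W_in.T @ x[i, j] + W_W.T @ h_im1_j + W_N.T @ h_i_jm1)
    return h

# Example: an 8 x 8 grid with k = 4 channels and random weight matrices.
rng = np.random.default_rng(0)
k = 4
hidden = mdrnn_layer(rng.normal(size=(8, 8, k)), *(rng.normal(size=(k, k)) for _ in range(3)))
```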