CSC546 Homework Assignment 4


Part-1: Basic Concepts
1. Backpropagation in a Neural Network (7 points)
π‘₯π‘₯, 𝑀𝑀1, 𝑀𝑀2, 𝑀𝑀3, 𝑀𝑀4, 𝑏𝑏1, 𝑏𝑏2, 𝑏𝑏3, β„Ž1, β„Ž2, β„Ž3 are scalars
Compute the derivatives of the loss 𝐿𝐿 with respect to parameters and input, assuming πœ•πœ•πœ•πœ•
πœ•πœ•β„Ž3
is known.
Example:
This is correct: πœ•πœ•πœ•πœ•
πœ•πœ•π‘€π‘€1
= πœ•πœ•πœ•πœ•
πœ•πœ•β„Ž3
πœ•πœ•β„Ž3
πœ•πœ•β„Ž1
πœ•πœ•β„Ž1
πœ•πœ•π‘€π‘€1
= πœ•πœ•πœ•πœ•
πœ•πœ•β„Ž3
𝑓𝑓3
β€²
𝑀𝑀3𝑓𝑓1
β€²
π‘₯π‘₯ you get 1 point
This partial solution is not acceptable: πœ•πœ•πœ•πœ•
πœ•πœ•π‘€π‘€1
= πœ•πœ•πœ•πœ•
πœ•πœ•β„Ž3
πœ•πœ•β„Ž3
πœ•πœ•β„Ž1
πœ•πœ•β„Ž1
πœ•πœ•π‘€π‘€1
, you get 0 point
πœ•πœ•πœ•πœ•
πœ•πœ•π‘€π‘€2

πœ•πœ•πœ•πœ•
πœ•πœ•π‘€π‘€3

πœ•πœ•πœ•πœ•
πœ•πœ•π‘€π‘€4

πœ•πœ•πœ•πœ•
πœ•πœ•π‘π‘1

πœ•πœ•πœ•πœ•
πœ•πœ•π‘π‘2

πœ•πœ•πœ•πœ•
πœ•πœ•π‘π‘3

πœ•πœ•πœ•πœ•
πœ•πœ•πœ•πœ•
𝑦𝑦� = 𝑓𝑓3(𝑀𝑀3β„Ž1 + 𝑀𝑀4β„Ž2 + 𝑏𝑏3)
β„Ž1 = 𝑓𝑓1(𝑀𝑀1π‘₯π‘₯ + 𝑏𝑏1)
β„Ž2 = 𝑓𝑓2(𝑀𝑀2π‘₯π‘₯ + 𝑏𝑏2)
𝑓𝑓𝑛𝑛
β€² = πœ•πœ•π‘“π‘“π‘›π‘›(𝑣𝑣)
πœ•πœ•π‘£π‘£ , n = 1,2,3
2. Computational Graph (20 points)
β„Ž = 2π‘₯π‘₯ + 1
𝑧𝑧 = π‘₯π‘₯2 + β„Ž
𝑦𝑦 = 1
1 + π‘’π‘’βˆ’β„Ž
(1) Draw the computational graph based on the above three equations (1 point)
(2) What is πœ•πœ•πœ•πœ•
πœ•πœ•πœ•πœ• from the graph? (19 points)
3. Target (output) Normalization for a Neural Network
Usually, we need to apply normalization/standardization to the inputs for classification and regression tasks, so
that the input will be in the range of 0 to 1, or -1 to +1. For example, if the input is an image, then every pixel
value is divided by 255, so that the pixel values of the normalized image are in the range of 0 to 1. Input
normalization facilitates the convergence of training algorithms.
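For example, the divide-by-255 input normalization described above looks like this (a sketch, with a random toy image standing in for real data):

import numpy as np

# A toy 8-bit image; each pixel value is in 0..255.
image_u8 = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

image_01 = image_u8.astype(np.float32) / 255.0   # range 0 to 1
image_pm1 = image_01 * 2.0 - 1.0                 # range -1 to +1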
We may also need to apply normalization to the output. Assume the input is an image of a person and the output
vector has two components, $\hat{y}^{(1)}$ and $\hat{y}^{(2)}$: $\hat{y}^{(1)}$ is the monthly income (in the range of 0 to 10,000), and $\hat{y}^{(2)}$ is
the age (in the range of 0 to 100). The MSE loss for a single data sample is
$L = (\hat{y}^{(1)} - y^{(1)})^2 + (\hat{y}^{(2)} - y^{(2)})^2$
where $y^{(1)}$ and $y^{(2)}$ are the ground-truth values of the input data sample.
Question: is normalization of the output targets $y^{(1)}$ and $y^{(2)}$ necessary for this task? Why?
If it is necessary, what normalization can be applied?
4. Activation Functions for Regression
Neural networks can be used for regression. To model a nonlinear input-output relationship, a neural network needs
nonlinear activation functions in the hidden layers. Usually, the output layer does not need a nonlinear activation
function. However, sometimes there are requirements on the outputs. For example, if the output is the sale price of
a house, then the output should be nonnegative.
Assume $z$ is the scalar output of a network, and the network does not have a nonlinear activation function in the
output layer. Now, there is some requirement for the output, and you decide to add a nonlinear activation function.
You design nonlinear activation functions for three different requirements:
(1) the final output $y$ should be nonnegative ($y \ge 0$); what is the activation function $y = f(z)$?
(2) the final output $y$ should be nonpositive ($y \le 0$); what is the activation function $y = f(z)$?
(3) the final output $y$ should satisfy $a \le y \le b$; what is the activation function $y = f(z)$?
You may use a combination of the basic activation functions that you can find in the lecture notes or the
documentation of Keras and Pytorch. Do NOT use if statements; for example:
def activation(z):
    if z < a: return a
    elif z > b: return b
    else: return z
This is not an acceptable answer.
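For reference, the structural question is only where $f(z)$ sits in the network. The sketch below (with arbitrary layer sizes and an nn.Identity placeholder rather than any answer) shows the position of the output activation in PyTorch:

import torch
import torch.nn as nn

# Where an output activation sits: after the final linear layer that produces z.
# nn.Identity() is a placeholder; designing the actual f(z) for requirements
# (1)-(3) is the exercise.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.Tanh(),          # hidden-layer nonlinearity
    nn.Linear(16, 1),   # produces z
    nn.Identity(),      # replace with your designed y = f(z)
)
y = model(torch.randn(4, 8))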
5. Normalization Inside a Neural Network
To facilitate the convergence of training a deep neural network, it is necessary to normalize the output or input of
each layer of a neural network.
Read the paper https://arxiv.org/abs/1803.08494 and answer the questions:
(1) Batch Normalization will be highly unstable if batch_size is very small. Why?
(2) Why is Layer Normalization independent of batch_size?
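For orientation before reading the paper, the four normalization variants it compares are available as PyTorch layers. The tensor shape below is an arbitrary assumption for a 1D feature map:

import torch
import torch.nn as nn

x = torch.randn(16, 32, 64)    # (batch, channels, length); an illustrative shape

bn = nn.BatchNorm1d(32)        # statistics over batch and length, per channel
ln = nn.LayerNorm([32, 64])    # statistics over all features of one sample
inorm = nn.InstanceNorm1d(32)  # statistics per (sample, channel)
gn = nn.GroupNorm(8, 32)       # statistics per sample over groups of channels

for layer in (bn, ln, inorm, gn):
    print(layer(x).shape)      # every variant preserves the input shape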
6. Skip Connections in a Deep Neural Network
Read: https://arxiv.org/abs/1512.03385 and answer the question
Why are skip/residual connections useful to build a deep network?
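As a concrete reference for what a skip/residual connection means here (the same xc = xa + xb pattern used in the Part-2 diagrams), a minimal sketch with an arbitrary width of 64:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # y = x + F(x): the identity (skip) path carries x around the transform F.
    def __init__(self, dim=64):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, xa):
        xb = self.act(self.fc(xa))   # transformed path F(x)
        return xa + xb               # skip connection: xc = xa + xb

out = ResidualBlock()(torch.randn(4, 64))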
7. Randomness of a Neural Network
We can train the same network three times on the same dataset to get model-1, model-2, and model-3. However,
the performance of these models on the test set could be different.
What is the cause of this randomness? If you write a technical paper, which model should you report in
the paper: the worst model, the best model, or all three models?
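Related background (not the expected answer): for debugging, run-to-run variation can be reduced by fixing the seeds of all random number generators in use, e.g.:

import random
import numpy as np
import torch

def set_seed(seed: int = 0):
    # Fix the common RNG sources so that repeated training runs match.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)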
8. ReLU and Piecewise Linear
Prove that an MLP with ReLU activations is a piecewise linear function of the input.
Note: "it is because ReLU is piecewise linear" is not an acceptable answer.
Part-2: Programming
Complete programming tasks H4P2T1 and H4P2T2 using the template ECG_Keras_template.ipynb or
ECG_Pytorch_template.ipynb. You may choose to use Keras or Pytorch.
Grading
The number of points for each question/task:

Question/Task                        Undergrad Student    Graduate Student
1. Backpropagation                   7                    7
2. Computational Graph               20                   20
3. Output Target Normalization       2                    2
4. Activations for Regression        extra 5 points       6
5. Normalization inside Network      2                    2
6. Skip connection                   2                    2
7. Randomness                        2                    1
8. ReLU and piecewise linear         N.A.                 extra 5 points
H4P2T1 (MLP)                         30                   30
H4P2T2 (CNN)                         35                   30
Read the instructions about H4P2T1 and H4P2T2 on the following pages.
In H4P2T1, you will implement an MLP with residual connections for ECG signal classification according to the
diagram below. You will lose points if your network deviates from the diagram: each deviation costs 10 points.
The following count as deviations: missing a connection, adding an extra connection, missing a layer, adding an
extra layer, or using parameters for the pooling layers other than those defined in the diagram. Note: Softmax is
optional in Pytorch, so a missing Softmax in Pytorch is not a deviation.
For these linear/dense layers: you are free to choose the internal parameters (e.g., the number of units in a layer).
For these normalization layers: you choose a feasible normalization method from GroupNorm, InstanceNorm,
BatchNorm, and LayerNorm (see Q5).
For these average pooling layers: the pooling window size (named pool_size in Keras, kernel_size in Pytorch) is
fixed to 2, and the stride is fixed to 2 (it is named strides in Keras). You may add padding if necessary.
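For instance, under these constraints a pooling layer would be declared as follows (a sketch, shown in PyTorch, with the Keras equivalent noted in a comment):

import torch.nn as nn

# The fixed pooling configuration in PyTorch; the Keras equivalent is
# tf.keras.layers.AveragePooling1D(pool_size=2, strides=2).
pool = nn.AvgPool1d(kernel_size=2, stride=2)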
The accuracy on the test set should be about 89%. You will get a zero score if the test accuracy is below 85%.
[Diagram for H4P2T1: an MLP for ECG classification. Input x (2D); Linear/Dense layers linear1 to linear6, each except linear6 followed by ReLU; Normalization layers norm3, norm4, norm5; three residual add ("+") nodes; average pooling layers avg1, avg2, avg3; output z, followed by Softmax to produce output y_hat. Every pooling layer is 1D average pooling with window size = 2 and stride = 2. A reshape to add/remove the channel axis is used only in Keras. Each "+" node computes xc = xa + xb.]
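The Keras-only reshape mentioned in the diagram exists because Keras 1D pooling expects a channel axis. A sketch of the add/remove round trip (the sizes 8 and 128 are arbitrary):

import tensorflow as tf

x2d = tf.random.normal([8, 128])        # (batch, features): a 2D toy input
x3d = tf.expand_dims(x2d, axis=-1)      # add channel axis -> (8, 128, 1)
pooled = tf.keras.layers.AveragePooling1D(pool_size=2, strides=2)(x3d)
x2d_back = tf.squeeze(pooled, axis=-1)  # remove channel axis -> (8, 64)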
In H4P2T2, you will implement a CNN with residual connections for ECG signal classification according to the
diagram below. You will lose points if your network deviates from the diagram: each deviation costs 10 points.
The following count as deviations: missing a connection, adding an extra connection, missing a layer, adding an
extra layer, or using parameters for the pooling layers other than those defined in the diagram. Note: Softmax is
optional in Pytorch, so a missing Softmax in Pytorch is not a deviation.
For these convolution and linear/dense layers: you are free to choose the internal parameters (e.g., the number
of kernels/filters, the number of input channels, the number of output channels, stride, padding, etc.).
For these normalization layers: you choose a feasible normalization method from GroupNorm, InstanceNorm,
BatchNorm, and LayerNorm (see Q5).
For these average pooling layers: the pooling window size (named pool_size in Keras, kernel_size in Pytorch) is
fixed to 2, and the stride is fixed to 2 (it is named strides in Keras). You may add padding if necessary.
The accuracy on the test set should be about 90%. You will get a zero score if the test accuracy is below 85%.
[Diagram for H4P2T2: a CNN for ECG classification. Input x (3D); Convolution layers conv1 to conv5, each followed by ReLU; Normalization layers norm3, norm4, norm5; three residual add ("+") nodes; average pooling layers avg1, avg2, avg3; a reshape/view/flatten from 3D to 2D; Linear/Dense fc1 followed by ReLU; Linear/Dense fc2 producing output z; Softmax producing output y_hat. Every pooling layer is 1D average pooling with window size = 2 and stride = 2. Each "+" node computes xc = xa + xb.]
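One step the CNN diagram calls out explicitly is the 3D-to-2D reshape/view/flatten before fc1. A minimal PyTorch sketch (the channel, length, and class counts are placeholders):

import torch
import torch.nn as nn

x = torch.randn(8, 16, 32)            # (batch, channels, length) after conv/pool
x2d = torch.flatten(x, start_dim=1)   # 3D -> 2D: (8, 16*32) for the dense layers
z = nn.Linear(16 * 32, 5)(x2d)        # fc layer; 5 output classes is an assumption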