# CSC546 Homework Assignment 4

## Description

Part-1: Basic Concepts
1. Backpropagation in a Neural Network (7 points)
$x, w_1, w_2, w_3, w_4, b_1, b_2, b_3, h_1, h_2, h_3$ are scalars.
Compute the derivatives of the loss $L$ with respect to the parameters and the input, assuming $\frac{\partial L}{\partial h_3}$ is known.

Example:
This is correct (you get 1 point):

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial h_3} \frac{\partial h_3}{\partial h_1} \frac{\partial h_1}{\partial w_1} = \frac{\partial L}{\partial h_3} f_3' w_3 f_1' x$$

This partial solution is not acceptable (you get 0 points):

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial h_3} \frac{\partial h_3}{\partial h_1} \frac{\partial h_1}{\partial w_1}$$

Derivatives to compute:

$$\frac{\partial L}{\partial w_2}, \quad \frac{\partial L}{\partial w_3}, \quad \frac{\partial L}{\partial w_4}, \quad \frac{\partial L}{\partial b_1}, \quad \frac{\partial L}{\partial b_2}, \quad \frac{\partial L}{\partial b_3}, \quad \frac{\partial L}{\partial x}$$

Network definition:

$$\hat{y} = h_3 = f_3(w_3 h_1 + w_4 h_2 + b_3)$$

$$h_1 = f_1(w_1 x + b_1)$$

$$h_2 = f_2(w_2 x + b_2)$$

$$f_n' = \frac{\partial f_n(v)}{\partial v}, \quad n = 1, 2, 3$$
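
The worked example for $\frac{\partial L}{\partial w_1}$ can be checked numerically. The following is a minimal sketch, not part of the assignment: it assumes tanh for all three activations, a squared-error loss with a made-up target, and arbitrary parameter values. It also takes $h_3$ to be the network output $f_3(w_3 h_1 + w_4 h_2 + b_3)$, which is what the factor $f_3' w_3$ in the example implies. The names `forward`, `loss`, `params`, and `target` are hypothetical.

```python
import math

# Assumptions for illustration only: tanh activations, squared-error loss,
# and arbitrary parameter values. The assignment does not fix any of these.
def f(v):
    return math.tanh(v)

def f_prime(v):
    return 1.0 - math.tanh(v) ** 2

def forward(x, w1, w2, w3, w4, b1, b2, b3):
    """Return the pre-activations v1, v2, v3 and the values h1, h2, h3."""
    v1 = w1 * x + b1
    v2 = w2 * x + b2
    h1, h2 = f(v1), f(v2)
    v3 = w3 * h1 + w4 * h2 + b3
    h3 = f(v3)  # h3 is taken to be the network output
    return v1, v2, v3, h1, h2, h3

def loss(h3, target):
    return (h3 - target) ** 2

params = dict(x=0.7, w1=0.3, w2=-0.5, w3=0.8, w4=0.2, b1=0.1, b2=-0.2, b3=0.05)
target = 0.5

v1, v2, v3, h1, h2, h3 = forward(**params)
dL_dh3 = 2.0 * (h3 - target)  # the "known" upstream gradient

# Chain-rule result from the example: dL/dw1 = dL/dh3 * f3' * w3 * f1' * x
analytic = dL_dh3 * f_prime(v3) * params["w3"] * f_prime(v1) * params["x"]

# Central finite difference on w1 as an independent check
eps = 1e-6
plus = dict(params, w1=params["w1"] + eps)
minus = dict(params, w1=params["w1"] - eps)
numeric = (loss(forward(**plus)[-1], target)
           - loss(forward(**minus)[-1], target)) / (2 * eps)

print(analytic, numeric)
```

The two printed numbers should agree to several decimal places. The same finite-difference check can be reused to test any other derivative you derive by hand.
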
2. Computational Graph (20 points)

$$h = 2x + 1$$

$$z = x^2 + h$$

$$y = \frac{1}{1 + e^{-h}}$$

(1) Draw the computational graph based on the above three equations (1 point)
(2) What is $\frac{\partial y}{\partial x}$ from the graph? (19 points)
3. Target (output) Normalization for a Neural Network
Usually, we need to apply normalization/standardization to the inputs for classification and regression tasks, so
that the input will be in the range of 0 to 1, or -1 to +1. For example, if the input is an image, then every pixel
value is divided by 255, so that the pixel values of the normalized image are in the range of 0 to 1. Input
normalization facilitates the convergence of training algorithms.
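
The pixel scaling described above can be sketched as follows. This is an illustration only, assuming 8-bit grayscale pixels stored as nested lists; the function names `scale_to_unit` and `scale_to_symmetric` and the sample image are hypothetical.

```python
def scale_to_unit(pixels, max_value=255.0):
    """Map raw pixel intensities in [0, max_value] to [0, 1]."""
    return [[p / max_value for p in row] for row in pixels]

def scale_to_symmetric(pixels, max_value=255.0):
    """Map raw pixel intensities in [0, max_value] to [-1, 1]."""
    return [[2.0 * p / max_value - 1.0 for p in row] for row in pixels]

image = [[0, 128, 255], [64, 32, 16]]  # hypothetical 2x3 grayscale image
unit = scale_to_unit(image)
symmetric = scale_to_symmetric(image)
print(unit[0])       # first row scaled to [0, 1]
print(symmetric[0])  # first row scaled to [-1, 1]
```
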
We may also need to apply normalization to the output. Assume the input is an image of a person and the output
vector has two components, $\hat{y}^{(1)}$ and $\hat{y}^{(2)}$: $\hat{y}^{(1)}$ is the monthly income (in the range of 0 to 10,000), and $\hat{y}^{(2)}$ is
the age (in the range of 0 to 100). The MSE loss for a single data sample is

$$L = (\hat{y}^{(1)} - y^{(1)})^2 + (\hat{y}^{(2)} - y^{(2)})^2$$

where $y^{(1)}$ and $y^{(2)}$ are the ground truth values for an input data sample.
Question: is normalization of the output (i.e., of the output targets $y^{(1)}$ and $y^{(2)}$) necessary for this task? Why?
If it is necessary, what normalization can be applied?
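
To make the loss formula concrete, the sketch below evaluates it for one made-up sample. The prediction and ground truth values are hypothetical, and `mse_single_sample` is an illustrative name, not something defined in the assignment.

```python
def mse_single_sample(y_hat, y):
    """Sum of squared errors over the output components of one sample."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y))

# Hypothetical prediction and ground truth: (monthly income, age)
y_hat = (5200.0, 38.0)
y = (5000.0, 35.0)

income_term = (y_hat[0] - y[0]) ** 2
age_term = (y_hat[1] - y[1]) ** 2
print(income_term, age_term, mse_single_sample(y_hat, y))  # 40000.0 9.0 40009.0
```
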
4. Activation Functions for Regression
Neural networks can be used for regression. To model a nonlinear input-output relationship, a neural network needs
nonlinear activation functions in the hidden layers. Usually, the output layer does not need a nonlinear activation
function. However, sometimes there are requirements on the outputs. For example, if the output is the sale price of
a house, then the output should be nonnegative.
Assume $z$ is the scalar output of a network that has no nonlinear activation function in the
output layer. Now there is a requirement on the output, and you decide to add a nonlinear activation function.
Design a nonlinear activation function for each of three different requirements:
(1) the final output $y$ should be nonnegative ($y \geq 0$): what is the activation function $y = f(z)$?
(2) the final output $y$ should be nonpositive ($y \leq 0$): what is the activation function $y = f(z)$?
(3) the final output $y$ should satisfy $a \leq y \leq b$: what is the activation function $y = f(z)$?
You may use a combination of the basic activation functions that you can find in the lecture notes or the
documentation of Keras and PyTorch. Do NOT use if statements, for example:

    def activation(z):
        if z < a: return a
        elif z > b: return b
        else: return z

This is not an acceptable answer.
5. Normalization Inside a Neural Network
To facilitate the convergence of training a deep neural network, it is necessary to normalize the output or input of
each layer of a neural network.
(1) Batch Normalization will be highly unstable if batch_size is very small. Why?
(2) Why is Layer Normalization independent of batch_size?
6. Skip Connections in a Deep Neural Network
Why are skip/residual connections useful to build a deep network?
7. Randomness of a Neural Network
We can train the same network three times on the same dataset to get model-1, model-2, and model-3. However,
the performance of these models on the test set could be different.
What is the cause of this randomness? If you write a technical paper, which model should you report
in the paper: the worst model, the best model, or all three models?
8. ReLU and Piecewise Linear
Prove that an MLP with ReLU activations is a piecewise linear function of the input.
Note: "it is because ReLU is piecewise linear" is not an acceptable answer.
Part-2: Programming
Programming tasks: H4P2T1 and H4P2T2, using the template ECG_Keras_template.ipynb or
ECG_Pytorch_template.ipynb. You may choose to use Keras or PyTorch.
The number of points for each question/task:

| Question/task | Points (first column) | Points (second column) |
|---|---|---|
| 1. Backpropagation | 7 | 7 |
| 2. Computational Graph | 20 | 20 |
| 3. Output Target Normalization | 2 | 2 |
| 4. Activations for Regression | extra 5 points | 6 |
| 5. Normalization inside Network | 2 | 2 |
| 6. Skip connection | 2 | 2 |
| 7. Randomness | 2 | 1 |
| 8. ReLU and piecewise linear | N.A. | extra 5 points |
| H4P2T1 (MLP) | 30 | 30 |
| H4P2T2 (CNN) | 35 | 30 |

In H4P2T1, you will implement an MLP with residual connections for ECG signal classification according to the
diagram below. You will lose points if your network deviates from the diagram: each deviation costs 10 points.
The following count as deviations: missing a connection, adding an extra connection, missing a layer, adding an extra layer,
or using parameters for the pooling layers other than those defined in the diagram. Note: Softmax is optional in
PyTorch, so missing Softmax in PyTorch is not a deviation.
For the linear/dense layers: you are free to choose the internal parameters (e.g., the number of units in a layer).
For the normalization layers: choose a feasible normalization method from GroupNorm, InstanceNorm,
BatchNorm, and LayerNorm (see Q5).
For the average pooling layers: the pooling window size (named pool_size in Keras, kernel_size in PyTorch) is
fixed to 2, and the stride is fixed to 2 (it is named strides in Keras). You may add padding if necessary.
The expected accuracy on the test set is about 89%. You will get a score of zero if the test accuracy is below 85%.
[Figure: network diagram for H4P2T1 (MLP with residual connections). Block labels, in the order they were extracted: Input x (2D); Linear/Dense (linear1); ReLU; Linear/Dense (linear2); ReLU; +; Normalization (norm3); Linear/Dense (linear3); ReLU; +; pooling (avg1); pooling (avg2); Normalization (norm4); Linear/Dense (linear4); ReLU; +; pooling (avg3); Normalization (norm5); Linear/Dense (linear5); ReLU; Linear/Dense (linear6); Output z; Softmax; Output y_hat. Each pooling block (avg1, avg2, avg3) is annotated "average pooling (1D), window size = 2, stride = 2". One annotation reads "channel axis, only used in Keras". Legend for the + node, with inputs xa and xb and output xc: xc = xa + xb.]
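
The two fixed building blocks named above, average pooling with window size 2 and stride 2 and the + node with xc = xa + xb, can be sketched in plain Python. This only illustrates the arithmetic; it is not the Keras or PyTorch layer you should use in the notebook. The helper names and sample values are hypothetical, and the addition is assumed to be element-wise over branches of equal length.

```python
def avg_pool_1d(signal, window=2, stride=2):
    """Average pooling over a 1D sequence, without padding."""
    return [
        sum(signal[i:i + window]) / window
        for i in range(0, len(signal) - window + 1, stride)
    ]

def residual_add(xa, xb):
    """The '+' node: element-wise sum of two equal-length branches (assumed)."""
    if len(xa) != len(xb):
        raise ValueError("branches must have the same length to be added")
    return [a + b for a, b in zip(xa, xb)]

xa = [1.0, 3.0, 5.0, 7.0]  # hypothetical skip branch, length 4
xb = [0.5, 0.5, 1.0, 1.0]  # hypothetical main branch, length 4
xc = residual_add(xa, xb)
pooled = avg_pool_1d(xc)
print(xc)      # [1.5, 3.5, 6.0, 8.0]
print(pooled)  # [2.5, 7.0]
```

Without padding, this sketch drops the last element of an odd-length input. Pooling also halves the length, so a branch that is pooled can only be added to another branch of the same halved length.
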
In H4P2T2, you will implement a CNN with residual connections for ECG signal classification according to the
diagram below. You will lose points if your network deviates from the diagram: each deviation costs 10 points.
The following count as deviations: missing a connection, adding an extra connection, missing a layer, adding an extra layer,
or using parameters for the pooling layers other than those defined in the diagram. Note: Softmax is optional in
PyTorch, so missing Softmax in PyTorch is not a deviation.
For the convolution and linear/dense layers: you are free to choose the internal parameters (e.g., the number
of kernels/filters, the number of input channels, the number of output channels, stride, padding, etc.).
For the normalization layers: choose a feasible normalization method from GroupNorm, InstanceNorm,
BatchNorm, and LayerNorm (see Q5).
For the average pooling layers: the pooling window size (named pool_size in Keras, kernel_size in PyTorch) is
fixed to 2, and the stride is fixed to 2 (it is named strides in Keras). You may add padding if necessary.
The expected accuracy on the test set is about 90%. You will get a score of zero if the test accuracy is below 85%.
[Figure: network diagram for H4P2T2 (CNN with residual connections). Block labels, in the order they were extracted: Input x (3D); Convolution (conv1); ReLU; Convolution (conv2); ReLU; +; Normalization (norm3); Convolution (conv3); ReLU; +; pooling (avg1); pooling (avg2); Normalization (norm4); Convolution (conv4); ReLU; +; pooling (avg3); reshape/view/flatten (3D => 2D); Linear/Dense (fc1); ReLU; Normalization (norm5); Convolution (conv5); ReLU; Linear/Dense (fc2); Output z; Softmax; Output y_hat. Each pooling block (avg1, avg2, avg3) is annotated "average pooling (1D), window size = 2, stride = 2". Legend for the + node, with inputs xa and xb and output xc: xc = xa + xb.]
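
The reshape/view/flatten step (3D => 2D) between the convolutional part and the linear/dense layers can be illustrated with nested lists. This is a sketch under an assumed (batch, channels, length) layout; the layout, the helper name `flatten_3d_to_2d`, and the sample data are not specified by the assignment.

```python
def flatten_3d_to_2d(batch):
    """Collapse (batch, channels, length) nested lists to (batch, channels * length)."""
    return [[value for channel in sample for value in channel] for sample in batch]

# Hypothetical feature map: batch of 2 samples, 3 channels, 4 time steps
features = [
    [[float(s * 100 + c * 10 + t) for t in range(4)] for c in range(3)]
    for s in range(2)
]
flat = flatten_3d_to_2d(features)
print(len(flat), len(flat[0]))  # 2 12
```

The second dimension of the flattened tensor (12 in this sketch) is the input size that the first linear/dense layer after the flatten step has to accept.
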