Deep Learning in Hardware (ECE 498/598, Fall 2020), Prof. Naresh Shanbhag
Homework 3 (Assigned: Oct. 2 – Due: Oct. 9)


Problem 1: Training Neural Networks using Python with Quantized Gradients
In this problem, you will need the MNIST dataset from http://yann.lecun.com/exdb/mnist/. You can load the dataset directly using the PyTorch API. Your job is to write Python code to train a neural network with an MLP architecture of 784-512-256-256-10 that performs the classification task on the MNIST dataset. You are to use the vanilla version of SGD. You are encouraged to read through all questions in this problem before starting to code.
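A minimal sketch of this setup is shown below. It assumes torchvision is available for loading MNIST; the batch size, learning rate, and epoch count are placeholder values that you should tune yourself.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# 784-512-256-256-10 MLP with ReLU activations
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)

# MNIST via the PyTorch/torchvision API
transform = transforms.ToTensor()
train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=1000)

model = MLP()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # vanilla SGD, no momentum

for epoch in range(30):  # placeholder epoch count
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # test error at the end of each epoch
    model.eval()
    errors = 0
    with torch.no_grad():
        for images, labels in test_loader:
            errors += (model(images).argmax(dim=1) != labels).sum().item()
    print(f"epoch {epoch}: test error = {errors / len(test_set):.4f}")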
1. You can find tutorials on how to adjust the learning rate in PyTorch at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate. In this problem, train the network with (a) a constant learning rate, (b) a cosine annealing learning rate, and (c) a step-decaying learning rate (the learning rate decreases every x epochs). Plot the convergence curve of the test error of your network as a function of time (measured in epochs) for each learning rate scheduler (sketches of the schedulers, gradient-variance logging, and gradient quantization follow this problem).
2. For the model trained using scheduler (a), plot the per-tensor variance of each weight gradient as a function of time (measured in epochs), and record the maximum and minimum values of the recorded variances during training.
3. You are asked to retrain this network with scheduler (a), but using close-to-minimal (CTM) quantized weight gradients. The challenge is to determine a suitable clipping level and quantization step size (Hint: use your answers above; see also the quantizer sketch after this problem). Show convergence curves of training with CTM quantized gradients and report the number of bits used for each tensor.
4. Compare the test error convergence curve obtained using CTM quantized weight gradients in Part 3 with those obtained by quantizing the weight gradients to (a) 6 bits and (b) 7 bits for all layers.
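The sketches below reuse the model and optimizer from the sketch above and show how the schedulers can be created, how per-tensor weight-gradient variances can be recorded, and a simple uniform quantizer with clipping that can be applied to weight gradients before the optimizer step. The bit-width and clipping level are assumptions left for you to choose (e.g., from the variance statistics of Part 2).

import torch

# (a) a constant learning rate needs no scheduler;
# (b) cosine annealing and (c) step decay (every x epochs) wrap the existing optimizer
def make_schedulers(optimizer, epochs=30, x=10):
    sched_b = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    sched_c = torch.optim.lr_scheduler.StepLR(optimizer, step_size=x, gamma=0.1)
    return sched_b, sched_c   # call .step() on the chosen scheduler once per epoch

def record_grad_variances(model, history):
    # Per-tensor variance of each weight gradient; call after loss.backward()
    # (e.g., once per epoch on the last batch, or averaged over batches).
    for name, p in model.named_parameters():
        if p.grad is not None and p.dim() == 2:        # weight matrices only
            history.setdefault(name, []).append(p.grad.var().item())

def quantize_weight_grads(model, bits, clip):
    # Uniform quantization with clipping, applied to weight gradients before optimizer.step().
    # 'bits' and 'clip' are placeholders; derive them from the recorded variances.
    step = 2 * clip / (2 ** bits - 1)                  # quantization step size
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None and p.dim() == 2:
                p.grad.clamp_(-clip, clip).div_(step).round_().mul_(step)

Whether a single clipping level is shared across all layers or a separate value is derived per tensor is a design choice the sketch leaves open.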
Problem 2: Deriving Weight Initialization Formulas
In this problem, your task is to derive the two key equations utilized in the He initialization scheme (https://arxiv.org/abs/1502.01852). In what follows, we assume weights
are zero-mean and independent. Further, activations are independent but have a non-zero mean
(because of their rectifying nature).
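For concreteness, a minimal statement of the assumed layer model (following the notation of the He et al. paper; adapt it if your course notes differ) is:
\[
y_l = W_l\, x_l + b_l, \qquad x_l = \mathrm{ReLU}(y_{l-1}) = \max(0,\, y_{l-1}),
\]
so that each element of $y_l$ is a dot product of length $n_l$ between a row of $W_l$ and the activation vector $x_l$.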
1. Show that during the forward propagation we have:
\[
\mathrm{Var}(y_L) = \mathrm{Var}(y_1) \prod_{l=2}^{L} \left( \frac{1}{2}\, n_l \,\mathrm{Var}(w_l) \right)
\]
where $y_l$, $w_l$, and $n_l$ are the pre-activations, weights, and forward dot-product length at layer $l$, and $L$ is the number of layers.
2. Show that during the backward propagation we have:
\[
\mathrm{Var}(\Delta x_2) = \mathrm{Var}(\Delta x_{L+1}) \prod_{l=2}^{L} \left( \frac{1}{2}\, \hat{n}_l \,\mathrm{Var}(w_l) \right)
\]
where $\Delta x_{l+1}$ and $\hat{n}_l$ are the activation gradients and backward dot-product length at layer $l$.
3. In a convolutional layer, how are the forward dot-product length $n_l$ and the backward dot-product length $\hat{n}_l$ related?
4. Explain how the He initialization prevents the vanishing or explosion of activations in the forward propagation (a PyTorch usage sketch of He initialization follows this problem).
5. Explain how the He initialization prevents the vanishing or explosion of gradients in
the backward propagation (HINT: the answer to this question is not the same as that
of the previous one).
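For reference when answering Parts 4 and 5, a minimal sketch of how He initialization is applied in PyTorch is shown below; mode='fan_in' scales by the forward dot-product length and mode='fan_out' by the backward one (the layer sizes are placeholders).

import torch.nn as nn

layer = nn.Linear(512, 256)  # placeholder layer sizes

# He (Kaiming) initialization: Var(w) = 2 / fan, where the factor 2 compensates
# for the halving of variance caused by the ReLU nonlinearity.
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')     # controls forward (activation) variance
# nn.init.kaiming_normal_(layer.weight, mode='fan_out', nonlinearity='relu')  # controls backward (gradient) variance
nn.init.zeros_(layer.bias)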
Problem 3: BatchNorm Absorption
Recall that in Batch Normalization, the output ⟨w, x⟩ of a layer with activation or feature map x and weight or filter w is transformed as:
\[
\gamma \,\frac{\langle w, x\rangle - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
\]
where γ, β, µ, and σ are the BatchNorm (BN) parameters, ε is a numerical stability constant, and ⟨·, ·⟩ denotes the dot product or convolution depending on the context.
If only inference needs to be performed, it is possible to eliminate the extra computations required by BN by absorbing its parameters into the weights. This can be done by rewriting the above equation as:
\[
\langle \hat{w}, x\rangle + b
\]
where ŵ and b are a new weight/filter and a bias, respectively.
1. Derive expressions for ŵ and b as functions of w, γ, β, µ, σ, and ε.
2. For the case of convolutions, write a Python script that returns the BN-absorbed 4D weight tensor ŵ and the 1D bias vector b for a standard 2D convolution, given an original weight tensor w and BN parameters γ, β, µ, σ, and ε. You only need to submit a script for this question (Hint: the solution is not trivial and requires broadcasting).
3. For the MLP model in Problem 1, add a BN layer after each of the first three fully connected layers and retrain the network from scratch. Plot the convergence of the test error for the MLP before and after adding the BN layers. Pay attention to the order of the BN layer and the ReLU unit (see the layer-ordering sketch after this problem). How many parameters does this new MLP have compared to the old one?
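As a sketch for Part 3, one common ordering places each BN layer between the linear layer and the ReLU; the architecture below mirrors the MLP from Problem 1 with BatchNorm1d added after the first three fully connected layers. Whether you adopt this exact ordering is part of the question.

import torch.nn as nn

mlp_bn = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(),
    nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(),
    nn.Linear(256, 256), nn.BatchNorm1d(256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Each BatchNorm1d(n) adds 2n learnable parameters (gamma and beta),
# plus 2n non-learnable running statistics (running mean and variance).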