ECE 661: Homework 4 Pruning and Fixed-point Quantization


Category: You will Instantly receive a download link for .zip solution file upon Payment


5/5 - (1 vote)

1 True/False Questions (20 pts)

For each question, please provide a short explanation to support your judgment.
Problem 1.1 (2 pts) Generally speaking, the weight pruning does not intervene the weight quantization,
as they are orthogonal techniques in compressing the size of DNN models.

Problem 1.2 (2 pts) In weight pruning techniques, the distribution of the remaining weights does not
affect the inference latency.

Problem 1.3 (2 pts) In deep compression pipeline, even if we skip the quantization step, the pruned
model can still be effectively encoded by the following Huffman coding process as pruning greatly reduces
the number of weight variables.

Problem 1.4 (2 pts) Directly using SGD to optimize a sparsity-inducing regularizer (i.e. L-1, DeepHoyer
etc.) with the training loss will lead to exact zero values in the weight elements, there’s no need to apply
additional pruning step after the optimization process.

Problem 1.5 (2 pts) Using soft thresholding operator will lead to better results comparing to using L-1
regularization directly as it solves the “bias” problem of L-1.

Problem 1.6 (2 pts) Group Lasso can lead to structured sparsity on DNNs, which is more hardwarefriendly. The idea of Group Lasso comes from applying L-2 regularization to the L-1 norm of all of the

Problem 1.7 (2 pts) Proximal gradient descent introduces an additional proximity term to minimize regularization loss in the proximity of weight parameters. The proximity term allows smoother convergence
of the overall objective.

Problem 1.8 (2 pts) Models equipped with early exits allows some inputs to be computed with only
part of the model, thus overcoming the issue of overfitting and overthinking.

Problem 1.9 (2 pts) When implementing quantization-aware training with STE, gradients are quantized
during backpropagation, ensuring that updates are consistent with the quantized weights.

Problem 1.10 (2 pts) Comparing to quantizing all the layers in a DNN to the same precision, mixedprecision quantization scheme can reach higher accuracy with a similar-sized model.

2 Lab 1: Sparse optimization of linear models (30 pts)

By now you have seen multiple ways to induce a sparse solution in the optimization process. This problem
will provide you some examples under linear regression setting so that you can compare the effectiveness
of different methods.

For this problem, consider the case where we are trying to find a sparse weight W
that can minimize L =
(XiW − yi)
. Specifically, we have Xi ∈ R
, W ∈ R
5×1 and ||W||0 ≤ 2.

For Problem (a) – (f), consider the case where we have 3 data points: (X1 = [1, −2, −1, −1, 1], y1 = 7);
(X2 = [2, −1, 2, 0, −2], y2 = 1); (X3 = [−1, 0, 2, 2, 1], y3 = 1). For stability the objective L should be
minimized through full-batch gradient descent, with initial weight W0
set to [0; 0; 0; 0; 0] and use learning
rate µ = 0.02 throughout the process. Please run gradient descent for 200 steps for all the following

You will need to use NumPy to finish this set of questions, please put all your code for this set of
questions into one python/notebook file and submit it on Sakai. Please include all your results, figures
and observations into your PDF report.

(a) (4 pts) Theoretical analysis: with learning rate µ, suppose the weight you have after step k is Wk
derive the symbolic formulation of weight Wk+1 after step k+1 of full-batch gradient descent with
, yi
, i ∈ {1, 2, 3}. (Hint: note the loss L we have is defined differently from standard MSE loss.)

(b) (3 pts) In Python, directly minimize the objective L without any sparsity-inducing regularization/constraint. Plot the value of log(L) vs. #steps throughout the training, and use another
figure to plot how the value of each element in W is changing throughout the training. From
your result, is W converging to an optimal solution? Is W converging to a sparse solution?

(c) (6 pts) Since we have the knowledge that the ground-truth weight should have ||W||0 ≤ 2, we
can apply projected gradient descent to enforce this sparse constraint. Redo the optimization
process in (b), this time prune the elements in W after every gradient descent step to ensure
||0 ≤ 2. Plot the value of log(L) throughout the training, and use another figure to plot
the value of each element in W in each step. From your result, is W converging to an optimal
solution? Is W converging to a sparse solution?

(d) (5 pts) In this problem we apply ℓ1 regularization to induce the sparse solution. The minimization
objective therefore changes to L + λ||W||1. Please use full-batch gradient descent to minimize
this objective, with λ = {0.2, 0.5, 1.0, 2.0} respectively. For each case, plot the value of log(L)
throughout the training, and use another figure to plot the value of each element in W in each
step. From your result, comment on the convergence performance under different λ.

(e) (6 pts) Here we optimize the same objective as in (d), this time using proximal gradient update.
Recall that the proximal operator of the ℓ1 regularizer is the soft thresholding function. Set the
threshold in the soft thresholding function to {0.004, 0.01, 0.02, 0.04} respectively. Plot the value
of log(L) throughout the training, and use another figure to plot the value of each element in W
in each step. Compare the convergence performance with the results in (d). (Hint: Optimizing
L + λ||W||1 using gradient descent with learning rate µ should correspond to proximal gradient
update with threshold µλ)

(f) (6 pts) Trimmed ℓ1 (T ℓ1) regularizer is proposed to solve the “bias” problem of ℓ1. For simplicity
you may implement the T ℓ1 regularizer as applying a ℓ1 regularization with strength λ on the 3
elements of W with the smallest absolute value, with no penalty on other elements. Minimize
L+λT ℓ1(W) using proximal gradient update with λ = {1.0, 2.0, 5.0, 10.0} (correspond the soft
thresholding threshold {0.02, 0.04, 0.1, 0.2}). Plot the value of log(L) throughout the training,
and use another figure to plot the value of each element in W in each step. Comment on the
convergence comparison of the Trimmed ℓ1 and the ℓ1. Also compare the behavior of the early
steps (e.g. first 20) between the Trimmed ℓ1 and the iterative pruning.

3 Lab 2: Pruning ResNet-20 model (25 pts)

ResNet-20 is a popular convolutional neural network (CNN) architecture for image classification. Compared to early CNN designs such as VGG-16, ResNet-20 is much more compact. Thus, conducing the model
compression on ResNet-20 is more challenging.

This lab explores the element-wise pruning of ResNet-20 model on CIFAR-10 dataset. We will observe
the difference between single step pruning and iterative pruning, plus exploring different ways of setting
pruning threshold. Everything you need for this lab can be found in

(a) (2 pts) In hw4.ipynb, run through the first three code block, report the accuracy of the floatingpoint pretrained model.

(b) (6 pts) Complete the implementation of pruning by percentage function in the notebook. Here
we determines the pruning threshold in each DNN layer by the ‘q-th percentile’ value in the
absolute value of layer’s weight element. Use the next block to call your implemented pruning
by percentage. Try pruning percentage q = 0.3, 0.5, 0.7. Report the test accuracy q. (Hint: You
need to reload the full model checkpoint before applying the prune function with a different q ).

(c) (6 pts) Fill in the finetune_after_prune function for pruned model finetuning. Make sure the
pruned away elements in previous step are kept as 0 throughout the finetuning process. Finetune
the pruned model with q=0.7 for 20 epochs with the provided training pipeline. Report the
best accuracy achieved during finetuning. Finish the code for sparsity evaluation to check if the
finetuned model preserves the sparsity.

(d) (5 pts) Implement iterative pruning. Instead of applying single step pruning before finetuning, try
iteratively increase the sparsity of the model before each epoch of finetuning. Linearly increase
the pruning percentage for 10 epochs until reaching 70% in the final epoch (prune (7 × e)%
before epoch e) then continue finetune for 10 epochs. Pruned weight can be recovered during
the iterative pruning process before the final pruning step. Compare performance with (c)

(e) (6 pts) Perform magnitude-based global iterative pruning. Previously we set the pruning threshold of each layer following the weight distribution of the layer and prune all layers to the same
sparsity. This will constrain the flexibility in the final sparsity pattern across layers. In this question, Fill in the global_prune_by_percentage function to perform a global ranking of the weight
magnitude from all the layers, and determine a single pruning threshold by percentage for all the
layers. Repeat iterative pruning to 70% sparsity, and report final accuracy and the percentage of
zeros in each layer.

4 Lab 3: Fixed-point quantization and finetuning (25 pts)

Besides pruning, fixed-point quantization is another important technique applied for deep neural network
compression. In this Lab, you will convert the ResNet-20 model we used in previous lab into a quantized
model, evaluate is performance and apply finetuning on the model.

(a) (10 pts) As is mentioned in lecture 15, to train a quantized model we need to use floatingpoint weight as trainable variable while use a straight-through estimator (STE) in forward and
backward pass to convert the weight into quantized value. Intuitively, the forward pass of STE
converts a float weight into fixed-point, while the backward pass passes the gradient straightly
through the quantizer to the float weight.

To start with, implement the STE forward function in, so that it serves as a linear
quantizer with dynamic scaling, as introduced on page 9 of lecture 15. Please follow the comments in the code to figure out the expected functionality of each line. Take a screen shot of
the finished STE class and paste it into the report. Submission of the file is not
required. (Hint: Please consider zeros in the weight as being pruned away, and build a mask to
ensure that STE is only applied on non-zero weight elements for quantization. )

(b) (2 pts) In hw4.ipynb, load pretrained ResNet-20 model, report the accuracy of the floating-point
pretrained model. Then set Nbits in the first line of block 4 to 6, 5, 4, 3, and 2 respectively, run
it and report the test accuracy you got. (Hint: In this block the line defining the ResNet model
(second line) will set the residual blocks in all three stages to Nbits fixed-point, while keeping
the first conv and final FC layer still as floating point.)

(c) (5 pts) With Nbits set to 4, 3, and 2 respectively, run code block 4 and 5 to finetune the quantized
model for 20 epochs. You do not need to change other parameter in the finetune function. For
each precision, report the highest testing accuracy you get during finetuning. Comment on the
relationship between precision and accuracy, and on the effectiveness of finetuning.

(d) (4 pts) In practice, we want to apply both pruning and quantization on the DNN model. Here we
explore how pruning will affect quantization performance. Please load the checkpoint of the 70%
sparsity model with the best accuracy from Lab 2, repeat the process in (c), report the accuracy
before and after finetuning, and discuss your observations comparing to (c)’s results.

(e) (4 pts) Symmetric quantization is a commonly used and hardware-friendly quantization approach. In symmetric quantization, the quantization levels are symmetric to zero. Implement
symmetric quantization in the STE class and repeat the process in (b). Compare and analyze the
performance of symmetric quantization and asymmetric quantization.