## Description

## 1 True/False Questions (20 pts)

For each question, please provide a short explanation to support your judgment.

Problem 1.1 (2 pts) Generally speaking, the weight pruning does not intervene the weight quantization,

as they are orthogonal techniques in compressing the size of DNN models.

Problem 1.2 (2 pts) In weight pruning techniques, the distribution of the remaining weights does not

affect the inference latency.

Problem 1.3 (2 pts) In deep compression pipeline, even if we skip the quantization step, the pruned

model can still be effectively encoded by the following Huffman coding process as pruning greatly reduces

the number of weight variables.

Problem 1.4 (2 pts) Directly using SGD to optimize a sparsity-inducing regularizer (i.e. L-1, DeepHoyer

etc.) with the training loss will lead to exact zero values in the weight elements, there’s no need to apply

additional pruning step after the optimization process.

Problem 1.5 (2 pts) Using soft thresholding operator will lead to better results comparing to using L-1

regularization directly as it solves the “bias” problem of L-1.

Problem 1.6 (2 pts) Group Lasso can lead to structured sparsity on DNNs, which is more hardwarefriendly. The idea of Group Lasso comes from applying L-2 regularization to the L-1 norm of all of the

groups.

Problem 1.7 (2 pts) Proximal gradient descent introduces an additional proximity term to minimize regularization loss in the proximity of weight parameters. The proximity term allows smoother convergence

of the overall objective.

Problem 1.8 (2 pts) Models equipped with early exits allows some inputs to be computed with only

part of the model, thus overcoming the issue of overfitting and overthinking.

Problem 1.9 (2 pts) When implementing quantization-aware training with STE, gradients are quantized

during backpropagation, ensuring that updates are consistent with the quantized weights.

Problem 1.10 (2 pts) Comparing to quantizing all the layers in a DNN to the same precision, mixedprecision quantization scheme can reach higher accuracy with a similar-sized model.

## 2 Lab 1: Sparse optimization of linear models (30 pts)

By now you have seen multiple ways to induce a sparse solution in the optimization process. This problem

will provide you some examples under linear regression setting so that you can compare the effectiveness

of different methods.

For this problem, consider the case where we are trying to find a sparse weight W

that can minimize L =

P

i

(XiW − yi)

2

. Specifically, we have Xi ∈ R

1×5

, W ∈ R

5×1 and ||W||0 ≤ 2.

For Problem (a) – (f), consider the case where we have 3 data points: (X1 = [1, −2, −1, −1, 1], y1 = 7);

(X2 = [2, −1, 2, 0, −2], y2 = 1); (X3 = [−1, 0, 2, 2, 1], y3 = 1). For stability the objective L should be

minimized through full-batch gradient descent, with initial weight W0

set to [0; 0; 0; 0; 0] and use learning

rate µ = 0.02 throughout the process. Please run gradient descent for 200 steps for all the following

problems.

You will need to use NumPy to finish this set of questions, please put all your code for this set of

questions into one python/notebook file and submit it on Sakai. Please include all your results, figures

and observations into your PDF report.

(a) (4 pts) Theoretical analysis: with learning rate µ, suppose the weight you have after step k is Wk

,

derive the symbolic formulation of weight Wk+1 after step k+1 of full-batch gradient descent with

Xi

, yi

, i ∈ {1, 2, 3}. (Hint: note the loss L we have is defined differently from standard MSE loss.)

(b) (3 pts) In Python, directly minimize the objective L without any sparsity-inducing regularization/constraint. Plot the value of log(L) vs. #steps throughout the training, and use another

figure to plot how the value of each element in W is changing throughout the training. From

your result, is W converging to an optimal solution? Is W converging to a sparse solution?

(c) (6 pts) Since we have the knowledge that the ground-truth weight should have ||W||0 ≤ 2, we

can apply projected gradient descent to enforce this sparse constraint. Redo the optimization

process in (b), this time prune the elements in W after every gradient descent step to ensure

||Wl

||0 ≤ 2. Plot the value of log(L) throughout the training, and use another figure to plot

the value of each element in W in each step. From your result, is W converging to an optimal

solution? Is W converging to a sparse solution?

(d) (5 pts) In this problem we apply ℓ1 regularization to induce the sparse solution. The minimization

objective therefore changes to L + λ||W||1. Please use full-batch gradient descent to minimize

this objective, with λ = {0.2, 0.5, 1.0, 2.0} respectively. For each case, plot the value of log(L)

throughout the training, and use another figure to plot the value of each element in W in each

step. From your result, comment on the convergence performance under different λ.

(e) (6 pts) Here we optimize the same objective as in (d), this time using proximal gradient update.

Recall that the proximal operator of the ℓ1 regularizer is the soft thresholding function. Set the

threshold in the soft thresholding function to {0.004, 0.01, 0.02, 0.04} respectively. Plot the value

of log(L) throughout the training, and use another figure to plot the value of each element in W

in each step. Compare the convergence performance with the results in (d). (Hint: Optimizing

L + λ||W||1 using gradient descent with learning rate µ should correspond to proximal gradient

update with threshold µλ)

(f) (6 pts) Trimmed ℓ1 (T ℓ1) regularizer is proposed to solve the “bias” problem of ℓ1. For simplicity

you may implement the T ℓ1 regularizer as applying a ℓ1 regularization with strength λ on the 3

elements of W with the smallest absolute value, with no penalty on other elements. Minimize

L+λT ℓ1(W) using proximal gradient update with λ = {1.0, 2.0, 5.0, 10.0} (correspond the soft

thresholding threshold {0.02, 0.04, 0.1, 0.2}). Plot the value of log(L) throughout the training,

and use another figure to plot the value of each element in W in each step. Comment on the

convergence comparison of the Trimmed ℓ1 and the ℓ1. Also compare the behavior of the early

steps (e.g. first 20) between the Trimmed ℓ1 and the iterative pruning.

## 3 Lab 2: Pruning ResNet-20 model (25 pts)

ResNet-20 is a popular convolutional neural network (CNN) architecture for image classification. Compared to early CNN designs such as VGG-16, ResNet-20 is much more compact. Thus, conducing the model

compression on ResNet-20 is more challenging.

This lab explores the element-wise pruning of ResNet-20 model on CIFAR-10 dataset. We will observe

the difference between single step pruning and iterative pruning, plus exploring different ways of setting

pruning threshold. Everything you need for this lab can be found in HW4.zip.

(a) (2 pts) In hw4.ipynb, run through the first three code block, report the accuracy of the floatingpoint pretrained model.

(b) (6 pts) Complete the implementation of pruning by percentage function in the notebook. Here

we determines the pruning threshold in each DNN layer by the ‘q-th percentile’ value in the

absolute value of layer’s weight element. Use the next block to call your implemented pruning

by percentage. Try pruning percentage q = 0.3, 0.5, 0.7. Report the test accuracy q. (Hint: You

need to reload the full model checkpoint before applying the prune function with a different q ).

(c) (6 pts) Fill in the finetune_after_prune function for pruned model finetuning. Make sure the

pruned away elements in previous step are kept as 0 throughout the finetuning process. Finetune

the pruned model with q=0.7 for 20 epochs with the provided training pipeline. Report the

best accuracy achieved during finetuning. Finish the code for sparsity evaluation to check if the

finetuned model preserves the sparsity.

(d) (5 pts) Implement iterative pruning. Instead of applying single step pruning before finetuning, try

iteratively increase the sparsity of the model before each epoch of finetuning. Linearly increase

the pruning percentage for 10 epochs until reaching 70% in the final epoch (prune (7 × e)%

before epoch e) then continue finetune for 10 epochs. Pruned weight can be recovered during

the iterative pruning process before the final pruning step. Compare performance with (c)

(e) (6 pts) Perform magnitude-based global iterative pruning. Previously we set the pruning threshold of each layer following the weight distribution of the layer and prune all layers to the same

sparsity. This will constrain the flexibility in the final sparsity pattern across layers. In this question, Fill in the global_prune_by_percentage function to perform a global ranking of the weight

magnitude from all the layers, and determine a single pruning threshold by percentage for all the

layers. Repeat iterative pruning to 70% sparsity, and report final accuracy and the percentage of

zeros in each layer.

## 4 Lab 3: Fixed-point quantization and finetuning (25 pts)

Besides pruning, fixed-point quantization is another important technique applied for deep neural network

compression. In this Lab, you will convert the ResNet-20 model we used in previous lab into a quantized

model, evaluate is performance and apply finetuning on the model.

(a) (10 pts) As is mentioned in lecture 15, to train a quantized model we need to use floatingpoint weight as trainable variable while use a straight-through estimator (STE) in forward and

backward pass to convert the weight into quantized value. Intuitively, the forward pass of STE

converts a float weight into fixed-point, while the backward pass passes the gradient straightly

through the quantizer to the float weight.

To start with, implement the STE forward function in FP_layers.py, so that it serves as a linear

quantizer with dynamic scaling, as introduced on page 9 of lecture 15. Please follow the comments in the code to figure out the expected functionality of each line. Take a screen shot of

the finished STE class and paste it into the report. Submission of the FP_layers.py file is not

required. (Hint: Please consider zeros in the weight as being pruned away, and build a mask to

ensure that STE is only applied on non-zero weight elements for quantization. )

(b) (2 pts) In hw4.ipynb, load pretrained ResNet-20 model, report the accuracy of the floating-point

pretrained model. Then set Nbits in the first line of block 4 to 6, 5, 4, 3, and 2 respectively, run

it and report the test accuracy you got. (Hint: In this block the line defining the ResNet model

(second line) will set the residual blocks in all three stages to Nbits fixed-point, while keeping

the first conv and final FC layer still as floating point.)

(c) (5 pts) With Nbits set to 4, 3, and 2 respectively, run code block 4 and 5 to finetune the quantized

model for 20 epochs. You do not need to change other parameter in the finetune function. For

each precision, report the highest testing accuracy you get during finetuning. Comment on the

relationship between precision and accuracy, and on the effectiveness of finetuning.

(d) (4 pts) In practice, we want to apply both pruning and quantization on the DNN model. Here we

explore how pruning will affect quantization performance. Please load the checkpoint of the 70%

sparsity model with the best accuracy from Lab 2, repeat the process in (c), report the accuracy

before and after finetuning, and discuss your observations comparing to (c)’s results.

(e) (4 pts) Symmetric quantization is a commonly used and hardware-friendly quantization approach. In symmetric quantization, the quantization levels are symmetric to zero. Implement

symmetric quantization in the STE class and repeat the process in (b). Compare and analyze the

performance of symmetric quantization and asymmetric quantization.