Description
1 Weight Decay
Here, we develop further intuition about how adding weight decay influences the solution space. For a refresher on generalization, please refer to https://csc413-2020.github.io/assets/readings/L07.pdf. Consider the following linear regression model with weight decay:
J(\hat{w}) = \frac{1}{2n} \lVert X\hat{w} - t \rVert_2^2 + \frac{\lambda}{2} \hat{w}^\top \hat{w}

where X ∈ R^{n×d}, t ∈ R^n, and ŵ ∈ R^d. Here n is the number of data points and d is the data dimension; X is the design matrix from HW1.
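As a quick sanity check, the objective can be written out directly in code. Below is a minimal NumPy sketch of J(ŵ), not part of the assignment; the dimensions and λ chosen here are arbitrary illustrative assumptions.

```python
# Minimal sketch of the weight-decay objective above; dimensions are arbitrary.
import numpy as np

def weight_decay_objective(w_hat, X, t, lam):
    """J(w) = ||X w - t||_2^2 / (2n) + (lam / 2) * w^T w."""
    n = X.shape[0]
    return np.sum((X @ w_hat - t) ** 2) / (2 * n) + 0.5 * lam * (w_hat @ w_hat)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # n = 8 data points, d = 3 dimensions
t = rng.normal(size=8)
print(weight_decay_objective(np.zeros(3), X, t, lam=0.1))
```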
1.1 Underparameterized Model [0pt]
First consider the underparameterized case d ≤ n. Write down the solution obtained by gradient descent, assuming training converges. Is the solution unique? If the solution involves inverting a matrix, explain why that matrix is invertible.
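If you want to check your derived expression numerically, the sketch below compares plain gradient descent on J(ŵ) with a candidate closed form obtained from the first-order condition. The random data, dimensions, step size, and iteration count are illustrative assumptions, not the official solution.

```python
# Hedged sketch: gradient descent on the weight-decay objective (d <= n) vs. a
# candidate closed form (X^T X / n + lam I)^{-1} X^T t / n. Assumed setup only.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
t = rng.normal(size=n)

# Candidate closed form from setting the gradient of J to zero.
w_closed = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ t / n)

# Plain gradient descent on J(w) = ||Xw - t||^2 / (2n) + (lam/2) w^T w.
w = np.zeros(d)
lr = 0.1
for _ in range(20000):
    grad = X.T @ (X @ w - t) / n + lam * w
    w -= lr * grad

print(np.allclose(w, w_closed, atol=1e-6))  # expect True if both agree
```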
1.2 Overparameterized Model
1.2.1 Warmup: Visualizing Weight Decay [1pt]
Now consider the overparameterized d > n case. We start with a 2D example from HW1: a single training example x1 = [2, 1] with target t1 = 2. First, 1) draw the solution space of the squared error on a 2D plane. Then, 2) draw the contour plot of the weight decay term \frac{\lambda}{2} \hat{w}^\top \hat{w}.
Include the plot in the report. Also indicate on the plot where the gradient descent solutions lie with and without weight decay. (Precise drawings are not required for the full mark.)
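One possible starting point for the figure (an assumption about tooling, not a required template) is the matplotlib sketch below; the gradient descent solutions with and without weight decay still need to be marked on top of it.

```python
# Hedged plotting sketch for the 2D warmup: x1 = [2, 1], t1 = 2.
import numpy as np
import matplotlib.pyplot as plt

lam = 1.0  # any positive value gives the same circular contour shape
w1, w2 = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-2, 3, 200))

# Contours of the weight decay term (lam / 2) * (w1^2 + w2^2): circles.
plt.contour(w1, w2, 0.5 * lam * (w1 ** 2 + w2 ** 2), levels=10, cmap="viridis")

# Zero-error solution space of the squared error: all w with 2*w1 + w2 = 2.
w1_line = np.linspace(-2, 3, 50)
plt.plot(w1_line, 2 - 2 * w1_line, "k-", label="2*w1 + w2 = 2")

plt.gca().set_aspect("equal")
plt.xlabel("w1")
plt.ylabel("w2")
plt.legend()
plt.show()
```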
1.2.2 Gradient Descent and Weight Decay [0pt]
Derive the solution obtained by gradient descent at convergence in the overparameterized case. Is this the same as the solution from Homework 1, 3.4.1?
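A numerical sanity check for whichever expression you derive is sketched below: it runs gradient descent from a zero initialization on a random d > n problem, compares against the candidate closed form ŵ = Xᵀ(XXᵀ + nλI)⁻¹t written from the first-order condition, and measures how much of the result lies outside the row space of X. The setup is an illustrative assumption, not the official answer.

```python
# Hedged sketch for the overparameterized case (d > n); assumed setup only.
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 5, 20, 0.05
X = rng.normal(size=(n, d))
t = rng.normal(size=n)

# Candidate closed form from the first-order condition, using the identity
# (X^T X + n*lam*I)^{-1} X^T = X^T (X X^T + n*lam*I)^{-1}.
w_closed = X.T @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), t)

# Gradient descent from a zero initialization.
w = np.zeros(d)
lr = 0.05
for _ in range(50000):
    w -= lr * (X.T @ (X @ w - t) / n + lam * w)

print(np.allclose(w, w_closed, atol=1e-6))      # expect True if both agree
P = X.T @ np.linalg.pinv(X @ X.T) @ X           # projector onto the row space of X
print(np.linalg.norm(w - P @ w))                # component outside the row space
```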
1.3 Adaptive Optimizers and Weight Decay [1pt]
In HW2 Section 1.2, we saw that per-parameter adaptive methods such as AdaGrad and Adam do not converge to the least-norm solution, because they move out of the row space of the design matrix X.
Assume AdaGrad converges to an optimum of the training objective. Does weight decay help AdaGrad converge to a solution in the row space? Give a brief justification.
(Hint: build intuition from the 2-D toy example.)
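To build that intuition numerically, one option (purely illustrative: the starting point off the row space, learning rate, and iteration count are assumptions) is to run AdaGrad on the toy example with and without weight decay and inspect the component of the final iterate orthogonal to the row space span{x1}. The justification itself should still be given analytically.

```python
# Hedged AdaGrad sketch on the 2D toy example x1 = [2, 1], t1 = 2.
import numpy as np

x, t = np.array([2.0, 1.0]), 2.0
x_perp = np.array([-1.0, 2.0]) / np.sqrt(5.0)   # orthogonal to the row space span{x}

for lam in [0.0, 0.1]:
    w = np.array([1.0, 1.0])                    # assumed start, off the row space
    hist, lr, eps = np.zeros(2), 0.1, 1e-8
    for _ in range(10000):
        grad = (x @ w - t) * x + lam * w        # grad of (1/2)(x^T w - t)^2 + (lam/2) w^T w
        hist += grad ** 2
        w -= lr * grad / (np.sqrt(hist) + eps)
    print(f"lam={lam}: w={w}, off-row-space component={w @ x_perp:.4f}")
```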
2 Ensembles and Bias-variance Decomposition
In the prerequisite course CSC311 (https://amfarahmand.github.io/csc311/lectures/lec04.pdf), we saw the bias-variance decomposition. The following question uses the same notation as taught