1 Optimization
This week, we will continue investigating the properties of optimization algorithms, focusing on
stochastic gradient descent and adaptive gradient descent methods. For a refresher on optimization,
refer to: https://csc413-uoft.github.io/2021/assets/slides/lec03.pdf.
We will continue using the linear regression model established in Homework 1. Given $n$ pairs of input data with $d$ features and scalar labels $(x_i, t_i) \in \mathbb{R}^d \times \mathbb{R}$, we want to find a linear model $f(x) = \hat{w}^\top x$ with $\hat{w} \in \mathbb{R}^d$ such that the squared error on training data is minimized. Given a data matrix $X \in \mathbb{R}^{n \times d}$ and corresponding labels $t \in \mathbb{R}^n$, the objective function is defined as:

$$\mathcal{L} = \frac{1}{n} \lVert X\hat{w} - t \rVert_2^2 \tag{1}$$
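As a quick sanity check of the notation, here is a minimal NumPy sketch of Equation 1; the data sizes and values below are arbitrary placeholders, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5                       # placeholder sizes
X = rng.standard_normal((n, d))    # data matrix, one example per row
t = rng.standard_normal(n)         # scalar labels
w_hat = rng.standard_normal(d)     # candidate weights

# L = (1/n) * ||X w_hat - t||_2^2  (Equation 1)
loss = np.mean((X @ w_hat - t) ** 2)
print(loss)
```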
1.1 Mini-Batch Stochastic Gradient Descent (SGD)
Mini-batch SGD performs optimization by taking the average gradient over a mini-batch, denoted $B \in \mathbb{R}^{b \times d}$, where $1 < b \ll n$. Each training example in the mini-batch, denoted $x_j \in B$, is
randomly sampled without replacement from the data matrix $X$. Assume that $X$ is full rank. Where $\mathcal{L}$ denotes the loss on $x_j$, the update for a single step of mini-batch SGD at time $t$ with scalar learning rate $\eta$ is:

$$w_{t+1} \leftarrow w_t - \frac{\eta}{b} \sum_{x_j \in B} \nabla_{w_t} \mathcal{L}(x_j, w_t) \tag{2}$$
Mini-batch SGD iterates by randomly drawing mini-batches and updating model weights using the
above equation until convergence is reached.
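For concreteness, a minimal NumPy sketch of one epoch of this procedure is given below; it assumes the per-example loss is $\mathcal{L}(x_j, w) = (x_j^\top w - t_j)^2$, and the batch size and learning rate are placeholder choices.

```python
import numpy as np

def minibatch_sgd_epoch(X, t, w, eta=0.1, b=4, rng=None):
    """One pass over the data using the update in Equation 2.

    Mini-batches are drawn without replacement by shuffling the rows of X.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    order = rng.permutation(X.shape[0])
    for start in range(0, len(order), b):
        idx = order[start:start + b]
        Xb, tb = X[idx], t[idx]
        # average over the batch of per-example gradients 2 * x_j * (x_j^T w - t_j)
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - tb)
        w = w - eta * grad
    return w
```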
1.1.1 Minimum Norm Solution [2pt]
Recall Question 3.3 from Homework 1. For an overparameterized linear model, gradient descent starting from zero initialization finds the unique minimum norm solution $w^*$ such that $Xw^* = t$. Let $w_0 = 0$ and $d > n$. Assume mini-batch SGD also converges to a solution $\hat{w}$ such that $X\hat{w} = t$. Show that the mini-batch SGD solution is identical to the minimum norm solution $w^*$ obtained by gradient descent, i.e., $\hat{w} = w^*$.

Hint: Is $x_j$ or $B$ contained in the span of $X$? Do the update steps of mini-batch SGD ever leave the span of $X$?
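This question asks for a proof, but a quick numerical sanity check can help build intuition for the hint; the data below are arbitrary placeholders and the pseudo-inverse is used as the minimum norm reference.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 3, 10                          # overparameterized: d > n
X = rng.standard_normal((n, d))
t = rng.standard_normal(n)

w_star = np.linalg.pinv(X) @ t        # minimum norm solution satisfying X w* = t

w = np.zeros(d)                       # w0 = 0
for _ in range(20000):                # SGD with batch size 1
    j = rng.integers(n)
    w = w - 0.01 * 2 * X[j] * (X[j] @ w - t[j])

# every update adds a multiple of some row of X, so w stays in the row span of X
print(np.allclose(w, w_star, atol=1e-3))
```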
1.2 Adaptive Methods
We now consider the behavior of adaptive gradient descent methods. In particular, we will investigate the RMSProp method. Let $w_i$ denote the $i$-th parameter. A scalar learning rate $\eta$ is used. At time $t$, for parameter $i$, the update step for RMSProp is given by:

$$w_{i,t+1} = w_{i,t} - \frac{\eta}{\sqrt{v_{i,t}} + \epsilon} \nabla_{w_{i,t}} \mathcal{L}(w_{i,t}) \tag{3}$$

$$v_{i,t} = \beta v_{i,t-1} + (1 - \beta) \left( \nabla_{w_{i,t}} \mathcal{L}(w_{i,t}) \right)^2 \tag{4}$$
We begin the iteration at $t = 0$ and set $v_{i,-1} = 0$. The term $\epsilon$ is a fixed small scalar used for numerical stability. The decay parameter $\beta$ is typically set such that $\beta \geq 0.9$. Intuitively, RMSProp adapts a separate learning rate in each dimension to efficiently move through poorly conditioned curvature (see lecture slides/notes).
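A minimal NumPy sketch of Equations 3 and 4 applied to the linear regression loss from Section 1 is shown below; the step count, $\eta$, $\beta$, and $\epsilon$ are placeholder choices.

```python
import numpy as np

def rmsprop(X, t, w0, eta=0.1, beta=0.9, eps=1e-8, steps=500):
    """Full-batch RMSProp on L = (1/n) * ||Xw - t||_2^2 (Equations 3 and 4)."""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)                          # v_{i,-1} = 0
    for _ in range(steps):
        grad = 2.0 / len(t) * X.T @ (X @ w - t)   # gradient of the loss w.r.t. w
        v = beta * v + (1 - beta) * grad ** 2     # per-coordinate moving average of squared gradients
        w = w - eta * grad / (np.sqrt(v) + eps)   # per-coordinate adapted step
    return w
```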
1.2.1 Minimum Norm Solution [1pt]
Consider the overparameterized linear model (d > n) for the loss function defined in Section 1.
Assume the RMSProp optimizer converges to a solution. Provide a proof or counterexample for
whether RMSProp always obtains the minimum norm solution.
Hint: Compute a simple 2D case. Let $x_1 = [2, 1]$, $w_0 = [0, 0]$, $t = [2]$.
1.2.2 [0pt]
Consider the result from the previous section. Does this result hold true for other adaptive methods
(Adagrad, Adam) in general? Why might making learning rates independent per dimension be
desirable?
2 Gradient-based Hyper-parameter Optimization
In this problem, we will implement a simple toy example of gradient-based hyper-parameter optimization, introduced in Lecture 3 (slide 14).
Often in practice, hyper-parameters are chosen by trial-and-error based on a model evaluation
criterion. Instead, gradient-based hyper-parameter optimization computes the gradient of the evaluation criterion w.r.t. the hyper-parameters and uses this gradient to directly optimize for the best set of
hyper-parameters. For this problem, we will optimize for the learning rate of gradient descent in a
regularized linear regression problem.
Specifically, given $n$ pairs of input data with $d$ features and scalar labels $(x_i, t_i) \in \mathbb{R}^d \times \mathbb{R}$, we wish to find a linear model $f(x) = \hat{w}^\top x$ with $\hat{w} \in \mathbb{R}^d$ and an L2 penalty, $\tilde{\lambda} \lVert \hat{w} \rVert_2^2$, that minimizes the squared error of prediction on the training samples. Here $\tilde{\lambda}$ is a hyper-parameter that modulates the impact of the L2 regularization on the loss function. Using the concise notation for the data matrix $X \in \mathbb{R}^{n \times d}$ and the corresponding label vector $t \in \mathbb{R}^n$, the regularized squared error loss can be written as:

$$\tilde{\mathcal{L}} = \frac{1}{n} \lVert X\hat{w} - t \rVert_2^2 + \tilde{\lambda} \lVert \hat{w} \rVert_2^2 .$$
Starting from initial weight parameters $w_0$, gradient descent (GD) updates $w_0$ with a learning rate $\eta$ for $t$ iterations. Let's denote the weights after $t$ iterations of GD as $w_t$, the loss as $\tilde{\mathcal{L}}_t$, and its gradient with respect to the weights as $\nabla_{w_t} \tilde{\mathcal{L}}_t$. The goal is to find the optimal learning rate by following the gradient of $\tilde{\mathcal{L}}_t$ w.r.t. the learning rate $\eta$.
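As a concrete illustration, the sketch below unrolls a few GD steps in PyTorch and back-propagates the final loss to the learning rate; the data, $\tilde{\lambda}$, the number of unrolled steps, and the initial $\eta$ are all placeholder choices.

```python
import torch

torch.manual_seed(0)
n, d = 5, 3
X = torch.randn(n, d)                  # placeholder data
t = torch.randn(n)
lam = 0.1                              # placeholder lambda-tilde

def loss_fn(w):
    return ((X @ w - t) ** 2).mean() + lam * (w ** 2).sum()

eta = torch.tensor(0.05, requires_grad=True)   # learning rate as a differentiable quantity
w = torch.zeros(d, requires_grad=True)         # w0

# Unroll t = 2 GD steps, keeping the graph so gradients can flow back to eta.
for _ in range(2):
    grad_w, = torch.autograd.grad(loss_fn(w), w, create_graph=True)
    w = w - eta * grad_w

loss_fn(w).backward()                  # back-propagate the final loss through the unrolled steps
print(eta.grad)                        # gradient of the final loss w.r.t. eta
```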
2.1 Computation Graph
2.1.1 [0.5pt]
Consider a case of 2 GD iterations. Draw the computation graph to obtain the final loss $\tilde{\mathcal{L}}_2$ in terms of $w_0$, $\nabla_{w_0} \tilde{\mathcal{L}}_0$, $\tilde{\mathcal{L}}_0$, $w_1$, $\tilde{\mathcal{L}}_1$, $\nabla_{w_1} \tilde{\mathcal{L}}_1$, $w_2$, $\tilde{\lambda}$, and $\eta$.
2.1.2 [0.5pt]
Then, consider a case of $t$ iterations of GD. What is the memory complexity of forward propagation in terms of $t$? What is the memory complexity of using standard back-propagation to compute the gradient w.r.t. the learning rate, $\nabla_\eta \tilde{\mathcal{L}}_t$, in terms of $t$?

Hint: Express your answers in big-O notation in terms of $t$.
2.1.3 [0pt]
Explain one potential problem with applying gradient-based hyper-parameter optimization in more realistic examples, where models often take many iterations to converge.
2.2 Optimal Learning Rates
In this section, we will take a closer look at the gradient w.r.t. the learning rate. To simplify the computation, consider an unregularized loss function of the form $\mathcal{L} = \frac{1}{n} \lVert X\hat{w} - t \rVert_2^2$. Let's start with the case of only one GD iteration, where GD updates the model weights from $w_0$ to $w_1$.
2.2.1 [1pt]
Write down the expression for $w_1$ in terms of $w_0$, $\eta$, $t$, and $X$. Then use the expression to derive the loss $\mathcal{L}_1$ in terms of $\eta$.

Hint: If the expression gets too messy, introduce a constant vector $a = Xw_0 - t$.
2.2.2 [0pt]
Determine whether $\mathcal{L}_1$ is convex w.r.t. the learning rate $\eta$.

Hint: A function is convex if its second-order derivative is non-negative everywhere.
2.2.3 [1pt]
Write down the derivative of $\mathcal{L}_1$ w.r.t. $\eta$, and use it to find the optimal learning rate $\eta^*$ that minimizes the loss after one GD iteration. Show your work.
2.3 Weight Decay and L2 Regularization
Although well studied in statistics, L2 regularization is usually replaced with explicit weight decay
in modern neural network architectures:
$$w_{i+1} = (1 - \lambda) w_i - \eta \nabla \mathcal{L}_i(X) \tag{5}$$
In this question you will compare regularized regression of the form $\tilde{\mathcal{L}} = \frac{1}{n} \lVert X\hat{w} - t \rVert_2^2 + \tilde{\lambda} \lVert \hat{w} \rVert_2^2$ with the unregularized loss, $\mathcal{L} = \frac{1}{n} \lVert X\hat{w} - t \rVert_2^2$, accompanied by weight decay (Equation 5).
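To make the comparison concrete, here is a minimal sketch of the two update rules, written generically in terms of a gradient function `grad_L` for the unregularized loss; the function names are placeholders for illustration.

```python
def gd_step_l2(w, grad_L, eta, lam_tilde):
    # One GD step on L~ = L + lam_tilde * ||w||_2^2: the penalty enters through the gradient.
    return w - eta * (grad_L(w) + 2.0 * lam_tilde * w)

def gd_step_weight_decay(w, grad_L, eta, lam):
    # One step of Equation 5: shrink the weights by (1 - lam), then take an unregularized gradient step.
    return (1.0 - lam) * w - eta * grad_L(w)
```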
2.3.1 [0.5pt]
Write down two expressions for $w_1$ in terms of $w_0$, $\eta$, $t$, $\lambda$, $\tilde{\lambda}$, and $X$: the first using $\tilde{\mathcal{L}}$, the second using $\mathcal{L}$ with weight decay.
2.3.2 [0.5pt]
How can you express $\tilde{\lambda}$ (corresponding to the L2-regularized loss) so that it is equivalent to $\lambda$ (corresponding to weight decay)?

Hint: Think about how you can express $\tilde{\lambda}$ in terms of $\lambda$ and another hyper-parameter.
2.3.3 [0pt]
Adaptive gradient update methods like RMSProp (Equations 3 and 4) modulate the learning rate for each weight individually. Can you describe how L2 regularization differs from weight decay when adaptive gradient methods are used? In practice, it has been shown that weight decay is more successful than L2 regularization for adaptive gradient methods.
3 Convolutional Neural Networks
The last set of questions aims to build basic familiarity with convolutional neural networks (CNNs).
3.1 Convolutional Filters [0.5pt]
Given the input matrix I and filter J shown below, compute I ∗ J, the output of the convolution
operation (as defined in lecture 4). Assume zero padding is used such that the input and output
are of the same dimension. What feature does this convolutional filter detect?
$$I = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix} \qquad J = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix} \qquad I * J = \begin{bmatrix} ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \end{bmatrix}$$
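A minimal NumPy sketch of a zero-padded "same" filtering operation is given below for checking your by-hand computation. Note that it slides the filter without flipping it (i.e. cross-correlation), which may or may not match the definition used in lecture, so verify the convention before relying on it; the inputs are placeholders.

```python
import numpy as np

def filter2d_same(image, kernel):
    """Apply an odd-sized k x k filter with zero padding so the output matches the input size.

    The kernel is not flipped (cross-correlation); flip it first if the lecture's
    definition of convolution requires the flip.
    """
    k = kernel.shape[0]
    padded = np.pad(image, k // 2)            # zero padding on all sides
    out = np.zeros_like(image, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + k, c:c + k] * kernel)
    return out
```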
3.2 Size of Conv Nets [1pt]
CNNs provide several advantages over fully connected neural networks (FCNNs) when applied to image data. In particular, FCNNs do not scale well to high-dimensional image data, which you will demonstrate below. Consider the following CNN architecture on the left:
The input image has dimension 32 × 32 and is grey-scale (one channel). For ease of computation, assume all convolutional layers only have 1 output channel, and use 3 × 3 kernels. Assume
zero padding is used in convolutional layers such that the output dimension is equal to the input
dimension. Each max pooling layer has a filter size of 2 × 2 and a stride of 2.
We consider an alternative architecture, shown on the right, which replaces convolutional layers
with fully connected (FC) layers in an otherwise identical architecture. For both the CNN architecture and the FCNN architecture, compute the total number of neurons in the network, and the
total number of trainable parameters. You should report four numbers in total. Finally, name one
disadvantage of having more trainable parameters.
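As a bookkeeping aid (not a substitute for showing your work), the standard counting rules can be coded directly. The helpers below assume one bias per output channel or unit (drop the `+ 1` terms if the assignment ignores biases), and the example calls use hypothetical layers rather than the ones in the figure.

```python
def conv_params(k, c_in, c_out):
    # k x k convolution: each output channel has k*k*c_in weights plus one bias.
    return (k * k * c_in + 1) * c_out

def fc_params(n_in, n_out):
    # Fully connected layer: one weight per input-output pair plus one bias per output.
    return (n_in + 1) * n_out

print(conv_params(3, 1, 1))    # 10 parameters for a 3x3, 1-channel-to-1-channel conv
print(fc_params(32 * 32, 1))   # 1025 parameters for a hypothetical 1024-to-1 FC layer
```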
3.3 Receptive Fields [0.5pt]
The receptive field of a neuron in a CNN is the area of the image input that can affect the neuron
(i.e. the area a neuron can ‘see’). For example, in the first layer of the previous architecture,
one neuron is computed using an input of size 3 × 3 (the filter size), so its receptive field
is of size 3 × 3. However, as we go deeper into the CNN, the receptive field increases. The
receptive field of a neuron after the first max pooling is 4 × 4, and the receptive field of a neuron
after the second convolutional layer is 8 × 8. Some helpful resources for visualizing receptive fields can be found at https://distill.pub/2019/computing-receptive-fields/ and https://github.com/vdumoulin/conv_arithmetic (see "A guide to convolution arithmetic for deep learning", https://arxiv.org/abs/1603.07285).
In the previous architecture, suppose we replaced the 3 × 3 filters with 5 × 5 filters in the convolutional layers, while keeping max-pooling, stride, padding, etc. constant. What is the receptive field of a neuron after the second convolutional layer? List two other things that can affect the size of the receptive field of a neuron.
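The recursion described in the Distill article above can also be coded up as a check on your by-hand answer; this sketch assumes stride-1 "same"-padded convolutions and the pooling settings described earlier, and the layer stack in the example is only a placeholder.

```python
def receptive_field(layers):
    """Receptive field after a stack of (kernel_size, stride) layers.

    r is the receptive field size; j is the cumulative stride (jump) of the feature map.
    """
    r, j = 1, 1
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

# conv 3x3 (stride 1), 2x2 max pool (stride 2), conv 3x3 (stride 1)
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # 8, matching the 8 x 8 stated above
```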