Description
1 Trading off Resources in Neural Net Training
1.1 Effect of batch size
When training neural networks, it is important to select appropriate learning hyperparameters such
as learning rate, batch size, and the optimizer for training. In this question, we will investigate the
effect of how varying these quantities will affect the validation loss during training.
1.1.1 Batch size vs. learning rate
Batch size affects the stochasticity in optimization, and therefore affects the choice of learning
rate. We demonstrate this via a simple model called the Noisy Quadratic Model (NQM). Despite its simplicity, the NQM captures many essential features in realistic neural network training [Zhang et al., 2019].
For simplicity, we only consider the scalar version of the NQM. We have the quadratic loss
L(w) = 1
2
aw2
, where a > 0 and w ∈ R is the weight that we would like to optimize. Assume that
we only have access to a noisy version of the gradient — each time when we make a query for the
gradient, we obtain g(w), which is the true gradient ∇L(w) with additive Guassian noise:
g(w) = ∇L(w) + ϵ, ϵ ∼ N (0, σ2
).
SGD updates: w ← w − ηg(w).
One way to reduce noise in the gradient is to use minibatch training. Let B be the batch size,
and denote the minibatch gradient as gB(w):
gB(w) = 1
B
X
B
i=1
gi(w), where gi(w) = ∇L(w) + ϵi
, ϵi
i.i.d. ∼ N (0, σ2
).
Mini-batch SGD updates: w ← w − ηgB(w).
[0.5pt] As batch size increases, how do you expect the optimal learning rate to change? Briefly
explain in 2-3 sentences.
(Hint: Think about how the minibatch gradient noise change with B.)
2
CSC413/2516 Winter 2022 with Professor Jimmy Ba and Professor Bo Wang Homework 3
Figure 1: Illustrations of the typical relationship between training steps and the batch size for
reaching a certain validation loss under various architectures (from [Shallue et al., 2018]). Left:
fully-connected models with various sizes. Right: Convolutional models with different widths.
Learning rate and other related hyperparameters are tuned for each point on the curve.
1.1.2 Training steps vs. batch size
For most of neural network training in the real-world applications, we often observe the relationship
of training steps and batch size for reaching a certain validation loss as illustrated in Figure 1. Such
a relationship can be observed across various architectures.
(a) [0.5pt] For the three points (A, B, C) on Figure 1, which one has the most efficient batch size (in
terms of best resource and training time trade-off)? Assume that you have access to scalable
(but not free) compute such that minibatches are parallelized efficiently. Briefly explain in
1-2 sentences.
(b) [0.5pt] Figure 1 demonstrates that there are often two regimes in neural network training: the noise
dominated regime and the curvature dominated regime. In the noise dominated regime,
the bottleneck for optimization is that there exists a large amount of gradient noise. In
the curvature dominated regime, the bottleneck of optimization is the ill-conditioned loss
landscape. For points A and B on Figure 1, which regimes do they belong to? Fill each of the
blanks with one best suited option. Options: noise dominated / curvature dominated.
Point A: Regime: . Point B: Regime: .
1.1.3 Batch size, Optimizer, Normalization, Learning Rate
In this part, we will develop further intuitions about the different optimization regimes. When we
encounter optimization difficulties in minimizing a training loss, there are several hyperparameters
we can tune to accelerate the training non-trivially. In contrast to Figure 1, we now focus on
one typical training curve Figure 2 Left where the x-axis represents the number of epochs trained.
It is a training curve from ResNet paper [He et al., 2016], and A is the curvature dominated
regime and B is the noise dominated regime. These terms are explained in the previous
question. Note that the learning rate schedule plot is also shown for completeness and explaining
the phenomenons that could happen in the given training curve. However, you should not worry
3
CSC413/2516 Winter 2022 with Professor Jimmy Ba and Professor Bo Wang Homework 3
about it when answering this question since your goal is to simply pick the solutions to curvature
dominated regime and noise dominated regime from following options.
Figure 2: Left: The cyan curve shows the training curve of ResNet-18 on Imagenet. The red
curve trains ResNet-34. [He et al., 2016] Regime A is dominated by the ill-conditioned landscape.
Regime B is dominated by the large amount of gradient noise. Right: Corresponding learning rate
schedule used for training ResNet.
I Use an adaptive, 2nd, or higher order optimizer
II (II+) Increase the batch size, (II-) Decrease the batch size
III (III+) Increase the learning rate, (III-) Decrease the learning rate
IV Use a normalization technique such as batch norm
(a) [1pt] Regime A is dominated by curvature in Figure 2. Which of the above options will accelerate
the training by addressing the ill-conditioned curvature.
(Select all that apply from I, II+, II-, III+, III-, IV)
(b) [1pt] Regime B is dominated by the noise in Figure 2. Which of the above options will accelerate
the training by addressing the large amount of gradient noise?
(Select all that apply from I, II+, II-, III+, III-, IV)
1.2 Model size, dataset size and compute
We have seen in the previous section that batch size and learning rate are important hyperparameter
to help with optimization. Besides efficiently minimizing the training loss, we are also interested
in the test loss. Recently, researchers have observed an intriguing relationship between the test
loss and hyperparameters such as the model size, dataset size and the amount of compute used.
4
CSC413/2516 Winter 2022 with Professor Jimmy Ba and Professor Bo Wang Homework 3
Figure 3: Test loss of language models of different sizes, plotted against the amount of compute
(in petaflop/s-days, similar to Fig 2 in [Kaplan et al., 2020]). You can think of the total compute
used, shown in x-axis, as number of training steps × number of parameters × constants.
[Kaplan et al., 2020] argues for a “Pareto frontier” of optimal allocation of compute, in terms of
model size, training steps, to reach a target test loss.
We explore this relationship for neural language models in this section. A similar phenomenon
has also been observed in computer vision models. In the same spirit as [Kaplan et al., 2020], we
plot the test loss of two models in Figure 3. It compares the test loss vs compute of the trained
neural language models with the different number of parameters and training steps.
Note: You can think of the total compute used, shown in x-axis, as number of training steps ×
number of parameters × constants.
(a) [1pt] There are two language models shown in Figure 3: Model A in green and Model B in purple.
We also highlight an intersection point “X” that denotes a target test loss achieved by both
models using the same total amount of compute. (1) Which of the two models has more
parameters? Why? (2) At the intersection “X”, which of the two models has been training
for more iterations/updates? Why?
Explain your answers in no more than 3 sentences.
(b) [1pt] Suppose you want to train a language model to reach a desired test loss.
In what situation, will you choose training one over the other (big/small model) if either can
give you the same desired test loss using the same total compute? Give a brief explanation.
For example, if you have an urgent deadline coming up, which one will require more wallclock
time to reach “X”? You can think of the total compute used, shown in x-axis, as number of
training steps×number of parameters×constants.
2 Generalization Error of Linear Regression
Today’s deep learning practices may seem more like alchemy than science, but, we can still obtain
useful insight about deep neural networks by studying their linear counterparts. In Homework 1,
5
CSC413/2516 Winter 2022 with Professor Jimmy Ba and Professor Bo Wang Homework 3
we asked you to derive the training loss of a noisy regression problem. Here we will learn to derive
the exact test loss, so we can concretely reason about the model’s generalization performance in
terms of the dataset size and the model size. Many recent works has shown empirically modern
deep learning models behave similarly to linear models under gradient descent [Lee et al., 2019].
Recall the linear regression model from homework 1:
wˆ = argmin
w∈Rd
1
n
Xn
i=1
(w⊤xi − ti)
2
. (2.1)
Denote X = [x1; x2; …xn] ∈ R
n×d as the training data matrix (design matrix), and t ∈ R
n as the
corresponding label vector. When X has full-rank, the solution of the above problem is given as
wˆ =
(
(X⊤X)
−1X⊤t, n > d,
X⊤(XX⊤)
−1
t, n < d.
(2.2)
We restrict ourselves to a linear teacher model with Gaussian features, i.e.,
ti = w⊤
∗ xi + εi
, xi
i.i.d. ∼ N (0, Id), (2.3)
where εi
is i.i.d. label noise with mean 0 and variance σ
2
, and w∗ ∈ R
d
is the true signal that does
not depend on x, ε. To further simplify the computation, we consider a Bayesian setting and place
an isotropic prior on the true signal4
: E[w∗w⊤
∗
] = 1
d
Id.
Our goal is to compute the generalization error (test loss) of the linear regression model:
R(wˆ ) = Ex˜,ε,w∗
(w⊤
∗ x˜ − wˆ
⊤x˜)
2
, (2.4)
where the test data is drawn from the same distribution: x˜ ∼ N (0, Id).
2.1 Bias-variance decomposition
(Review the concept on Bias-Variance Decomposition from CSC311 https://www.cs.toronto.
edu/~rgrosse/courses/csc311_f21/) We separate the generalization error into two terms: the
bias term, which measures how well we are learning the true parameters w∗, and the variance term,
which is due to “overfitting” the noise in the labels ε. For the following two questions, expand the
square in (2.4) and simplify the expressions using the expectation over x˜, ε, w∗.
2.1.1 [0.5pt]
When n > d (underparameterized regime), show that the test loss can be written as
R(wˆ ) = 0
|{z}
bias, B(wˆ )
+ σ
2 Tr
(X⊤X)
−1
| {z }
variance, V(wˆ )
. (2.5)
Hint: use the i.i.d. property of the training data and label noise: E[xx⊤] = Id, E[εε⊤] = σ
2
In,
where ε ∈ R
n
, [ε]i = εi.
4This does not mean that w∗ is random at training time – we take this expectation for the test loss.
6
CSC413/2516 Winter 2022 with Professor Jimmy Ba and Professor Bo Wang Homework 3
2.1.2 [0pt]
When n < d (overparameterized regime), show that the test loss can be written as
R(wˆ ) = 1
d
Tr
Id − X⊤(XX⊤)
−1X
| {z }
bias, B(wˆ )
+ σ
2 Tr
(XX⊤)
−1
| {z }
variance, V(wˆ )
. (2.6)
Hint: for the bias term, use the isotropic prior: E[w∗w⊤
∗
] = 1
d
Id.
2.2 Deriving the exact expressions
To derive the precise expressions of the error, we take expectation over the design matrix X (denote
the expected risk as E[R(wˆ )]). You may use the following facts:
• When d > n, Tr
Id − X⊤(XX⊤)
−1X
= d − n.
• Given Gaussian random matrix G ∈ R
m×p
, where [G]ij ∼ N (0, 1), we have
E
Tr
(G⊤G)
−1
=
p
m − p − 1
, if m > p + 1. (2.7)
Bonus: argue why this is the case, using properties of the χ
2
-distribution.
2.2.1 [1pt]
Simplify the expression of E[R(wˆ )] for both the under- and over-parameterized case. Your final
result should only involve n, d, σ, and you may ignore the regime n − 1 ≤ d ≤ n + 1.
2.2.2 Double descent [1pt]
Answer based on the formulas: (1) under what conditions (on n, d, σ) is it possible for the model
to achieve perfect generalization, i.e., R(wˆ ) = 0? (2) Does adding more training examples always
help generalization? Why? Briefly explain your answers in no more than 3 sentences.
2.2.3 [0pt]
Plot the expression you derived in the previous question with the following parameters: fix the
number of features d = 500, and vary training set size n from 100 to 1000 (make sure you exclude
the regime n − 1 ≤ d ≤ n + 1). Include your figures for σ = 0 (noiseless) and σ = 0.2 (noisy).
Answer the following questions based on the figure:
• What difference do you observe between the noiseless and noisy setting?
• Does adding more training data (increasing n) always lead to better test performance? If
not, under what conditions (on n, d, σ) is it beneficial? And when does it hurt?
2.3 Ridge regularization
To reduce overfitting to the label noise, we now consider regularizing the ℓ2-norm of the parameters.
For λ > 0, the regularized objective is given as
wˆ λ = argmin
w∈Rd
1
n
Xn
i=1
(w⊤xi − ti)
2 + λ∥w∥
2
2
, (2.8)