Description
1. Gradient descent. [5pts] We can get quite a bit of insight into the behavior of gradient descent by looking at how it behaves on quadratic functions. Suppose we are trying to optimize a quadratic function

$$C(\theta) = \frac{a_1}{2}(\theta_1 - r_1)^2 + \cdots + \frac{a_N}{2}(\theta_N - r_N)^2,$$

with each $a_i > 0$. We can exactly solve for the dynamics of gradient descent. In other words, we can find an exact formula for $\theta_i^{(t)}$, where $t$ is the number of gradient descent updates.
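As a concrete picture of these dynamics, here is a minimal NumPy sketch that runs gradient descent on a two-dimensional instance of this cost. The curvatures, targets, learning rate, and initialization below are arbitrary values chosen for illustration, not part of the assignment.

```python
import numpy as np

# Arbitrary illustrative values: per-coordinate curvatures a_i > 0, targets r_i,
# a learning rate alpha, and an initialization theta^(0).
a = np.array([1.0, 10.0])
r = np.array([2.0, -1.0])
alpha = 0.1
theta = np.zeros_like(r)

def cost(theta):
    return 0.5 * np.sum(a * (theta - r) ** 2)

for t in range(20):
    grad = a * (theta - r)               # dC/dtheta_i = a_i (theta_i - r_i)
    theta = theta - alpha * grad
    print(t + 1, theta - r, cost(theta)) # per-coordinate error and cost
```

Watching how each coordinate of the error behaves for different learning rates (try alpha = 0.25 here, which makes the high-curvature coordinate blow up) is a useful companion to the derivations in Parts (a)–(c).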
(a) [1pt] Derive the gradient descent update rule for each $\theta_i$ with learning rate $\alpha$. It should have the form

$$\theta_i^{(t+1)} = \cdots,$$

where the right-hand side is some function of the previous value $\theta_i^{(t)}$, as well as $r_i$, $a_i$, and $\alpha$. (It's an interesting and useful fact that the different $\theta_i$'s evolve independently, so we can analyze a single coordinate at a time.)
(b) [2pts] Now let's introduce the error $e_i^{(t)} = \theta_i^{(t)} - r_i$. Take your update rule from the previous part, and write a recurrence for the errors. It should have the form

$$e_i^{(t+1)} = \cdots,$$

where the right-hand side is some function of $e_i^{(t)}$, $a_i$, and $\alpha$.
(c) [1pt] Solve this recurrence to obtain an explicit formula for $e_i^{(t)}$ in terms of the initial error $e_i^{(0)}$. For what values of $\alpha$ is the procedure stable (the errors decay over time), and for what values is it unstable (the errors grow over time)?

Aside: your answer will indicate that large learning rates are more unstable than small ones, and that high-curvature dimensions are more unstable than low-curvature ones.
(d) [1pt] Using your answer for the previous part, write an explicit formula for the cost $C(\theta^{(t)})$ as a function of the initial values $\theta^{(0)}$. (You can write it as a summation over indices, i.e., you don't need to vectorize it.) As $t \to \infty$, which term comes to dominate?
Aside: you'll find that if you use the optimal $\alpha$, the asymptotic behavior roughly depends on the condition number

$$\kappa = \frac{\max_i a_i}{\min_i a_i}.$$

This supports the claim that narrow ravines are problematic for gradient descent.
(e) [Optional (not marked), and advanced] This part is optional, but you may find it interesting. We'll make use of eigendecompositions of symmetric matrices; see MIT OpenCourseWare for a refresher:
https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/positive-definite-matrices-and-applications/symmetric-matrices-and-positive-definiteness/
It turns out we've actually just analyzed the fully general quadratic case. I.e., suppose we try to minimize a cost function of the form

$$C(\theta) = \frac{1}{2}(\theta - r)^\top A(\theta - r),$$

where $A$ is a symmetric positive definite matrix, i.e. a symmetric matrix with all positive eigenvalues. (This is the general form for a quadratic function which curves upwards.) Determine the gradient descent update for $\theta$ in vectorized form. Then write a recurrence for the error vector $e = \theta - r$, similarly to Part (1b). It will have the form

$$e^{(t+1)} = B e^{(t)},$$

where $B$ is a symmetric matrix. Determine the eigenvectors and eigenvalues of $B$ in terms of the eigenvectors and eigenvalues of $A$, and use this to find an explicit form for $e^{(t)}$ and for $C(\theta^{(t)})$ in terms of $\theta^{(0)}$. The result will be closely related to your answer from Part (1d).
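If you attempt this part, a quick numerical experiment can confirm that each eigenmode of $A$ behaves like a single coordinate of the diagonal case. The sketch below is purely illustrative: the matrix $A$, the target $r$, the learning rate, and the initialization are arbitrary choices, and the printed per-mode ratios are what your recurrence from Part (1b) should predict.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary symmetric positive definite A and target r (illustrative only).
D = 4
Q = rng.normal(size=(D, D))
A = Q @ Q.T + D * np.eye(D)     # symmetric with strictly positive eigenvalues
r = rng.normal(size=D)
alpha = 0.05
theta = np.zeros(D)

lam, U = np.linalg.eigh(A)      # eigenvalues lam, eigenvectors in the columns of U

prev = U.T @ (theta - r)        # error expressed in A's eigenbasis
for t in range(5):
    grad = A @ (theta - r)      # gradient of 0.5 (theta - r)^T A (theta - r)
    theta = theta - alpha * grad
    cur = U.T @ (theta - r)
    print(cur / prev)           # per-mode shrink factor: constant across steps
    prev = cur

print(alpha, lam)  # compare the ratios above with a simple function of these
```

Each eigenmode shrinks by its own constant factor per step, which is exactly the scalar behavior analyzed in Parts (1b)–(1c).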
2. Dropout. [5pts] For this question, you may wish to review the properties of expectation and variance: https://metacademy.org/graphs/concepts/expectation_and_variance
Dropout has an interesting interpretation in the case of linear regression. Recall that the predictions are made stochastically as:

$$y = \sum_j m_j w_j x_j,$$
where the $m_j$'s are all i.i.d. (independent and identically distributed) Bernoulli random variables with expectation 1/2. (I.e., they are independent for every input dimension and every data point.) We would like to minimize the cost
$$E = \frac{1}{2N} \sum_{i=1}^{N} E\left[(y^{(i)} - t^{(i)})^2\right], \qquad (1)$$

where the expectation is with respect to the $m_j^{(i)}$'s.
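To get a feel for these stochastic predictions, here is a minimal Monte Carlo sketch. The input, weights, and number of samples are arbitrary values made up for illustration; the estimates it prints are what your closed-form answers to Part (a) below should match.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative input and weight vectors for a single data point
# (these values are not from the handout).
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 1.5, -0.7])

S = 200_000                                  # number of sampled dropout masks
m = rng.integers(0, 2, size=(S, x.size))     # m_j ~ Bernoulli(1/2), i.i.d.
y = (m * w * x).sum(axis=1)                  # stochastic predictions y

print("Monte Carlo estimate of E[y]  :", y.mean())
print("Monte Carlo estimate of Var[y]:", y.var())
```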
Now we show that this is equivalent to a regularized linear regression problem:
(a) [2pts] Find expressions for E[y] and Var[y] for a given x and w.
(b) [1pt] Determine $\tilde{w}_j$ as a function of $w_j$ such that

$$E[y] = \tilde{y} = \sum_j \tilde{w}_j x_j.$$

Here, $\tilde{y}$ can be thought of as (deterministic) predictions made by a different model.
(c) [2pts] Using the model from the previous part, show that the cost E (Eqn. 1) can be written as

$$E = \frac{1}{2N} \sum_{i=1}^{N} (\tilde{y}^{(i)} - t^{(i)})^2 + R(\tilde{w}_1, \ldots, \tilde{w}_D),$$

where $R$ is a function of the $\tilde{w}_j$'s which does not involve an expectation. I.e., give an expression for $R$. (Note that $R$ will depend on the data, so we call it a "data-dependent regularizer.")

Hint: write the cost in terms of the mean and variance formulas from part (a). For inspiration, you may wish to refer to the derivation of the bias/variance decomposition from Lecture 9.
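The bias/variance-style identity the hint alludes to, E[(y − t)^2] = (E[y] − t)^2 + Var[y], can be checked exactly for this model by averaging over all equally likely dropout masks. The example values of x, w, and t below are made up purely for illustration.

```python
import itertools
import numpy as np

# Arbitrary single training case (x, t) and weights w, for illustration only.
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 1.5, -0.7])
t = 1.0

# All 2^D dropout masks are equally likely, so averaging over them gives
# exact expectations (no sampling noise).
masks = np.array(list(itertools.product([0, 1], repeat=len(x))))
ys = (masks * w * x).sum(axis=1)              # the prediction y for every mask

lhs = np.mean((ys - t) ** 2)                  # E[(y - t)^2]
rhs = (np.mean(ys) - t) ** 2 + np.var(ys)     # (E[y] - t)^2 + Var[y]
print(lhs, rhs)                               # identical up to rounding error
```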