## Description

1. Gradient descent. [5pts] We can get quite a bit of insight into the behavior of gradient descent by looking at how it behaves on quadratic functions. Suppose we are trying to optimize a quadratic function

$$C(\theta) = \frac{a_1}{2}(\theta_1 - r_1)^2 + \cdots + \frac{a_N}{2}(\theta_N - r_N)^2,$$

with each $a_i > 0$. We can exactly solve for the dynamics of gradient descent. In other words, we can find an exact formula for $\theta_i^{(t)}$, where $t$ is the number of gradient descent updates.
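(If you'd like a numerical sanity check for the derivations below, a few lines of NumPy suffice. The sketch here simply runs gradient descent on a cost of this form; the curvatures $a_i$, targets $r_i$, learning rate, and starting point are made-up illustration values.)

```python
import numpy as np

# Made-up example: curvatures a_i > 0 and targets r_i.
a = np.array([10.0, 1.0, 0.1])
r = np.array([1.0, -2.0, 0.5])

def cost(theta):
    # C(theta) = sum_i (a_i / 2) * (theta_i - r_i)^2
    return 0.5 * np.sum(a * (theta - r) ** 2)

def grad(theta):
    # partial derivative w.r.t. theta_i is a_i * (theta_i - r_i)
    return a * (theta - r)

alpha = 0.15            # learning rate (arbitrary choice)
theta = np.zeros(3)     # initial point theta^(0)

for t in range(1, 21):
    theta = theta - alpha * grad(theta)
    print(t, theta, cost(theta))
```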

(a) [1pt] Derive the gradient descent update rule for each $\theta_i$ with learning rate $\alpha$. It should have the form

$$\theta_i^{(t+1)} = \cdots,$$

where the right-hand side is some function of the previous value $\theta_i^{(t)}$, as well as $r_i$, $a_i$, and $\alpha$. (It's an interesting and useful fact that the different $\theta_i$'s evolve independently, so we can analyze a single coordinate at a time.)

(b) [2pts] Now let's introduce the error $e_i^{(t)} = \theta_i^{(t)} - r_i$. Take your update rule from the previous part, and write a recurrence for the errors. It should have the form

$$e_i^{(t+1)} = \cdots,$$

where the right-hand side is some function of $e_i^{(t)}$, $a_i$, and $\alpha$.

(c) [1pt] Solve this recurrence to obtain an explicit formula for $e_i^{(t)}$ in terms of the initial error $e_i^{(0)}$. For what values of $\alpha$ is the procedure stable (the errors decay over time), and for what values is it unstable (the errors grow over time)?

Aside: your answer will indicate that large learning rates are more unstable than small ones, and that high curvature dimensions are more unstable than low curvature ones.
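(You can also see the stable and unstable regimes empirically by sweeping the learning rate on a single coordinate. The curvature, target, starting point, and learning rates in this sketch are arbitrary illustration values.)

```python
a, r = 2.0, 0.0      # curvature and target for one coordinate (arbitrary values)
theta0 = 1.0         # initial value, so the initial error is 1
T = 50               # number of gradient descent steps

for alpha in [0.1, 0.5, 0.9, 1.1]:
    theta = theta0
    for _ in range(T):
        theta -= alpha * a * (theta - r)   # one gradient descent update
    print(f"alpha = {alpha:.2f}: final |error| = {abs(theta - r):.3e}")
```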

(d) [1pt] Using your answer for the previous part, write an explicit formula for the cost $C(\theta^{(t)})$ as a function of the initial values $\theta^{(0)}$. (You can write it as a summation over indices, i.e., you don't need to vectorize it.) As $t \to \infty$, which term comes to dominate?



Aside: you'll find that if you use the optimal $\alpha$, the asymptotic behavior roughly depends on the condition number

$$\kappa = \frac{\max_i a_i}{\min_i a_i}.$$

This supports the claim that narrow ravines are problematic for gradient descent.
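(A two-dimensional simulation gives a feel for this: give the two coordinates very different curvatures and watch which of the two terms of $C(\theta^{(t)})$ is left over after many steps. The curvatures, starting point, and learning rate below are arbitrary illustration values.)

```python
import numpy as np

a = np.array([100.0, 1.0])    # badly conditioned problem: kappa = 100
r = np.array([0.0, 0.0])
theta = np.array([1.0, 1.0])

alpha = 0.015                 # small enough to keep the steep coordinate stable

for t in range(1, 201):
    theta = theta - alpha * a * (theta - r)       # gradient descent step
    if t % 50 == 0:
        terms = 0.5 * a * (theta - r) ** 2        # the two terms of C(theta)
        print(t, terms)
```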

(e) [Optional (not marked), and advanced] This part is optional, but you may find it interesting. We'll make use of eigendecompositions of symmetric matrices; see MIT OpenCourseware for a refresher: https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/positive-definite-matrices-and-applications/symmetric-matrices-and-positive-definiteness/

It turns out we've actually just analyzed the fully general quadratic case. I.e., suppose we try to minimize a cost function of the form

$$C(\theta) = \frac{1}{2}(\theta - r)^\top A(\theta - r),$$

where $A$ is a symmetric positive definite matrix, i.e. a symmetric matrix with all positive eigenvalues. (This is the general form for a quadratic function which curves upwards.) Determine the gradient descent update for $\theta$ in vectorized form. Then write a recurrence for the error vector $e = \theta - r$, similarly to Part (1b). It will have the form

$$e^{(t+1)} = B e^{(t)},$$

where $B$ is a symmetric matrix. Determine the eigenvectors and eigenvalues of $B$ in terms of the eigenvectors and eigenvalues of $A$, and use this to find an explicit form for $e^{(t)}$ and for $C(\theta^{(t)})$ in terms of $\theta^{(0)}$. The result will be closely related to your answer from Part (1d).
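(If you attempt this part, a numerical cross-check is easy to set up: build a random symmetric positive definite $A$, run the vectorized update, and compare against whatever closed form you derive using np.linalg.eigh(A). A minimal sketch with an arbitrary problem size and starting point:)

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4

# A random symmetric positive definite matrix and an arbitrary target r.
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)
r = rng.standard_normal(N)

def cost(theta):
    d = theta - r
    return 0.5 * d @ A @ d

alpha = 1.0 / np.linalg.eigvalsh(A).max()   # a conservative step size
theta = np.zeros(N)

for _ in range(200):
    theta = theta - alpha * (A @ (theta - r))   # vectorized gradient step

print("cost after 200 steps:", cost(theta))

# Eigendecomposition of A, handy when checking your formula for e^(t).
eigvals, eigvecs = np.linalg.eigh(A)
print("eigenvalues of A:", eigvals)
```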

2. Dropout. [5pts] For this question, you may wish to review the properties of expectation and variance: https://metacademy.org/graphs/concepts/expectation_and_variance

Dropout has an interesting interpretation in the case of linear regression. Recall that the predictions are made stochastically as:

$$y = \sum_j m_j w_j x_j,$$

where the $m_j$'s are all i.i.d. (independent and identically distributed) Bernoulli random variables with expectation 1/2. (I.e., they are independent for every input dimension and every data point.) We would like to minimize the cost

$$E = \frac{1}{2N} \sum_{i=1}^{N} E\!\left[(y^{(i)} - t^{(i)})^2\right], \qquad (1)$$

where the expectation is with respect to the $m_j^{(i)}$'s.
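(Once you start manipulating this expectation, a brute-force Monte Carlo estimate of Eqn. 1 makes a useful reference point. The sketch below uses a small made-up dataset and weights; the sample count is arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: N examples with D input dimensions, weights, and targets.
N, D = 5, 3
X = rng.standard_normal((N, D))
w = rng.standard_normal(D)
t = rng.standard_normal(N)

S = 100_000   # number of sampled dropout masks

total = 0.0
for _ in range(S):
    m = rng.integers(0, 2, size=(N, D))   # i.i.d. Bernoulli(1/2) masks
    y = np.sum(m * w * X, axis=1)         # y^(i) = sum_j m_j^(i) w_j x_j^(i)
    total += 0.5 * np.mean((y - t) ** 2)  # (1/2N) sum_i (y^(i) - t^(i))^2

print("Monte Carlo estimate of E:", total / S)
```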

Now we show that this is equivalent to a regularized linear regression problem:

(a) [2pts] Find expressions for E[y] and Var[y] for a given x and w.


(b) [1pt] Determine $\tilde{w}_j$ as a function of $w_j$ such that

$$E[y] = \tilde{y} = \sum_j \tilde{w}_j x_j.$$

Here, $\tilde{y}$ can be thought of as (deterministic) predictions made by a different model.

(c) [2pts] Using the model from the previous part, show that the cost $E$ (Eqn. 1) can be written as

$$E = \frac{1}{2N} \sum_{i=1}^{N} \left(\tilde{y}^{(i)} - t^{(i)}\right)^2 + R(\tilde{w}_1, \ldots, \tilde{w}_D),$$

where $R$ is a function of the $\tilde{w}_j$'s which does not involve an expectation. I.e., give an expression for $R$. (Note that $R$ will depend on the data, so we call it a "data-dependent regularizer.")

Hint: write the cost in terms of the mean and variance formulas from part (a). For inspiration, you may wish to refer to the derivation of the bias/variance decomposition from Lecture 9.
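(Once you have candidate expressions for $\tilde{w}_j$ and $R$, one way to gain confidence in them is to compare the regularized cost above against a Monte Carlo estimate of Eqn. 1 on random data. Below is a rough sketch of such a harness; `w_tilde_from` and `R` are placeholders for your own answers to parts (b) and (c), and the data are made up.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: N examples with D input dimensions, weights, and targets.
N, D = 5, 3
X = rng.standard_normal((N, D))
w = rng.standard_normal(D)
t = rng.standard_normal(N)

def w_tilde_from(w):
    # TODO: your expression for the w-tilde_j from part (b).
    raise NotImplementedError

def R(w_tilde, X):
    # TODO: your data-dependent regularizer from part (c).
    raise NotImplementedError

def monte_carlo_E(w, X, t, S=100_000):
    # Direct estimate of Eqn. 1 by sampling Bernoulli(1/2) dropout masks.
    total = 0.0
    for _ in range(S):
        m = rng.integers(0, 2, size=X.shape)
        y = np.sum(m * w * X, axis=1)
        total += 0.5 * np.mean((y - t) ** 2)
    return total / S

w_tilde = w_tilde_from(w)
y_tilde = X @ w_tilde
regularized_E = 0.5 * np.mean((y_tilde - t) ** 2) + R(w_tilde, X)

# The two numbers should agree up to Monte Carlo noise if your R is correct.
print(regularized_E, monte_carlo_E(w, X, t))
```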
