## Description

1. The squared-error cost function f(w) = ||y − Xw||2

2 = (y − Xw)

T

(y − Xw) may be

rewritten as a perfect square in the form

f(w) = (w − wLS)

TXTX(w − wLS) + c

where wLS = (XTX)

−1XT y and c = y

T y − y

TX(XTX)

−1XT y. This assumes the

n-by-p (p < n) matrix X is full rank. f(w) is called a “quadratic form” in w since it

is a quadratic function of w.

a) Prove that the minimum value of f(w) = c when w = wLS and all other w result

in higher values of f(w).

b) Suppose y =

1

1/2

1

0

and the 4-by-2 X = UΣV

T has singular value decomposition U =

1 0

0 1

0 0

0 0

, Σ =

”

1 0

0 1/2

#

, and V = I. Sketch a contour plot of f(w)

in the w1-w2 plane.

c) Suppose y =

1

1/5

1

0

, U =

1 0

0 1

0 0

0 0

, Σ =

”

1 0

0 1/5

#

, and V = I. Sketch a

contour plot of f(w) in the w1-w2 plane. How do the singular values of X affect

the shape of the contours?

d) In this case assume y =

√

2

0

1

0

, U =

1 0

0 1

0 0

0 0

, Σ =

”

1 0

0 1/2

#

, and V =

√

1

2

”

1 1

1 −1

#

. Sketch a contour plot of f(w) in the w1-w2 plane. How do the

right singular vectors of X affect the contours?

e) Sketch the gradient of f(w) at w = wo on the contour plot for the previous case

when

i. wo =

”

2

2

#

ii. wo =

”

0

2

#

iii. wo =

”

2

1

#

2. The provided script will compute a specified number of iterations of the gradient descent algorithm and help you display the path taken by the weights in the gradient

descent iteration superimposed on a contour plot. Assume y =

√

2

0

1

0

, the 4-by-2

X = UΣV

T has singular value decomposition U =

1 0

0 1

0 0

0 0

, Σ =

”

1 0

0 1/2

#

, and

V = √

1

2

”

1 1

1 −1

#

. Complete 20 iterations of gradient descent in each case specified

below.

Include the plots you generate below with your submission.

a) What is the maximum value for the step size τ that will guarantee convergence?

b) Start gradient descent from the point w =

”

1.5

−0.5

#

and use a step size of τ = 0.5.

How do you explain the trajectory the weights take toward the optimum, e.g., why

is it shaped this way? What would the trajectory be if you started from the point

w =

”

0

0

#

? From w =

”

0

2

#

?

c) Start gradient descent from the point w =

”

1.5

−0.5

#

and use a step size of τ = 2.5.

Complete 20 iterations. What happens and why?

d) Now change Σ =

”

1 0

0 1/4

#

, start from w =

”

1.5

−0.5

#

and use τ = 0.5. What

happens to the cost function? How does the change in the cost function lead to

a change in the trajectory of the gradient descent weights? What would happen

if you further decreased the smaller singular value?

e) Discuss how changing the ratio of the singular values changes the shape of the cost

function and how that might affect the number of iterations it takes for gradient

descent to get close to the optimum.