Large Scale Matrix Computation, Optimization and Learning (EE270) HW #2


1. Singular Value Decomposition (SVD) for Compression (20 pts)
Singular value decomposition (SVD) factorizes an $m \times n$ matrix $X$ as $X = U \Sigma V^T$, where $U \in \mathbb{R}^{m \times m}$ with $U^T U = U U^T = I$, $\Sigma \in \mathbb{R}^{m \times n}$ contains non-increasing non-negative values along its diagonal and zeros elsewhere, and $V \in \mathbb{R}^{n \times n}$ with $V^T V = V V^T = I$. Hence the matrix can be represented as $X = \sum_{i=1}^{d} \sigma_i u_i v_i^T$, where $u_i$ denotes the $i$-th column of $U$ and $v_i$ denotes the $i$-th column of $V$. Download the image [1] and load it as the matrix $X$ (in greyscale).
(a) Perform SVD on this matrix and zero out all but the top $k$ singular values to form an approximation $\tilde{X}$. Specifically, compute $\tilde{X} = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, display the resulting approximation $\tilde{X}$ as an image, and report $\|X - \tilde{X}\|_F / \|X\|_F$ for $k \in \{2, 10, 40\}$.
(b) How many numbers do you need to describe the approximation $\tilde{X} = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ for $k \in \{2, 10, 40\}$?
Hint: Matlab can load the greyscale image using
X = double(rgb2gray(imread('harvey-saturday-goes7am.jpg')))
and display the image using imagesc().
In Python you can use:
import numpy as np
from PIL import Image
X = np.asarray(Image.open('harvey.jpg').convert('L'), dtype=float)
[1] https://www.nasa.gov/sites/default/files/thumbnails/image/harvey-saturday-goes7am.jpg (NOAA's GOES-East satellite's image of Hurricane Harvey in the western Gulf of Mexico on Aug. 26, 2017. Credit: NASA/NOAA GOES Project)
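For reference, a minimal Python sketch of the rank-k approximation (assuming numpy is available and X has already been loaded as in the hint above; variable names are illustrative only):

import numpy as np

# X is the greyscale image matrix loaded as in the hint above
U, s, Vt = np.linalg.svd(X, full_matrices=False)     # thin SVD: X = U @ diag(s) @ Vt
for k in (2, 10, 40):
    Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # keep only the top-k singular triplets
    rel_err = np.linalg.norm(X - Xk, 'fro') / np.linalg.norm(X, 'fro')
    print(k, rel_err)

Each Xk can then be displayed as an image (e.g., with matplotlib's imshow in Python, or imagesc() in Matlab).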
2. Singular Value Decomposition and Subspaces (20 pts) Let A be an m × n singular
matrix of rank r with SVD
$$
A = U \Sigma V^T
= \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_m \end{bmatrix}
\begin{bmatrix}
\sigma_1 & & & & & \\
& \ddots & & & & \\
& & \sigma_r & & & \\
& & & 0 & & \\
& & & & \ddots & \\
& & & & & 0
\end{bmatrix}
\begin{bmatrix} \vec{v}_1^{\,T} \\ \vec{v}_2^{\,T} \\ \vdots \\ \vec{v}_n^{\,T} \end{bmatrix}
= \begin{bmatrix} \hat{U} & \tilde{U} \end{bmatrix}
\begin{bmatrix}
\sigma_1 & & & & & \\
& \ddots & & & & \\
& & \sigma_r & & & \\
& & & 0 & & \\
& & & & \ddots & \\
& & & & & 0
\end{bmatrix}
\begin{bmatrix} \hat{V}^T \\ \tilde{V}^T \end{bmatrix}
$$
where $\sigma_1 \geq \cdots \geq \sigma_r > 0$, $\hat{U}$ consists of the first $r$ columns of $U$, $\tilde{U}$ consists of the remaining $m - r$ columns of $U$, $\hat{V}$ consists of the first $r$ columns of $V$, and $\tilde{V}$ consists of the remaining $n - r$ columns of $V$. Give bases for the spaces $\mathrm{range}(A)$, $\mathrm{null}(A)$, $\mathrm{range}(A^T)$ and $\mathrm{null}(A^T)$ in terms of the components of the SVD of $A$, with a brief justification.
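If you want to check your answer numerically, the four blocks can be read off a full SVD, for example as in the following sketch (assuming numpy; the test matrix and the rank tolerance are placeholders):

import numpy as np

A = np.random.randn(7, 5) @ np.random.randn(5, 6)   # example m x n matrix with rank <= 5
U, s, Vt = np.linalg.svd(A, full_matrices=True)     # full SVD: U is m x m, Vt is n x n
r = int(np.sum(s > 1e-10 * s[0]))                   # numerical rank

U_hat, U_tilde = U[:, :r], U[:, r:]                 # first r / remaining m - r columns of U
V_hat, V_tilde = Vt[:r, :].T, Vt[r:, :].T           # first r / remaining n - r columns of V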
3. Least Squares (20 pts)
Consider the least squares problem $\min_x \|b - Ax\|_2$. Which of the following statements are necessarily true?
(a) If x is a solution to the least squares problem, then Ax = b.
(b) If $x$ is a solution to the least squares problem, then the residual vector $r = b - Ax$ is in the nullspace of $A^T$.
(c) The solution is unique.
(d) A solution may not exist.
(e) None of the above.
4. Binary Classification via Regression (20 pts)
Download the MNIST dataset from http://yann.lecun.com/exdb/mnist/.
In binary classification, we restrict $Y$ to take on only two values; suppose $Y \in \{0, 1\}$. Now let us use least squares linear regression for classification. Consider the classification problem of recognizing whether a digit is a 2 or not using linear regression. Here, let $Y = 1$ for all the 2's in the training set, and $Y = 0$ for all other digits.
Build a linear classifier by minimizing
$$\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} (y_i - w^T x_i)^2 + \lambda \|w\|_2^2$$
using the training set $\{(x_1, y_1), \ldots, (x_N, y_N)\}$. Use appropriate regularization ($\lambda > 0$) as needed. Let $x_i$ be the vector of pixel values of the $i$-th image.
For the purpose of classification, we can label a digit $z$ as a 2 if $w^T z$ is larger than some threshold.
(a) Based on your training set, choose a reasonable threshold for classification. What is
your 0/1 loss on the training set? What is your square loss on the training set?
(b) On the test set, what is the 0/1 loss? What is the square loss?
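A possible starting point for the regularized least-squares fit (a sketch, assuming numpy and that X_train, y_train, X_test, y_test have already been built from the MNIST files with y = 1 for 2's and y = 0 otherwise; the value of lambda and the 0.5 threshold are placeholders to be tuned):

import numpy as np

lam = 1e-2                                  # placeholder regularization strength
N, d = X_train.shape
# minimizer of (1/N) sum_i (y_i - w^T x_i)^2 + lam * ||w||_2^2
w = np.linalg.solve(X_train.T @ X_train + N * lam * np.eye(d), X_train.T @ y_train)

def losses(X, y, threshold=0.5):
    scores = X @ w
    square_loss = np.mean((y - scores) ** 2)
    zero_one_loss = np.mean((scores > threshold) != (y == 1))
    return zero_one_loss, square_loss

print('train:', losses(X_train, y_train))
print('test:', losses(X_test, y_test))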
5. Leave-One-Out Cross Validation (LOOCV) (20 pts)
Assume that there are $n$ training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where each input data point $x_i$ has $d$ real-valued features. The goal of regression is to learn to predict $y_i$ from $x_i$. The linear regression model assumes that the output $y$ is a linear combination of the input features plus Gaussian noise, with weights given by $w$.
The maximum likelihood estimate of the model parameters $w$ (which also happens to minimize the sum of squared prediction errors) is given by the normal equations we saw in class: $\hat{w} = (X^T X)^{-1} X^T y$. Here, the rows of the matrix $X$ are the training examples $x_1^T, \ldots, x_n^T$. Define $\hat{y}$ to be the vector of predictions using $\hat{w}$ if we were to plug in the original training set $X$:
$$\hat{y} = X \hat{w} = X (X^T X)^{-1} X^T y = H y,$$
where we define $H = X (X^T X)^{-1} X^T$ ($H$ is often called the hat matrix). Show that $H$ is a projection matrix.
As mentioned above, $\hat{w}$ also minimizes the sum of squared errors:
$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
The Leave-One-Out Cross Validation score is defined as
$$\mathrm{LOOCV} = \sum_{i=1}^{n} \left( y_i - \hat{y}^{(-i)}_i \right)^2,$$
where $\hat{y}^{(-i)}$ is the estimator of $y$ after removing the $i$-th observation from the training set, i.e., it minimizes
$$\sum_{j \neq i} \left( y_j - \hat{y}^{(-i)}_j \right)^2.$$
(a) What is the time complexity of computing the LOOCV score naively? (The naive
algorithm is to loop through each point, performing a regression on the n − 1 remaining
points at each iteration.)
Hint: The complexity of matrix inversion using classical methods is $O(k^3)$ for a $k \times k$ matrix.
(b) Write $\hat{y}_i$ in terms of $H$ and $y$.
(c) Show that $\hat{y}^{(-i)}$ is also the estimator which minimizes SSE for $Z$, where
$$Z_j = \begin{cases} y_j, & j \neq i \\ \hat{y}^{(-i)}_i, & j = i. \end{cases}$$
(d) Write $\hat{y}^{(-i)}_i$ in terms of $H$ and $Z$. By definition, $\hat{y}^{(-i)}_i = Z_i$, but give an answer that is analogous to (b).
(e) Show that $\hat{y}_i - \hat{y}^{(-i)}_i = H_{ii} y_i - H_{ii} \hat{y}^{(-i)}_i$, where $H_{ii}$ denotes the $i$-th element along the diagonal of $H$.
(f) Show that
$$\mathrm{LOOCV} = \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - H_{ii}} \right)^2.$$
What is the algorithmic complexity of computing the LOOCV score using this formula?
Note: We see from this formula that the diagonal elements of $H$ somehow indicate the impact that each particular observation has on the result of the regression.
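The formula in part (f) can be sanity-checked numerically against the naive leave-one-out loop, for example with the following sketch (assuming numpy; the data here are synthetic placeholders):

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

H = X @ np.linalg.solve(X.T @ X, X.T)              # hat matrix H = X (X^T X)^{-1} X^T
y_hat = H @ y

naive = 0.0                                        # refit n times, leaving one point out each time
for i in range(n):
    mask = np.arange(n) != i
    w_i = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    naive += (y[i] - X[i] @ w_i) ** 2

shortcut = np.sum(((y - y_hat) / (1.0 - np.diag(H))) ** 2)   # the formula from part (f)
print(naive, shortcut)                             # the two numbers should agree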
6. Sketching Least Squares (20 pts) Here, we will consider the empirical performance of
random sampling and random projection algorithms for approximating least-squares.
Let $A$ be an $n \times d$ matrix with $n \gg d$, let $b$ be an $n$-vector, and consider approximating the solution to $\min_x \|Ax - b\|_2$. Generate the matrices $A$ from one of three different classes of distributions introduced below:
• Generate a matrix $A$ from the multivariate normal $N(1_d, \Sigma)$, where the $(i, j)$-th element of $\Sigma$ is $\Sigma_{ij} = 2 \times 0.5^{|i - j|}$. (Referred to as GA data.)
• Generate a matrix $A$ from the multivariate $t$-distribution with 3 degrees of freedom and covariance matrix $\Sigma$ as before. (Referred to as T3 data.)
• Generate a matrix $A$ from the multivariate $t$-distribution with 1 degree of freedom and covariance matrix $\Sigma$ as before. (Referred to as T1 data.)
To start, consider matrices of size n × d equal to 500 × 50.
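One possible way to generate the three data sets is sketched below (assuming numpy; the multivariate t samples are built as a Gaussian divided by an independent chi-square factor, the mean vector 1_d for the t data is an assumption made here to match the Gaussian case, and the right-hand side b is generated randomly since the problem leaves it unspecified):

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 50
Sigma = 2.0 * 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))   # Sigma_ij = 2 * 0.5^|i-j|

def ga_data():                                    # GA data: multivariate normal N(1_d, Sigma)
    return rng.multivariate_normal(np.ones(d), Sigma, size=n)

def t_data(df):                                   # T3 (df = 3) or T1 (df = 1) data
    z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    u = rng.chisquare(df, size=(n, 1))
    return np.ones(d) + z / np.sqrt(u / df)

A_ga, A_t3, A_t1 = ga_data(), t_data(3), t_data(1)
b = rng.standard_normal(n)                        # an arbitrary right-hand side for the experiments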
(a) First, for each matrix, consider approximating the solution by randomly sampling a small
number r of rows/elements (i.e., constraints of the overconstrained least-squares problem) in one of three ways: uniformly at random; according to an importance sampling
distribution that is proportional to the Euclidean norms squared of the rows of A; and
according to an importance sampling distribution that is proportional to the leverage
scores of A. In each case, plot the error as a function of the number r of samples, paying
particular attention to two regimes: r = d, d + 1, d + 2, . . . , 2d; and r = 2d, 3d, . . .. Show
that the behavior of these three procedures is most similar for GA data, intermediate for T3 data, and most different for T1 data; explain why, and describe the similarities and differences.
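For part (a), a possible starting point (a sketch, assuming numpy and the A_ga and b generated in the sketch above; sampled rows are rescaled by 1/sqrt(r p_i), and the error reported is the distance to the exact solution, which is one reasonable choice of error measure):

import numpy as np

def sampled_solution(A, b, r, probs, rng):
    idx = rng.choice(A.shape[0], size=r, replace=True, p=probs)
    scale = 1.0 / np.sqrt(r * probs[idx])          # importance-sampling rescaling
    return np.linalg.lstsq(A[idx] * scale[:, None], b[idx] * scale, rcond=None)[0]

rng = np.random.default_rng(0)
A = A_ga                                           # repeat for A_t3 and A_t1
n, d = A.shape
x_opt = np.linalg.lstsq(A, b, rcond=None)[0]

p_unif = np.full(n, 1.0 / n)                       # uniform sampling
p_norm = np.sum(A ** 2, axis=1)                    # squared row norms
p_norm = p_norm / p_norm.sum()
Q = np.linalg.qr(A)[0]
p_lev = np.sum(Q ** 2, axis=1)                     # leverage scores (squared row norms of Q)
p_lev = p_lev / p_lev.sum()

for r in list(range(d, 2 * d + 1, 5)) + [3 * d, 4 * d, 5 * d]:
    errs = [np.linalg.norm(sampled_solution(A, b, r, p, rng) - x_opt)
            for p in (p_unif, p_norm, p_lev)]
    print(r, errs)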
(b) Next, for each matrix, consider approximating the solution by randomly projecting
rows/elements (i.e., constraints of the overconstrained least-squares problem) in one
of two ways: a random projection matrix in which each entry is i.i.d. {±1}, with appropriate variance; and a random projection matrix, in which each entry is i.i.d. Gaussian,
with appropriate variance. In each case, plot the error as a function of the number of
samples, i.e., dimensions on which the data are projected, paying particular attention
to two regimes: r = d, d + 1, d + 2, . . . , 2d; and r = 2d, 3d, . . .. Describe and explain
similarities and differences between these two procedures for GA data, T1 data, and T3
data.
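For part (b), a sketch of the two dense projections (assuming numpy and the A, b, d, x_opt from the sketches above; both sketching matrices are scaled by 1/sqrt(r) so that E[S^T S] = I):

import numpy as np

def projected_solution(A, b, r, kind, rng):
    n = A.shape[0]
    if kind == 'sign':
        S = rng.choice([-1.0, 1.0], size=(r, n)) / np.sqrt(r)   # i.i.d. +/-1 entries
    else:
        S = rng.standard_normal((r, n)) / np.sqrt(r)            # i.i.d. Gaussian entries
    return np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]

rng = np.random.default_rng(0)
for r in list(range(d, 2 * d + 1, 5)) + [3 * d, 4 * d, 5 * d]:
    e_sign = np.linalg.norm(projected_solution(A, b, r, 'sign', rng) - x_opt)
    e_gauss = np.linalg.norm(projected_solution(A, b, r, 'gauss', rng) - x_opt)
    print(r, e_sign, e_gauss)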
(c) Finally, for each matrix, consider approximating the solution by randomly projecting
rows/elements (i.e., constraints of the overconstrained least-squares problem) with sparse
projection matrices. In particular, consider a random projection matrix, in which each
entry is i.i.d. either 0 or Gaussian, where the probability of 0 is q and the probability
of Gaussian is 1 − q. (Remember to rescale the variance of the Gaussian appropriately, depending on q.) For q varying from 0 to 1, in sufficiently small increments, plot the
error for solving the least-squares problem. Describe and explain how this varies as a
function of the number of samples, i.e., dimensions on which the data are projected,
paying particular attention to two regimes: r = d, d+ 1, d+ 2, . . . , 2d; and r = 2d, 3d, . . ..
Describe and explain similarities and differences between these three procedures for GA
data, T1 data, and T3 data.
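For part (c), a sketch of the sparse projection (assuming numpy and the A, b, d, x_opt from the sketches above; each entry is zero with probability q and otherwise Gaussian with variance 1/(r(1 − q)), so that E[S^T S] = I):

import numpy as np

def sparse_projected_solution(A, b, r, q, rng):
    n = A.shape[0]
    mask = rng.random((r, n)) >= q                       # entry is nonzero with probability 1 - q
    S = mask * rng.standard_normal((r, n)) / np.sqrt(r * (1.0 - q))
    return np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]

rng = np.random.default_rng(0)
r = 2 * d                                                # one sketch size; repeat for other values of r
for q in np.arange(0.0, 1.0, 0.05):
    x_q = sparse_projected_solution(A, b, r, q, rng)
    print(q, np.linalg.norm(x_q - x_opt))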
Next, we describe how these behave for larger problems. To do so, we will work with
dense random projection algorithms in which the projection matrix consists of i.i.d. {±1}
random variables, with appropriate variance. Fix a value of d, and let n increase from
roughly 2d to roughly 100d. The exact value of d and n will depend on your machine,
your computational environment, etc., so that you get reasonably good low-precision
approximate solutions to the original least-squares problem. (You should expect d ≈ 500 to work; and if you can't do the full plot to 100d, don't worry, since the point is to get large enough to illustrate the phenomena below.)
(d) Plot the running time of the random projection algorithm versus the running time of
solving the problem with a call to a QR decomposition routine provided by your system
as well as the running time of solving the problem with a call to an SVD routine provided
by your system. Illustrate that for smaller problems the random projection methods are not faster, but that for larger problems the random projection methods are slightly faster and/or can be used to solve larger problems than QR or SVD. Explain why this is the case.
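For part (d), a timing sketch (assuming numpy; the dense ±1 sketch is formed explicitly, which is memory-hungry for the largest n, so shrink r or n if needed; np.linalg.lstsq calls an SVD-based LAPACK routine, and the QR path uses a reduced QR factorization):

import numpy as np, time

rng = np.random.default_rng(0)
d = 500
for n in (2 * d, 10 * d, 50 * d, 100 * d):               # increase as far as your machine allows
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)
    r = 2 * d                                            # sketch size (a placeholder choice)

    t0 = time.perf_counter()
    S = rng.choice([-1.0, 1.0], size=(r, n)) / np.sqrt(r)
    x_proj = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
    t_proj = time.perf_counter() - t0

    t0 = time.perf_counter()
    Q, R = np.linalg.qr(A)                               # reduced QR
    x_qr = np.linalg.solve(R, Q.T @ b)
    t_qr = time.perf_counter() - t0

    t0 = time.perf_counter()
    x_svd = np.linalg.lstsq(A, b, rcond=None)[0]         # SVD-based solve
    t_svd = time.perf_counter() - t0

    print(n, t_proj, t_qr, t_svd)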