Large Scale Matrix Computation, Optimization and Learning (EE270) HW #1


1. Linear Algebra (15 pts)
Are the following statements true or false? If true, prove it; if false, show a counterexample.
(a) The inverse of a symmetric matrix is itself symmetric.
(b) All 2 × 2 orthogonal matrices have one of the following forms:
$$\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} \cos\theta & \sin\theta \\ \sin\theta & -\cos\theta \end{pmatrix}.$$
(c) Let
$$A = \begin{pmatrix} -8 & -1 & -6 \\ -3 & -5 & -7 \\ -4 & -9 & -2 \end{pmatrix}.$$
$A$ can be written as $A = CC^T$ for some matrix $C$.
2. Divide and Conquer Matrix Multiplication (15 pts)
If $A$ is a matrix, then $A^2 = AA$ is the square of $A$.
(a) Show that five multiplications are sufficient to compute the square of a 2 × 2 matrix
$$A = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix},$$
where $a_1, a_2, a_3, a_4$ are scalars.
(b) Generalize the formula in part (a) to a 2 × 2 block matrix
$$A = \begin{pmatrix} A_1 & A_2 \\ A_3 & A_4 \end{pmatrix},$$
where $A_1, A_2, A_3, A_4$ are arbitrary matrices.
(c) Instead of using the classical matrix multiplication (three-loop) algorithm for computing $A^2$, we may apply the block formula you derived in (b) to reduce $2n \times 2n$ problems to several $n \times n$ computations, which can be tackled with classical matrix multiplication. Compare the total number of arithmetic operations. Generate random $2n \times 2n$ matrices $A$ and plot the wall-clock time of the classical matrix multiplication algorithm and of the algorithm using the formula in (b) to compute $A^2$, for $n = 4, \ldots, 10000$ (or as large as your system memory allows). You can use standard packages for matrix multiplication, e.g., numpy.matmul. (A minimal timing sketch is provided after part (d).)
(d) Show that if you have an algorithm for squaring an $n \times n$ matrix in $O(n^c)$ time, then you can use it to multiply any two arbitrary $n \times n$ matrices in $O(n^c)$ time. [Hint: Consider multiplying two matrices $A$ and $B$. Can you define a matrix whose square contains $AB$?]
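For part (c), a minimal timing harness is sketched below, assuming Python with NumPy and matplotlib. It only times the classical numpy.matmul squaring; your own routine implementing the part-(b) block formula (not shown here) should be timed inside the same loop for the comparison.

import time
import numpy as np
import matplotlib.pyplot as plt

sizes = [8, 64, 512, 2048, 4096]        # matrix dimension 2n; grow as memory allows
classical_times = []

for m in sizes:
    A = np.random.randn(m, m)
    t0 = time.perf_counter()
    np.matmul(A, A)                     # classical squaring of A
    classical_times.append(time.perf_counter() - t0)
    # Time your part-(b) block routine on the same A here and record it as well.

plt.loglog(sizes, classical_times, marker="o", label="classical numpy.matmul")
plt.xlabel("matrix size 2n")
plt.ylabel("wall-clock time (s)")
plt.legend()
plt.show()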
3. Probability (30 pts)
(a) Random variables X and Y have a joint distribution p(x, y). Prove the following results.
You can assume continuous distributions for simplicity.
i. $E[X] = E_Y[E_X[X|Y]]$
ii. $E[I[X \in C]] = P(X \in C)$, where $I[X \in C]$ is the indicator function of an arbitrary set $C$, i.e., $I[X \in C] := 1$ if $X \in C$ and $0$ otherwise.
iii. $\mathrm{var}[X] = E_Y[\mathrm{var}_X[X|Y]] + \mathrm{var}_Y[E_X[X|Y]]$
iv. If $X$ and $Y$ are independent, then $E[XY] = E[X]E[Y]$.
v. If $X$ and $Y$ take values in $\{0, 1\}$ and $E[XY] = E[X]E[Y]$, then $X$ and $Y$ are independent.
(b) Show that the approximate randomized counting algorithm described in Lemma 1 of Lecture 2 slides (page 14) is unbiased:
$$E[\tilde{n}] = n. \qquad (1)$$
(c) Prove the variance formula in Lemma 2 of Lecture 2 slides (page 38) for Approximate Matrix Multiplication $AB \approx CR$:
$$\mathrm{Var}\left[(CR)_{ij}\right] = \frac{1}{m} \sum_{k=1}^{d} \frac{A_{ik}^2 B_{kj}^2}{p_k} - \frac{1}{m} (AB)_{ij}^2, \qquad (2)$$
where $\{p_k\}_{k=1}^{d}$ are the sampling probabilities.
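A quick Monte Carlo check of the estimator can be a useful companion to the proofs in (b) and (c). The sketch below is a rough illustration, assuming Python with NumPy and sampling probabilities proportional to the column/row norm products (a common choice; the estimator stays unbiased for any strictly positive probabilities): it averages many draws of $CR$ and compares the empirical mean against $AB$.

import numpy as np

rng = np.random.default_rng(0)
n, d, m, trials = 20, 50, 10, 20000

A = rng.standard_normal((n, d))
B = rng.standard_normal((d, n))
AB = A @ B

# Sampling probabilities proportional to ||A_{:,k}|| * ||B_{k,:}|| (assumed choice).
p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
p = p / p.sum()

acc = np.zeros_like(AB)
for _ in range(trials):
    ks = rng.choice(d, size=m, p=p)
    scale = 1.0 / np.sqrt(m * p[ks])
    C = A[:, ks] * scale               # n x m: rescaled sampled columns of A
    R = B[ks, :] * scale[:, None]      # m x n: rescaled sampled rows of B
    acc += C @ R

print("max |empirical mean of CR - AB|:", np.abs(acc / trials - AB).max())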
4. Positive (Semi-)Definite Matrices (15 pts)
Let $A$ be a real, symmetric $d \times d$ matrix. We say $A$ is positive semi-definite (PSD) if, for all $x \in \mathbb{R}^d$, $x^T A x \geq 0$. We say $A$ is positive definite (PD) if, for all $x \neq 0$, $x^T A x > 0$. We write $A \succeq 0$ when $A$ is PSD, and $A \succ 0$ when $A$ is PD.
The spectral theorem says that every real symmetric matrix $A$ can be expressed as $A = U \Lambda U^T$, where $U$ is a $d \times d$ matrix such that $U U^T = U^T U = I$ (called an orthogonal matrix), and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$. Multiplying on the right by $U$, we see that $AU = U\Lambda$. If we let $u_i$ denote the $i$th column of $U$, we have $A u_i = \lambda_i u_i$ for each $i$. This expression reveals that the $\lambda_i$ are eigenvalues of $A$, and the corresponding columns $u_i$ are eigenvectors associated with $\lambda_i$.
(a) $A$ is PSD iff $\lambda_i \geq 0$ for each $i$.
(b) $A$ is PD iff $\lambda_i > 0$ for each $i$.
Hint: Use the following representation:
$$U \Lambda U^T = \sum_{i=1}^{d} \lambda_i u_i u_i^T.$$
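As a numerical illustration of this representation (not a substitute for the proof), the following sketch, assuming Python with NumPy, checks that the eigendecomposition of a random symmetric matrix reconstructs $A$ both as $U\Lambda U^T$ and as $\sum_i \lambda_i u_i u_i^T$, and reads off PSD-ness from the eigenvalue signs.

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                          # random real symmetric matrix

lam, U = np.linalg.eigh(A)                 # eigenvalues lam, orthogonal U
A_from_ULUt = U @ np.diag(lam) @ U.T
A_from_sum = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(len(lam)))

print(np.allclose(A, A_from_ULUt), np.allclose(A, A_from_sum))
print("A is PSD?", bool(np.all(lam >= -1e-12)))   # PSD iff all eigenvalues >= 0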
5. Norms (20 pts)
(a) For $p = 1, 2, \infty$, verify that the functions $\|\cdot\|_p$ are norms. Then, for a vector $x \in \mathbb{R}^n$, show that
$$\|x\|_\infty \leq \|x\|_2 \leq \|x\|_1 \leq \sqrt{n}\,\|x\|_2 \leq n\,\|x\|_\infty$$
and, for each inequality, provide an example demonstrating that the inequality can be tight.
(b) For vectors $x, y \in \mathbb{R}^n$, show that $|x^T y| \leq \|x\|_2 \|y\|_2$, with equality if and only if $x$ and $y$ are linearly dependent. More generally, show that $x^T y \leq \|x\|_1 \|y\|_\infty$. Note that this implies $\|x\|_2^2 \leq \|x\|_1 \|x\|_\infty$, and that these are special cases of Hölder's inequality.
(c) For $A \in \mathbb{R}^{m \times n}$, show that $\mathrm{Trace}(A^T A) = \sum_{ij} A_{ij}^2$, and show that $\big(\sum_{ij} A_{ij}^2\big)^{1/2}$ is a norm on $m \times n$ matrices. This is the Frobenius norm, denoted $\|\cdot\|_F$. Show that, in addition to satisfying the defining properties of a norm, the Frobenius norm is submultiplicative, in that
$$\|AB\|_F \leq \|A\|_F \|B\|_F$$
whenever the dimensions are such that the product $AB$ is defined.
(d) Recall the definition of the spectral norm of an $m \times n$ matrix $A$: $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)} = \sigma_{\max}(A)$, where $\lambda_{\max}(A^T A)$ is the largest eigenvalue of $A^T A$ and $\sigma_{\max}(A)$ is the largest singular value of $A$. Show that the Frobenius norm and the spectral norm are unitarily invariant: if $U$ and $V$ are unitary (orthogonal in the real case) matrices, then $\|U^T A V\|_\xi = \|A\|_\xi$ for $\xi = 2, F$.
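These are proof exercises, but a quick numerical sanity check can catch slips in an argument. The sketch below, assuming Python with NumPy, verifies the chain of vector-norm inequalities from (a) and the unitary invariance from (d) on random instances.

import numpy as np

rng = np.random.default_rng(0)

# (a) vector-norm inequalities on a random vector
x = rng.standard_normal(10)
n = x.size
inf_n, two_n, one_n = np.linalg.norm(x, np.inf), np.linalg.norm(x, 2), np.linalg.norm(x, 1)
print(inf_n <= two_n <= one_n <= np.sqrt(n) * two_n <= n * inf_n)

# (d) unitary invariance of the spectral and Frobenius norms
A = rng.standard_normal((6, 4))
U, _ = np.linalg.qr(rng.standard_normal((6, 6)))   # random orthogonal U
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal V
B = U.T @ A @ V
print(np.isclose(np.linalg.norm(B, 2), np.linalg.norm(A, 2)),
      np.isclose(np.linalg.norm(B, "fro"), np.linalg.norm(A, "fro")))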
6. Approximate Matrix Multiplication (20 pts)
Here, we will consider the empirical performance of random sampling and random projection algorithms for approximating the product of two matrices. You may use Matlab, C, R, or any other software package you prefer for your implementation. Please be sure to describe what you used in sufficient detail that someone else could reproduce your results.
Let $A$ be an $n \times d$ matrix, with $n \gg d$, and consider approximating the product $A^T A$. First, generate the matrices $A$ from one of the three classes of distributions introduced below.
• Generate a matrix $A$ from the multivariate normal $N(1_d, \Sigma)$, where the $(i, j)$th element of $\Sigma$ is $\Sigma_{ij} = 2 \times 0.5^{|i-j|}$. (Refer to this as GA data.)
• Generate a matrix $A$ from the multivariate $t$-distribution with 3 degrees of freedom and covariance matrix $\Sigma$ as before. (Refer to this as T3 data.)
• Generate a matrix $A$ from the multivariate $t$-distribution with 1 degree of freedom and covariance matrix $\Sigma$ as before. (Refer to this as T1 data.)
To start, consider matrices of size $n \times d$ equal to $500 \times 50$. (So, you should have three matrices, one matrix $A$ generated in each of the above ways.)
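One possible way to generate the three matrices is sketched below, assuming Python with NumPy; the multivariate $t$ draws are formed by dividing multivariate normal draws by $\sqrt{\chi^2_\nu / \nu}$, which is one standard construction and an assumption here rather than a prescribed method.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 50

idx = np.arange(d)
Sigma = 2 * 0.5 ** np.abs(idx[:, None] - idx[None, :])   # Sigma_ij = 2 * 0.5^{|i-j|}
mean = np.ones(d)

def multivariate_t(nu, size):
    # Multivariate t with nu degrees of freedom: normal draws scaled by sqrt(nu / chi2).
    z = rng.multivariate_normal(np.zeros(d), Sigma, size=size)
    chi2 = rng.chisquare(nu, size=(size, 1))
    return mean + z * np.sqrt(nu / chi2)

A_ga = rng.multivariate_normal(mean, Sigma, size=n)      # GA data
A_t3 = multivariate_t(3, n)                              # T3 data
A_t1 = multivariate_t(1, n)                              # T1 data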
(a) For each matrix, approximate the product $A^T A$ with the random sampling algorithm we discussed in class, i.e., by sampling with respect to a probability distribution that depends on the squared norms of the rows of the input matrix. Plot the probability distribution. Does it look uniform or nonuniform? Plot the spectral and Frobenius norm errors as a function of the number of samples. (A minimal sampling sketch is provided after part (c).)
(b) For each matrix, approximate the product $A^T A$ with the random sampling algorithm we discussed in class, except that the uniform distribution, rather than the norm-squared distribution, should be used to construct the random sample. Plot the spectral and Frobenius norm errors as a function of the number of samples. For which matrices are the results similar to, and for which are they different from, those obtained with the norm-squared distribution?
(c) Now you will implement the matrix approximation technique on the MNIST dataset for handwritten digit classification. Details about the MNIST dataset can be found at http://yann.lecun.com/exdb/mnist/. We provide the dataset as a .mat file so that you can easily import it into Matlab by using load('mnist matrix.mat'). To import the dataset in Python you can use:
import scipy.io
data = scipy.io.loadmat('mnist matrix.mat')
In the .mat file you will find one matrix, $A \in \mathbb{R}^{60000 \times 784}$. For this matrix, approximate the product $A^T A$ with the random sampling algorithm we discussed in class, i.e., by sampling with respect to a probability distribution that depends on the squared norms of the rows of the input matrix. Plot the probability distribution. Does it look uniform or nonuniform? Plot the spectral and Frobenius norm errors as a function of the number of samples.
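The sketch below is a minimal illustration of the norm-squared row sampling approximation of $A^T A$ referenced in parts (a) and (c), assuming Python with NumPy. The exact estimator used here, sampling row $i$ with probability $p_i = \|A_{i,:}\|_2^2 / \|A\|_F^2$ and rescaling the sampled rows by $1/\sqrt{m p_i}$, is my reading of the class algorithm, so check the details against the lecture slides; the random data below is only a stand-in for the GA/T3/T1 or MNIST matrices.

import numpy as np

def sample_rows(A, m, probs, rng):
    # Sample m rows of A with the given probabilities and rescale them so that
    # S.T @ S is an unbiased estimate of A.T @ A.
    idx = rng.choice(A.shape[0], size=m, p=probs)
    return A[idx, :] / np.sqrt(m * probs[idx])[:, None]

def approx_errors(A, m, probs, rng):
    # Return (spectral error, Frobenius error) of the sampled approximation.
    S = sample_rows(A, m, probs, rng)
    E = A.T @ A - S.T @ S
    return np.linalg.norm(E, 2), np.linalg.norm(E, "fro")

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 50))                   # stand-in for the real data matrices

row_norms_sq = np.einsum("ij,ij->i", A, A)
p_norm = row_norms_sq / row_norms_sq.sum()           # norm-squared distribution, parts (a), (c)
p_unif = np.full(A.shape[0], 1.0 / A.shape[0])       # uniform distribution, part (b)

for m in (10, 50, 200):
    print(m, approx_errors(A, m, p_norm, rng), approx_errors(A, m, p_unif, rng))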