# MATH5473 A Mathematical Introduction to Data Science Homework 2. Random Matrix Theory and PCA

\$30.00

## Description

5/5 - (1 vote)

1. Phase transition in PCA “spike” model: Consider a finite sample of n i.i.d vectors x1, x2, . . . , xn
drawn from the p-dimensional Gaussian distribution N (0, σ2
Ip×p + λ0uuT
), where λ0/σ2
is
the signal-to-noise ratio (SNR) and u ∈ R
p
. In class we showed that the largest eigenvalue λ
of the sample covariance matrix Sn
Sn =
1
n
Xn
i=1
xix
T
i
pops outside the support of the Marcenko-Pastur distribution if
λ0
σ
2
>

γ,
or equivalently, if
SNR >
r
p
n
.
(Notice that √γ < (1 + √γ)
2
, that is, λ0 can be “buried” well inside the support MarcenkoPastur distribution and still the largest eigenvalue pops outside its support). All the following
questions refer to the limit n → ∞ and to almost surely values:
(a) Find λ given SNR >
√γ.
(b) Use your previous answer to explain how the SNR can be estimated from the eigenvalues
of the sample covariance matrix.
(c) Find the squared correlation between the eigenvector v of the sample covariance matrix
(corresponding to the largest eigenvalue λ) and the “true” signal component u, as a
function of the SNR, p and n. That is, find |hu, vi|2
.
(d) Confirm your result using MATLAB, Python, or R simulations (e.g. set u = e; and
choose σ = 1 and λ0 in different levels. Compute the largest eigenvalue and its associated
eigenvector, with a comparison to the true ones.)
1
Homework 2. Random Matrix Theory and PCA 2
2. Exploring S&P500 Stock Prices: Take the Standard & Poor’s 500 data:
https://github.com/yao-lab/yao-lab.github.io/blob/master/data/snp452-data.mat
which contains the data matrix X ∈ R
p×n of n = 1258 consecutive observation days and
p = 452 daily closing stock prices, and the cell variable “stock” collects the names, codes, and
the affiliated industrial sectors of the 452 stocks. Use Matlab, Python, or R for the following
exploration.
(a) Take the logarithmic prices Y = log X;
(b) For each observation time t ∈ {1, . . . , 1257}, calculate logarithmic price jumps
∆Yi,t = Yi,t − Yi,t−1, i ∈ {1, . . . , 452};
(c) Construct the realized covariance matrix Σˆ ∈ R
452×452 by,
Σˆ
i,j =
1
1257
1257
X
τ=1
∆Yi,τ∆Yj,τ ;
(d) Compute the eigenvalues (and eigenvectors) of Σ and store them in a descending order ˆ
by {λˆ
k, k = 1, . . . , p}.
(e) Horn’s Parallel Analysis: the following procedure describes a so-called Parallel Analysis
of PCA using random permutations on data. Given the matrix [∆Yi,t], apply random
permutations πi
: {1, . . . , t} → {1, . . . , t} on each of its rows: ∆Y˜
i,πi(j)
such that
[∆Y˜
π(i),t] =

∆Y1,1 ∆Y1,2 ∆Y1,3 . . . ∆Y1,t
∆Y2,π2(1) ∆Y2,π2(2) ∆Y2,π2(3) . . . ∆Y2,π2(t)
∆Y3,π3(1) ∆Y3,π3(2) ∆Y3,π3(3) . . . ∆Y3,π3(t)
. . . . . . . . . . . . . . .
∆Yn,πn(1) ∆Yn,πn(2) ∆Yn,πn(3) . . . ∆Yn,πn(t)

.
Define Σ = ˜ 1
t ∆Y˜ · ∆Y˜ T as the null covariance matrix. Repeat this for R times and
compute the eigenvalues of Σ˜
r for each 1 ≤ r ≤ R. Evaluate the p-value for each
estimated eigenvalue λˆ
k by (Nk+1)/(R+1) where Nk is the counts that λˆ
k is less than the
k-th largest eigenvalue of Σ˜
r over 1 ≤ r ≤ R. Eigenvalues with small p-values indicate
that they are less likely arising from the spectrum of a randomly permuted matrix
and thus considered to be signal. Draw your own conclusion with your observations
and analysis on this data. A reference is: Buja and Eyuboglu, ”Remarks on Parallel
Analysis”, Multivariate Behavioral Research, 27(4): 509-540, 1992.
3. *Finite rank perturbations of random symmetric matrices: Wigner’s semi-circle law (proved
by Eugene Wigner in 1951) concerns the limiting distribution of the eigenvalues of random
symmetric matrices. It states, for example, that the limiting eigenvalue distribution of n × n
symmetric matrices whose entries wij on and above the diagonal (i ≤ j) are i.i.d Gaussians
N (0,
1
4n
) (and the entries below the diagonal are determined by symmetrization, i.e., wji =
wij ) is the semi-circle:
p(t) = 2
π
p
1 − t
2, −1 ≤ t ≤ 1,
where the distribution is supported in the interval [−1, 1].
Homework 2. Random Matrix Theory and PCA 3
(a) Confirm Wigner’s semi-circle law using MATLAB, Python, or R simulations (take, e.g.,
n = 400).
(b) Find the largest eigenvalue of a rank-1 perturbation of a Wigner matrix. That is, find
the largest eigenvalue of the matrix
W + λ0uuT
,
where W is an n × n random symmetric matrix as above, and u is some deterministic
unit-norm vector. Determine the value of λ0 for which a phase transition occurs. What
is the correlation between the top eigenvector of W + λ0uuT and the vector u as a
function of λ0? Use techniques similar to the ones we used in class for analyzing finite
rank perturbations of sample covariance matrices.
[Some Hints about homework] For Wigner Matrix W = [wij ]n×n, wij = wji, wij ∼ N(0, √σ
n
),