## Description

1. Phase transition in PCA “spike” model: Consider a finite sample of n i.i.d vectors x1, x2, . . . , xn

drawn from the p-dimensional Gaussian distribution N (0, σ2

Ip×p + λ0uuT

), where λ0/σ2

is

the signal-to-noise ratio (SNR) and u ∈ R

p

. In class we showed that the largest eigenvalue λ

of the sample covariance matrix Sn

Sn =

1

n

Xn

i=1

xix

T

i

pops outside the support of the Marcenko-Pastur distribution if

λ0

σ

2

>

√

γ,

or equivalently, if

SNR >

r

p

n

.

(Notice that √γ < (1 + √γ)

2

, that is, λ0 can be “buried” well inside the support MarcenkoPastur distribution and still the largest eigenvalue pops outside its support). All the following

questions refer to the limit n → ∞ and to almost surely values:

(a) Find λ given SNR >

√γ.

(b) Use your previous answer to explain how the SNR can be estimated from the eigenvalues

of the sample covariance matrix.

(c) Find the squared correlation between the eigenvector v of the sample covariance matrix

(corresponding to the largest eigenvalue λ) and the “true” signal component u, as a

function of the SNR, p and n. That is, find |hu, vi|2

.

(d) Confirm your result using MATLAB, Python, or R simulations (e.g. set u = e; and

choose σ = 1 and λ0 in different levels. Compute the largest eigenvalue and its associated

eigenvector, with a comparison to the true ones.)

1

Homework 2. Random Matrix Theory and PCA 2

2. Exploring S&P500 Stock Prices: Take the Standard & Poor’s 500 data:

https://github.com/yao-lab/yao-lab.github.io/blob/master/data/snp452-data.mat

which contains the data matrix X ∈ R

p×n of n = 1258 consecutive observation days and

p = 452 daily closing stock prices, and the cell variable “stock” collects the names, codes, and

the affiliated industrial sectors of the 452 stocks. Use Matlab, Python, or R for the following

exploration.

(a) Take the logarithmic prices Y = log X;

(b) For each observation time t ∈ {1, . . . , 1257}, calculate logarithmic price jumps

∆Yi,t = Yi,t − Yi,t−1, i ∈ {1, . . . , 452};

(c) Construct the realized covariance matrix Σˆ ∈ R

452×452 by,

Σˆ

i,j =

1

1257

1257

X

τ=1

∆Yi,τ∆Yj,τ ;

(d) Compute the eigenvalues (and eigenvectors) of Σ and store them in a descending order ˆ

by {λˆ

k, k = 1, . . . , p}.

(e) Horn’s Parallel Analysis: the following procedure describes a so-called Parallel Analysis

of PCA using random permutations on data. Given the matrix [∆Yi,t], apply random

permutations πi

: {1, . . . , t} → {1, . . . , t} on each of its rows: ∆Y˜

i,πi(j)

such that

[∆Y˜

π(i),t] =

∆Y1,1 ∆Y1,2 ∆Y1,3 . . . ∆Y1,t

∆Y2,π2(1) ∆Y2,π2(2) ∆Y2,π2(3) . . . ∆Y2,π2(t)

∆Y3,π3(1) ∆Y3,π3(2) ∆Y3,π3(3) . . . ∆Y3,π3(t)

. . . . . . . . . . . . . . .

∆Yn,πn(1) ∆Yn,πn(2) ∆Yn,πn(3) . . . ∆Yn,πn(t)

.

Define Σ = ˜ 1

t ∆Y˜ · ∆Y˜ T as the null covariance matrix. Repeat this for R times and

compute the eigenvalues of Σ˜

r for each 1 ≤ r ≤ R. Evaluate the p-value for each

estimated eigenvalue λˆ

k by (Nk+1)/(R+1) where Nk is the counts that λˆ

k is less than the

k-th largest eigenvalue of Σ˜

r over 1 ≤ r ≤ R. Eigenvalues with small p-values indicate

that they are less likely arising from the spectrum of a randomly permuted matrix

and thus considered to be signal. Draw your own conclusion with your observations

and analysis on this data. A reference is: Buja and Eyuboglu, ”Remarks on Parallel

Analysis”, Multivariate Behavioral Research, 27(4): 509-540, 1992.

3. *Finite rank perturbations of random symmetric matrices: Wigner’s semi-circle law (proved

by Eugene Wigner in 1951) concerns the limiting distribution of the eigenvalues of random

symmetric matrices. It states, for example, that the limiting eigenvalue distribution of n × n

symmetric matrices whose entries wij on and above the diagonal (i ≤ j) are i.i.d Gaussians

N (0,

1

4n

) (and the entries below the diagonal are determined by symmetrization, i.e., wji =

wij ) is the semi-circle:

p(t) = 2

π

p

1 − t

2, −1 ≤ t ≤ 1,

where the distribution is supported in the interval [−1, 1].

Homework 2. Random Matrix Theory and PCA 3

(a) Confirm Wigner’s semi-circle law using MATLAB, Python, or R simulations (take, e.g.,

n = 400).

(b) Find the largest eigenvalue of a rank-1 perturbation of a Wigner matrix. That is, find

the largest eigenvalue of the matrix

W + λ0uuT

,

where W is an n × n random symmetric matrix as above, and u is some deterministic

unit-norm vector. Determine the value of λ0 for which a phase transition occurs. What

is the correlation between the top eigenvector of W + λ0uuT and the vector u as a

function of λ0? Use techniques similar to the ones we used in class for analyzing finite

rank perturbations of sample covariance matrices.

[Some Hints about homework] For Wigner Matrix W = [wij ]n×n, wij = wji, wij ∼ N(0, √σ

n

),

the answer is

eigenvalue is λ = R +

1

R

eigenvector satisfies (u

T vˆ)

2 = 1 −

1

R2