## Description

1. RPCA: Construct a random rank-r matrix: let A ∈ R

m×n with aij ∼ N (0, 1) whose top-r

singular value/vector is λi

, ui ∈ R

m and vi ∈ R

n

(i = 1, . . . , r), define L =

Pr

i=1 uiv

T

i

. Construct a sparse matrix E with p percentage (p ∈ [0, 1]) nonzero entries distributed uniformly.

Then define

M = L + E.

(a) Set m = n = 20, r = 1, and p = 0.1, use Matlab toolbox CVX to formulate a semidefinite program for Robust PCA of M:

min

1

2

(trace(W1) + trace(W2)) + λkSk1 (1)

s.t. Lij + Sij = Xij , (i, j) ∈ E

W1 L

L

T W2

0,

where you can use the matlab implementation in lecture notes as a reference;

(b) Choose different parameters p ∈ [0, 1] to explore the probability of successful recover;

(c) Increase r to explore the probability of successful recover;

(d) ?

Increase m and n to values beyond 50 will make CVX difficult to solve. In this

case, use the Augmented Lagrange Multiplier method, e.g. in E. J. Candes, X. Li, Y.

Ma, and J. Wright (2009) “Robust Principal Component Analysis?”. Journal of ACM,

58(1), 1-37. Make a code yourself (just a few lines of Matlab or Python) and test it for

m = n = 1000. A convergence criterion often used can be kM − Lˆ − SˆkF /kMkF ≤

( = 10−6

for example).

2. SPCA: Define three hidden factors:

V1 ∼ N (0, 290), V2 ∼ N (0, 300), V3 = −0.3V1 + 0.925V2 + , ∼ N (0, 1),

where V1, V2, and are independent. Construct 10 observed variables as follows

Xi = Vj +

j

i

,

j

i ∼ N (0, 1),

with j = 1 for i = 1, . . . , 4, j = 2 for i = 5, . . . , 8, and j = 3 for i = 9, 10 and

j

i

independent

for j = 1, 2, 3, i = 1, . . . , 10.

The first two principal components should be concentrated on (X1, X2, X3, X4) and (X5, X6, X7, X8),

respectively. This is an example given by H. Zou, T. Hastie, and R. Tibshirani, Sparse principal component analysis, J. Comput. Graphical Statist., 15 (2006), pp. 265-286.

1

Homework 5. SDP Extensions of PCA and MDS 2

(a) Compute the true covariance matrix Σ (and the sample covariance matrix with n examples, say n = 1000);

(b) Compute the top 4 principal components of Σ using eigenvector decomposition (by

Matlab or R);

(c) Use Matlab CVX toolbox to compute the first sparse principal component by solving

the SDP problem

max trace(ΣX) − λkXk1

s.t. trace(X) = 1

X 0

Choose λ = 0 and other positive numbers to compare your results with normal PCA;

(d) Remove the first sparse PCA from Σ and compute the second sparse PCA with the same

code;

(e) Again compute the 3rd and the 4th sparse PCA of Σ and compare them against the

normal PCAs.

(f) ? Construct an example with 200 observed variables which is hard to deal with by

CVX. In this case, try the Augmented Lagrange Multiplier method by Allen Yang et al.

(UC Berkeley) whose Matlab codes can be found at http://www.eecs.berkeley.edu/

~yang/software/SPCA/SPCA_ALM.zip, or Python scikit-learn Sparse PCA package.

3. ?Protein Folding: Consider the 3D structure reconstruction based on incomplete MDS with

uncertainty. Data file:

http://yao-lab.github.io/data/protein3D.zip

Figure 1: 3D graphs of file PF00018 2HDA.pdf (YES HUMAN/97-144, PDB 2HDA)

In the file, you will find 3D coordinates for the following three protein families:

PF00013 (PCBP1 HUMAN/281-343, PDB 1WVN),

Homework 5. SDP Extensions of PCA and MDS 3

PF00018 (YES HUMAN/97-144, PDB 2HDA), and

PF00254 (O45418 CAEEL/24-118, PDB 1R9H).

For example, the file PF00018 2HDA.pdb contains the 3D coordinates of alpha-carbons for a

particular amino acid sequence in the family, YES HUMAN/97-144, read as

VALYDYEARTTEDLSFKKGERFQIINNTEGDWWEARSIATGKNGYIPS

where the first line in the file is

97 V 0.967 18.470 4.342

Here

• ‘97’: start position 97 in the sequence

• ‘V’: first character in the sequence

• [x, y, z]: 3D coordinates in unit ˚A.

Figure 1 gives a 3D representation of its structure.

Given the 3D coordinates of the amino acids in the sequence, one can computer pairwise

distance between amino acids, [dij ]

l×l where l is the sequence length. A contact map is

defined to be a graph Gθ = (V, E) consisting l vertices for amino acids such that and edge

(i, j) ∈ E if dij ≤ θ, where the threshold is typically θ = 5˚A or 8˚A here.

Can you recover the 3D structure of such proteins, up to an Euclidean transformation (rotation

and translation), given noisy pairwise distances restricted on the contact map graph Gθ, i.e.

given noisy pairwise distances between vertex pairs whose true distances are no more than θ?

Design a noise model (e.g. Gaussian or uniformly bounded) for your experiments.

When θ = ∞ without noise, classical MDS will work; but for a finite θ with noisy measurements, SDP approach can be useful. You may try the matlab package SNLSDP by Kim-Chuan

Toh, Pratik Biswas, and Yinyu Ye, or the facial reduction speed up by Nathan Krislock and

Henry Wolkowicz.