MATH5473 A Mathematical Introduction to Data Science Homework 5. SDP Extensions of PCA and MDS

$30.00

Category: You will Instantly receive a download link for .zip solution file upon Payment || To Order Original Work Click Custom Order?

Description

5/5 - (1 vote)

1. RPCA: Construct a random rank-r matrix: let A ∈ R
m×n with aij ∼ N (0, 1) whose top-r
singular value/vector is λi
, ui ∈ R
m and vi ∈ R
n
(i = 1, . . . , r), define L =
Pr
i=1 uiv
T
i
. Construct a sparse matrix E with p percentage (p ∈ [0, 1]) nonzero entries distributed uniformly.
Then define
M = L + E.
(a) Set m = n = 20, r = 1, and p = 0.1, use Matlab toolbox CVX to formulate a semidefinite program for Robust PCA of M:
min
1
2
(trace(W1) + trace(W2)) + λkSk1 (1)
s.t. Lij + Sij = Xij , (i, j) ∈ E

W1 L
L
T W2

 0,
where you can use the matlab implementation in lecture notes as a reference;
(b) Choose different parameters p ∈ [0, 1] to explore the probability of successful recover;
(c) Increase r to explore the probability of successful recover;
(d) ?
Increase m and n to values beyond 50 will make CVX difficult to solve. In this
case, use the Augmented Lagrange Multiplier method, e.g. in E. J. Candes, X. Li, Y.
Ma, and J. Wright (2009) “Robust Principal Component Analysis?”. Journal of ACM,
58(1), 1-37. Make a code yourself (just a few lines of Matlab or Python) and test it for
m = n = 1000. A convergence criterion often used can be kM − Lˆ − SˆkF /kMkF ≤ 
( = 10−6
for example).
2. SPCA: Define three hidden factors:
V1 ∼ N (0, 290), V2 ∼ N (0, 300), V3 = −0.3V1 + 0.925V2 + ,  ∼ N (0, 1),
where V1, V2, and  are independent. Construct 10 observed variables as follows
Xi = Vj + 
j
i
, 
j
i ∼ N (0, 1),
with j = 1 for i = 1, . . . , 4, j = 2 for i = 5, . . . , 8, and j = 3 for i = 9, 10 and 
j
i
independent
for j = 1, 2, 3, i = 1, . . . , 10.
The first two principal components should be concentrated on (X1, X2, X3, X4) and (X5, X6, X7, X8),
respectively. This is an example given by H. Zou, T. Hastie, and R. Tibshirani, Sparse principal component analysis, J. Comput. Graphical Statist., 15 (2006), pp. 265-286.
1
Homework 5. SDP Extensions of PCA and MDS 2
(a) Compute the true covariance matrix Σ (and the sample covariance matrix with n examples, say n = 1000);
(b) Compute the top 4 principal components of Σ using eigenvector decomposition (by
Matlab or R);
(c) Use Matlab CVX toolbox to compute the first sparse principal component by solving
the SDP problem
max trace(ΣX) − λkXk1
s.t. trace(X) = 1
X  0
Choose λ = 0 and other positive numbers to compare your results with normal PCA;
(d) Remove the first sparse PCA from Σ and compute the second sparse PCA with the same
code;
(e) Again compute the 3rd and the 4th sparse PCA of Σ and compare them against the
normal PCAs.
(f) ? Construct an example with 200 observed variables which is hard to deal with by
CVX. In this case, try the Augmented Lagrange Multiplier method by Allen Yang et al.
(UC Berkeley) whose Matlab codes can be found at http://www.eecs.berkeley.edu/
~yang/software/SPCA/SPCA_ALM.zip, or Python scikit-learn Sparse PCA package.
3. ?Protein Folding: Consider the 3D structure reconstruction based on incomplete MDS with
uncertainty. Data file:
http://yao-lab.github.io/data/protein3D.zip
Figure 1: 3D graphs of file PF00018 2HDA.pdf (YES HUMAN/97-144, PDB 2HDA)
In the file, you will find 3D coordinates for the following three protein families:
PF00013 (PCBP1 HUMAN/281-343, PDB 1WVN),
Homework 5. SDP Extensions of PCA and MDS 3
PF00018 (YES HUMAN/97-144, PDB 2HDA), and
PF00254 (O45418 CAEEL/24-118, PDB 1R9H).
For example, the file PF00018 2HDA.pdb contains the 3D coordinates of alpha-carbons for a
particular amino acid sequence in the family, YES HUMAN/97-144, read as
VALYDYEARTTEDLSFKKGERFQIINNTEGDWWEARSIATGKNGYIPS
where the first line in the file is
97 V 0.967 18.470 4.342
Here
• ‘97’: start position 97 in the sequence
• ‘V’: first character in the sequence
• [x, y, z]: 3D coordinates in unit ˚A.
Figure 1 gives a 3D representation of its structure.
Given the 3D coordinates of the amino acids in the sequence, one can computer pairwise
distance between amino acids, [dij ]
l×l where l is the sequence length. A contact map is
defined to be a graph Gθ = (V, E) consisting l vertices for amino acids such that and edge
(i, j) ∈ E if dij ≤ θ, where the threshold is typically θ = 5˚A or 8˚A here.
Can you recover the 3D structure of such proteins, up to an Euclidean transformation (rotation
and translation), given noisy pairwise distances restricted on the contact map graph Gθ, i.e.
given noisy pairwise distances between vertex pairs whose true distances are no more than θ?
Design a noise model (e.g. Gaussian or uniformly bounded) for your experiments.
When θ = ∞ without noise, classical MDS will work; but for a finite θ with noisy measurements, SDP approach can be useful. You may try the matlab package SNLSDP by Kim-Chuan
Toh, Pratik Biswas, and Yinyu Ye, or the facial reduction speed up by Nathan Krislock and
Henry Wolkowicz.