# MATH5473 A Mathematical Introduction to Data Science Homework 4. Random Projections


## Description


1. SNPs of World-wide Populations: This dataset contains a data matrix $X \in \mathbb{R}^{n \times p}$ with
p = 650,000 columns of SNPs (Single Nucleotide Polymorphisms) and n = 1064 rows of people
from around the world (21 of these rows consist mostly of missing values). Each entry takes one of
three values, 0 (for 'AA'), 1 (for 'AC'), or 2 (for 'CC'), and missing values are marked by 9.
The data file is big (151MB zipped and 2GB as the original txt). Moreover, the following file contains
the region each person comes from, as well as two variables ind1 and ind2 such that
X(ind1, ind2) removes all missing values.
https://github.com/yao-lab/yao-lab.github.io/blob/master/data/HGDP_region.mat
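The indexing X(ind1, ind2) can be illustrated on a toy matrix. The Python sketch below assumes that ind1 and ind2 are 1-based MATLAB index vectors (an assumption, not confirmed by the handout); the toy matrix, the index values, and the helper name `select_clean` are hypothetical.

```python
import numpy as np

def select_clean(X, ind1, ind2):
    """Return the submatrix X(ind1, ind2), assuming 1-based MATLAB-style indices."""
    rows = np.asarray(ind1).ravel() - 1  # convert to 0-based row indices
    cols = np.asarray(ind2).ravel() - 1  # convert to 0-based column indices
    return X[np.ix_(rows, cols)]

# Toy genotype matrix: 4 people x 5 SNPs, with 9 marking a missing value.
X = np.array([
    [0, 1, 2, 9, 1],
    [9, 9, 9, 9, 9],  # a row that is entirely missing
    [2, 0, 1, 9, 0],
    [1, 1, 0, 2, 2],
])
ind1 = [1, 3, 4]     # hypothetical rows (people) to keep
ind2 = [1, 2, 3, 5]  # hypothetical columns (SNPs) to keep

X_clean = select_clean(X, ind1, ind2)
print(X_clean.shape)
print((X_clean == 9).any())  # False when no missing marker remains
```

With the real file, `scipy.io.loadmat` could supply the index arrays, provided the file is not stored in the v7.3 (HDF5) format.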
Another cleaned dataset is due to Quanhua MU and Yoonhee Nam:
• Genotyped data of the 1043 (n) subjects. 0(AA), 1(AC), 2(CC). Missing values are
removed, only autosomal SNPs were selected (p ≈ 400K). Google drive link:
A good reference for this data can be the following paper in Science,
http://www.sciencemag.org/content/319/5866/1100.abstract
Explore how the genetic variation of these individuals relates to their geographic variation, using MDS/PCA.
Since p is big, explore random projections for dimensionality reduction.
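One possible way to combine a random projection with PCA is sketched below in Python. It is only an illustration under stated assumptions: the genotype matrix is a small synthetic stand-in (three hypothetical populations with random allele frequencies), the sizes `n`, `p`, `k` are arbitrary, and a Gaussian projection is just one of several possible choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the genotype matrix: n people x p SNPs, entries in {0, 1, 2}.
n, p, k = 60, 5000, 200                      # k is the reduced dimension
labels = np.repeat([0, 1, 2], n // 3)        # three hypothetical populations
freqs = rng.uniform(0.1, 0.9, size=(3, p))   # population-specific allele frequencies
X = rng.binomial(2, freqs[labels]).astype(float)

# Center each SNP (column) before projecting.
Xc = X - X.mean(axis=0)

# Gaussian random projection, scaled so squared norms are preserved in expectation.
R = rng.standard_normal((p, k)) / np.sqrt(k)
Y = Xc @ R                                   # n x k compressed data

def top2_pca(M):
    """Top-2 principal component scores of M via SVD of the column-centered matrix."""
    M = M - M.mean(axis=0)
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :2] * s[:2]

def pairwise_sq_dists(M):
    """Squared Euclidean distances between all pairs of rows of M."""
    G = M @ M.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2 * G

Z_full = top2_pca(Xc)   # PCA scores from the full data
Z_proj = top2_pca(Y)    # PCA scores from the projected data

# Ratio of squared pairwise distances after vs. before projection (ideally close to 1).
iu = np.triu_indices(n, 1)
ratio = pairwise_sq_dists(Y)[iu] / pairwise_sq_dists(Xc)[iu]
print("distance ratio range:", ratio.min(), ratio.max())
```

Plotting `Z_full` and `Z_proj` colored by `labels` (or, for the real data, by region) gives a visual check of how much of the population structure survives the projection.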
2. Phase Transition in Compressed Sensing: Let $A \in \mathbb{R}^{n \times d}$ be a Gaussian random matrix, i.e.
$A_{ij} \sim \mathcal{N}(0, 1)$. In the following experiments, fix $d = 20$. For each $n = 1, \ldots, d$, and each
$k = 1, \ldots, d$, repeat the following procedure 50 times:
(a) Construct a sparse vector $x_0 \in \mathbb{R}^d$ with $k$ nonzero entries. The locations of the nonzero
entries are selected at random and each nonzero equals $\pm 1$ with equal probability;
(b) Draw a standard Gaussian random matrix $A \in \mathbb{R}^{n \times d}$, and set $b = A x_0$;
(c) Solve the following linear programming problem to obtain an optimal point $\hat{x}$,
$$\min_x \|x\|_1 := \sum_i |x_i| \quad \text{s.t.} \quad Ax = b,$$
for example, the MATLAB toolbox cvx can be an easy solver;
(d) Declare success if $\|\hat{x} - x_0\| \le 10^{-3}$;
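Step (c) is a linear program even though the objective is not linear as written. One standard way to see this (a sketch, with the auxiliary variables $u$ and $v$ introduced here for illustration) is to split $x$ into its positive and negative parts, $x = u - v$:

$$
\begin{aligned}
\min_{u, v \in \mathbb{R}^d} \quad & \sum_{i=1}^{d} (u_i + v_i) \\
\text{s.t.} \quad & A(u - v) = b, \\
& u \ge 0, \quad v \ge 0.
\end{aligned}
$$

At an optimum, at most one of $u_i$ and $v_i$ is nonzero for each $i$ (otherwise both could be reduced by the same amount without violating the constraints), so $u_i + v_i = |x_i|$ and $\hat{x} = \hat{u} - \hat{v}$ solves the original problem.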
After repeating 50 times, compute the success probability $p(n, k)$; draw a figure with k on the x-axis
and n on the y-axis to visualize the success probability. For example, the MATLAB command
imagesc(p) can be a choice.
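The whole experiment can be sketched in Python as follows. This is one possible implementation, not the prescribed one: it swaps cvx for `scipy.optimize.linprog` (assuming a SciPy version that provides the HiGHS solver), measures the error in the Euclidean norm (the handout does not say which norm), and uses the hypothetical helper names `l1_recover` and `success_probability`. To keep the run short, the example call uses smaller values than the assignment's d = 20 and 50 trials.

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(A, b):
    """Solve min ||x||_1 s.t. Ax = b as an LP by writing x = u - v with u, v >= 0."""
    n, d = A.shape
    c = np.ones(2 * d)                 # objective: sum(u) + sum(v)
    A_eq = np.hstack([A, -A])          # A u - A v = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    if not res.success:
        return None
    return res.x[:d] - res.x[d:]

def success_probability(d=20, trials=50, tol=1e-3, seed=0):
    """Return a d x d array p with p[n-1, k-1] = empirical success rate for (n, k)."""
    rng = np.random.default_rng(seed)
    p = np.zeros((d, d))
    for n in range(1, d + 1):
        for k in range(1, d + 1):
            wins = 0
            for _ in range(trials):
                # (a) k-sparse sign vector with a random support
                x0 = np.zeros(d)
                support = rng.choice(d, size=k, replace=False)
                x0[support] = rng.choice([-1.0, 1.0], size=k)
                # (b) Gaussian measurements
                A = rng.standard_normal((n, d))
                b = A @ x0
                # (c) l1 minimization, (d) success test (Euclidean norm assumed)
                x_hat = l1_recover(A, b)
                if x_hat is not None and np.linalg.norm(x_hat - x0) <= tol:
                    wins += 1
            p[n - 1, k - 1] = wins / trials
    return p

# Reduced-size run; the assignment's setting would be success_probability(d=20, trials=50).
p = success_probability(d=10, trials=10, seed=0)
print(np.round(p, 1))
```

Rows of `p` index n and columns index k, so with matplotlib (assumed installed) `plt.imshow(p, origin="lower")` plays the role of imagesc(p), with k on the x-axis and n on the y-axis.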
Can you try to give an analysis of the phenomenon observed? The following paper by Tropp
et al. may give you a good starting point to think.
• Dennis Amelunxen, Martin Lotz, Michael B. McCoy, Joel A. Tropp. Living on the
edge: Phase transitions in convex programs with random data. arXiv:1303.6672. URL:
https://arxiv.org/abs/1303.6672
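As a possible starting point for the analysis, the cited paper places the transition near the statistical dimension of the descent cone of the ℓ1 norm at x0, which it bounds above by d·ψ(k/d). The Python sketch below evaluates that curve, assuming ψ(ρ) = inf over τ ≥ 0 of ρ(1 + τ²) + (1 − ρ) ∫_τ^∞ (u − τ)² √(2/π) e^{−u²/2} du, which is the form given for the ℓ1 norm in that paper as best understood here; the formula should be checked against the paper before relying on it. The function name `psi` and the search interval for τ are choices made for this illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def psi(rho):
    """Numerically evaluate psi(rho) by minimizing over tau in [0, 10] (assumed wide enough)."""
    def objective(tau):
        # Closed form of the integral of (u - tau)^2 * sqrt(2/pi) * exp(-u^2 / 2) over u >= tau.
        tail = 2.0 * ((1.0 + tau**2) * norm.sf(tau) - tau * norm.pdf(tau))
        return rho * (1.0 + tau**2) + (1.0 - rho) * tail
    res = minimize_scalar(objective, bounds=(0.0, 10.0), method="bounded")
    return min(res.fun, 1.0)  # tau = 0 gives exactly 1, so psi never exceeds 1

d = 20
ks = np.arange(1, d + 1)
n_pred = np.array([d * psi(k / d) for k in ks])  # predicted transition location in n

for k, n_star in zip(ks, n_pred):
    print(k, round(float(n_star), 2))
```

Overlaying `n_pred` against `ks` on the empirical heat map of p(n, k) shows whether the observed boundary between failure and success follows the predicted curve.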