1 Clustering
1.1 K-means Clustering (14 points)
1. (6 Points) Given $n$ observations $X_1^n = \{X_1, \ldots, X_n\}$, $X_i \in \mathcal{X}$, the K-means objective is to find $K$ ($< n$) centres $\mu_1^K = \{\mu_1, \ldots, \mu_K\}$ and a rule $f : \mathcal{X} \to \{1, \ldots, K\}$ so as to minimize the objective
$$J(\mu_1^K, f; X_1^n) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}(f(X_i) = k)\,\|X_i - \mu_k\|^2 \qquad (1)$$
Let $J_K(X_1^n) = \min_{\mu_1^K, f} J(\mu_1^K, f; X_1^n)$. Prove that $J_K(X_1^n)$ is a non-increasing function of $K$.
2. (8 Points) Consider the K-means (Lloyd’s) clustering algorithm we studied in class. We terminate the
algorithm when there are no changes to the objective. Show that the algorithm terminates in a finite number
of steps.
1.2 Experiment (20 Points)
In this question, we will evaluate K-means clustering and GMM on a simple 2 dimensional problem. First, create a
two-dimensional synthetic dataset of 300 points by sampling 100 points each from the three Gaussian distributions
shown below:
$$P_a = \mathcal{N}\!\left(\begin{bmatrix} -1 \\ -1 \end{bmatrix},\ \sigma \begin{bmatrix} 2 & 0.5 \\ 0.5 & 1 \end{bmatrix}\right), \quad
P_b = \mathcal{N}\!\left(\begin{bmatrix} 1 \\ -1 \end{bmatrix},\ \sigma \begin{bmatrix} 1 & -0.5 \\ -0.5 & 2 \end{bmatrix}\right), \quad
P_c = \mathcal{N}\!\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix},\ \sigma \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}\right)$$
Here, σ is a parameter we will change to produce different datasets.
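For concreteness, here is a minimal sketch of the data generation in NumPy. The function name, seed handling, and label encoding are our own choices, and it assumes the covariances are $\sigma$ times the matrices written above.

```python
import numpy as np

def sample_synthetic(sigma, n_per_class=100, seed=0):
    """Sample n_per_class points from each of P_a, P_b, P_c for a given sigma.

    Returns X (3*n_per_class x 2) and integer labels y in {0, 1, 2} for {a, b, c}."""
    rng = np.random.default_rng(seed)
    means = [np.array([-1.0, -1.0]), np.array([1.0, -1.0]), np.array([0.0, 1.0])]
    covs = [sigma * np.array([[2.0, 0.5], [0.5, 1.0]]),
            sigma * np.array([[1.0, -0.5], [-0.5, 2.0]]),
            sigma * np.array([[1.0, 0.0], [0.0, 2.0]])]
    X = np.vstack([rng.multivariate_normal(m, C, size=n_per_class)
                   for m, C in zip(means, covs)])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y
```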
• First, implement K-means clustering and the expectation maximization (EM) algorithm for GMMs. Execute both methods on five synthetic datasets, generated as shown above with σ ∈ {0.5, 1, 2, 4, 8}. Then evaluate both methods on (i) the clustering objective (1) and (ii) the clustering accuracy. For each of the two criteria, plot the value achieved by each method against σ. (A minimal sketch of Lloyd's algorithm is given after the points breakdown below.)
• Both algorithms are only guaranteed to find a local optimum, so we recommend trying multiple restarts and picking the one with the lowest objective value (this is (1) for K-means and the negative log likelihood for GMMs). You may also experiment with a smart initialization strategy (such as kmeans++).
• To plot the clustering accuracy, you may treat the ‘label’ of points generated from distribution Pu as u, where u ∈ {a, b, c}. Assume that the cluster id i returned by a method is i ∈ {1, 2, 3}. Since clustering is an unsupervised learning problem, you should obtain the best possible mapping from {1, 2, 3} to {a, b, c} to compute the clustering accuracy. One way to do this is to take the cluster centers returned by the method (centroids for K-means, means for GMMs) and map each to the distribution with the closest mean.
Points breakdown: 7 points for the implementation of each method, 6 points for reporting the evaluation metrics.
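As referenced above, a minimal sketch of Lloyd's algorithm in NumPy follows. The function name, the initialization at k random data points, and the stopping test are our choices, not prescribed by the assignment; the GMM EM implementation follows the same alternating structure, with soft responsibilities in place of hard assignments.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate hard assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()  # init at k data points
    prev_obj = np.inf
    for _ in range(n_iter):
        # Assignment step: assign each point to its nearest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # n x k squared distances
        assign = d2.argmin(axis=1)
        obj = d2[np.arange(len(X)), assign].sum()                 # objective (1)
        if obj >= prev_obj:
            break  # no change in the objective => terminate
        prev_obj = obj
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            if np.any(assign == j):
                mu[j] = X[assign == j].mean(axis=0)
    return mu, assign, obj
```

For multiple restarts, call this with different seeds and keep the run with the smallest returned objective; the accuracy is then computed after mapping cluster ids to {a, b, c} as described above.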
2 Linear Dimensionality Reduction
2.1 Principal Components Analysis (10 points)
Principal Components Analysis (PCA) is a popular method for linear dimensionality reduction. PCA attempts to find a lower dimensional subspace such that, when the data is projected onto the subspace, as much of the information as possible is preserved. Say we have data $X = [x_1^\top; \ldots; x_n^\top] \in \mathbb{R}^{n \times D}$ where $x_i \in \mathbb{R}^D$. We wish to find a $d$ ($< D$) dimensional subspace $A = [a_1, \ldots, a_d] \in \mathbb{R}^{D \times d}$, such that $a_i \in \mathbb{R}^D$ and $A^\top A = I_d$, so as to maximize $\frac{1}{n}\sum_{i=1}^n \|A^\top x_i\|^2$.
1. (4 Points) Suppose we wish to find the first direction $a_1$ (such that $a_1^\top a_1 = 1$) to maximize $\frac{1}{n}\sum_i (a_1^\top x_i)^2$. Show that $a_1$ is the first right singular vector of $X$.
2. (6 Points) Given $a_1, \ldots, a_k$, let $A_k = [a_1, \ldots, a_k]$ and $\tilde{x}_i = x_i - A_k A_k^\top x_i$. We wish to find $a_{k+1}$ to maximize $\frac{1}{n}\sum_i (a_{k+1}^\top \tilde{x}_i)^2$. Show that $a_{k+1}$ is the $(k+1)$th right singular vector of $X$.
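A quick numerical sanity check (not a proof, and the variable names are ours): the top right singular vector of $X$ should achieve at least as large an average squared projection as any other unit vector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # rows are the x_i^T

# Right singular vectors of X are the rows of Vt, ordered by singular value.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
a1 = Vt[0]                               # candidate first direction

def avg_sq_projection(a, X):
    return np.mean((X @ a) ** 2)         # (1/n) * sum_i (a^T x_i)^2

# Compare against many random unit vectors: none should beat a1.
best_random = max(avg_sq_projection(v / np.linalg.norm(v), X)
                  for v in rng.normal(size=(2000, 5)))
print(avg_sq_projection(a1, X) >= best_random)   # expected: True
```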
2.2 Dimensionality reduction via optimization (22 points)
We will now motivate the dimensionality reduction problem from a slightly different perspective. The resulting algorithm has many similarities to PCA. We will refer to this method as DRO.
As before, you are given data $\{x_i\}_{i=1}^n$, where $x_i \in \mathbb{R}^D$. Let $X = [x_1^\top; \ldots; x_n^\top] \in \mathbb{R}^{n \times D}$. We suspect that the data actually lies approximately in a $d$ dimensional affine subspace. Here $d < D$ and $d < n$. Our goal, as in PCA, is to use this dataset to find a $d$ dimensional representation $z$ for each $x \in \mathbb{R}^D$. (We will assume that the span of the data has dimension larger than $d$, but our method should work whether $n > D$ or $n < D$.)
Let $z_i \in \mathbb{R}^d$ be the lower dimensional representation of $x_i$ and let $Z = [z_1^\top; \ldots; z_n^\top] \in \mathbb{R}^{n \times d}$. We wish to find parameters $A \in \mathbb{R}^{D \times d}$, $b \in \mathbb{R}^D$ and the lower dimensional representation $Z \in \mathbb{R}^{n \times d}$ so as to minimize
$$J(A, b, Z) = \frac{1}{n}\sum_{i=1}^n \|x_i - A z_i - b\|^2 = \frac{1}{n}\|X - Z A^\top - \mathbf{1} b^\top\|_F^2. \qquad (2)$$
Here, $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$ is the Frobenius norm of a matrix.
1. (3 Points) Let $M \in \mathbb{R}^{d \times d}$ be an arbitrary invertible matrix and $p \in \mathbb{R}^d$ be an arbitrary vector. Denote $A_2 = A_1 M^{-1}$, $b_2 = b_1 - A_1 M^{-1} p$ and $Z_2 = Z_1 M^\top + \mathbf{1} p^\top$. Show that both $(A_1, b_1, Z_1)$ and $(A_2, b_2, Z_2)$ achieve the same objective value $J$ in (2).
Therefore, in order to make the problem determined, we need to impose some constraint on $Z$. We will assume that the $z_i$'s have zero mean and identity covariance. That is,
$$\bar{Z} = \frac{1}{n}\sum_{i=1}^n z_i = \frac{1}{n} Z^\top \mathbf{1}_n = 0, \qquad S = \frac{1}{n}\sum_{i=1}^n z_i z_i^\top = \frac{1}{n} Z^\top Z = I_d$$
Here, $\mathbf{1}_n = [1, 1, \ldots, 1]^\top \in \mathbb{R}^n$ and $I_d$ is the $d \times d$ identity matrix.
2. (16 Points) Outline a procedure to solve the above problem. Specify how you would obtain $A$, $Z$, and $b$ which minimize the objective and satisfy the constraints.
Hint: The rank-$k$ approximation of a matrix in Frobenius norm is obtained by taking its SVD and then zeroing out all but the first $k$ singular values. (A short numerical illustration of this fact is given after these questions.)
3. (3 Points) You are given a point $x_*$ in the original $D$ dimensional space. State the rule to obtain the $d$ dimensional representation $z_*$ for this new point. (If $x_*$ is some original point $x_i$ from the $D$-dimensional space, it should be the $d$-dimensional representation $z_i$.)
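The hint in part 2 refers to the Eckart–Young theorem. Here is a short numerical illustration of rank-$k$ truncation (the function name and dimensions are our own, purely for demonstration):

```python
import numpy as np

def rank_k_approx(M, k):
    """Best rank-k approximation of M in Frobenius norm: truncate the SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

M = np.random.default_rng(0).normal(size=(50, 20))
M2 = rank_k_approx(M, 2)
print(np.linalg.matrix_rank(M2))            # 2
print(np.linalg.norm(M - M2, "fro") ** 2)   # squared Frobenius approximation error
```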
2.3 Dimensionality reduction via a generative model (42 points)
We will now study dimensionality reduction via a generative model. We will refer to this method as DRLV. We will assume a $d$ ($< n$) dimensional latent space and the following generative process for the data.
$$z \sim \mathcal{N}(0, I), \quad z \in \mathbb{R}^d$$
$$x \mid z \sim \mathcal{N}(Az + b,\ \eta^2 I), \quad x \in \mathbb{R}^D$$
The model says that we first sample from a $d$ dimensional Gaussian with zero mean and identity covariance. Then we map it to $D$ dimensions by computing $Az + b$. Finally, we add spherical Gaussian noise with variance $\eta^2$ on each dimension.
We will use an expectation maximization (EM) procedure to learn the parameters A, b, η. So far we have only
studied EM with discrete latent variables. In this problem, we will look at EM with a continuous latent variable
which has a parametric distribution. The following results will be useful.
Fact 1 (Conditional of a Gaussian). Say $(Y_1, Y_2)$, $Y_i \in \mathbb{R}^{d_i}$, is jointly Gaussian distributed,
$$\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\ \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix}\right)$$
Then, conditioned on $Y_1 = y_1$, the distribution of $Y_2$ is
$$Y_2 \mid Y_1 = y_1 \sim \mathcal{N}\!\left(\mu_2 + \Sigma_{12}^\top \Sigma_{11}^{-1} (y_1 - \mu_1),\ \Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12}\right)$$
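As a sanity check of Fact 1 (a sketch with arbitrary numbers and $d_1 = d_2 = 1$, not part of the assignment), the formula can be compared against an empirical conditional obtained by keeping joint samples whose $Y_1$ falls near $y_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])                       # (mu_1, mu_2)
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])      # [[S11, S12], [S12, S22]]
y1 = 0.5

# Conditional mean and variance of Y2 | Y1 = y1 from Fact 1.
cond_mean = mu[1] + Sigma[0, 1] / Sigma[0, 0] * (y1 - mu[0])
cond_var = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]

# Empirical conditional: keep samples whose Y1 lies in a small band around y1.
Y = rng.multivariate_normal(mu, Sigma, size=500_000)
kept = Y[np.abs(Y[:, 0] - y1) < 0.02, 1]
print(cond_mean, kept.mean())   # should agree to roughly two decimal places
print(cond_var, kept.var())
```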
Fact 2 (Some Matrix Derivatives). Let $X \in \mathbb{R}^{r \times c}$, $u \in \mathbb{R}^r$, and $v, w \in \mathbb{R}^c$. Then
$$\nabla_X\, v^\top X^\top u = u v^\top$$
$$\nabla_X\, v^\top X^\top X w = X(v w^\top + w v^\top)$$
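The two identities in Fact 2 can be checked numerically with central finite differences (a sketch with arbitrary shapes; the helper name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
r, c = 4, 3
X = rng.normal(size=(r, c))
u = rng.normal(size=r); v = rng.normal(size=c); w = rng.normal(size=c)

def num_grad(f, X, eps=1e-6):
    """Entrywise central-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# grad_X of v^T X^T u should equal u v^T.
print(np.allclose(num_grad(lambda A: v @ A.T @ u, X), np.outer(u, v)))
# grad_X of v^T X^T X w should equal X (v w^T + w v^T).
print(np.allclose(num_grad(lambda A: v @ A.T @ A @ w, X),
                  X @ (np.outer(v, w) + np.outer(w, v))))
```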
1. (10 Points) Assuming some given values for $A$, $b$, and $\eta^2$, write down the joint distribution of $(z, x)$. Use this to derive the marginal distribution of $x$ and the conditional distribution $z \mid x$.
2. (4 Points) Write down the log likelihood in terms of the parameters $A$, $b$, and $\eta$.
3. (4 Points) First obtain the Maximum Likelihood Estimate for b. This does not require EM.
4. (10 Points) To apply the EM algorithm, let $Q(z_i)$ denote some distribution over $z_i$ for each $z_i$. Obtain a lower bound on the log likelihood via Jensen's inequality.
5. (4 Points) Recall, from the lectures, that we chose $Q(z_i) = P(z_i \mid x_i)$ in the E-step to obtain the tightest possible lower bound on the log likelihood. Here, $P(z_i \mid x_i)$ is the conditional distribution of $z_i$ given $x_i$ under the current estimates for $A$, $b$, and $\eta$. Write down the E-step update for the next iteration.
N.B.: Unlike in GMMs, where the latent variable was discrete, here the latent variable is continuous. Fortunately, it has a parametric form, so we can represent $Q(z_i)$ using a finite number of parameters.
(Hint: See part 1.)
6. (10 Points) Now write down the M-step update for parameters A and η, obtained by maximizing the lower
bound obtained from parts 3 and 4.
2.4 Experiment (42 points)
Here we will compare the above three methods on two data sets.
• We will implement three variants of PCA:
1. "Buggy PCA": PCA applied directly to the matrix X.
2. "Demeaned PCA": we subtract the mean along each dimension before applying PCA.
3. "Normalized PCA": before applying PCA, we subtract the mean and scale each dimension so that the sample mean and standard deviation along each dimension are 0 and 1, respectively.
• One way to study how well the low dimensional representation $Z$ captures the linear structure in our data is to project $Z$ back to $D$ dimensions and look at the reconstruction error. For PCA, if we mapped a point to $d$ dimensions via $z = V x$, then the reconstruction is $V^\top z$. For the preprocessed versions, we first do this and then reverse the preprocessing steps as well. For DRO we just compute $Az + b$. For DRLV, we will use the posterior mean $\mathbb{E}[z \mid x]$ as the lower dimensional representation and $Az + b$ as the reconstruction. We will compare all the methods by the reconstruction error on the datasets.
• Please implement code for the five methods: Buggy PCA (just take the SVD of X), Demeaned PCA, Normalized PCA, DRO, and DRLV. In all cases your function should take in an $n \times D$ data matrix and $d$ as an argument. It should return the $d$ dimensional representations, the estimated parameters, and the reconstructions of these representations in $D$ dimensions. For DRLV, use the values obtained from DRO as initializations for $A$. Set $\eta$ based on the reconstruction errors of DRO. Use 10 iterations of EM. (A minimal sketch of the demeaned PCA variant is given after the point allocation below.)
• You are given two datasets: a two-dimensional dataset with 50 points, data2D.csv, and a thousand-dimensional dataset with 500 points, data1000D.csv.
• For the 2D dataset use d = 1. For the 1000D dataset, you need to choose d. For this, observe the singular values in DRO and see if there is a clear "knee point" in the spectrum. Attach any figures/statistics you computed to justify your choice.
• For the 2D dataset you need to attach a plot comparing the original points with the reconstructed points for all five methods. For both datasets you should also report the reconstruction errors, that is, the summed squared differences $\sum_{i=1}^n \|x_i - r(z_i)\|^2$, where the $x_i$'s are the original points and the $r(z_i)$'s are the $D$ dimensional points reconstructed from the $d$ dimensional representations $z_i$.
• Questions: After you have completed the experiments, please answer the following questions.
1. Look at the results for Buggy PCA. The reconstruction error is large and the reconstructed points don't seem to represent the original points well. Why is this?
Hint: Which subspace is Buggy PCA trying to project the points onto?
2. The error criterion we are using is the average squared error between the original points and the reconstructed points. In both examples DRO and demeaned PCA achieve the lowest error among all methods. Is this surprising? Why?
• Point allocation:
– Implementation of the three PCA methods: (10 Points)
– Implementation of DRO and DRLV: (20 points)
– Implementing reconstructions and reporting results: (5 points)
– Choice of d for 1000D dataset and appropriate justification: (3 Points)
– Questions (4 Points)
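As referenced in the implementation bullet above, here is a minimal sketch of the demeaned PCA variant in NumPy (the function names and return convention are our own reading of the requirements, not a reference solution). Buggy PCA is the same without the mean subtraction; normalized PCA additionally divides each dimension by its standard deviation before the SVD and multiplies it back during reconstruction.

```python
import numpy as np

def demeaned_pca(X, d):
    """Demeaned PCA: subtract the column means, project onto the top-d
    right singular vectors, and reconstruct in the original space."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:d].T                 # D x d; its transpose plays the role of V in z = V x
    Z = Xc @ V                   # n x d lower dimensional representations
    X_recon = Z @ V.T + mean     # map back to D dimensions, then undo the demeaning
    return Z, {"V": V, "mean": mean}, X_recon

def recon_error(X, X_recon):
    """Summed squared reconstruction error sum_i ||x_i - r(z_i)||^2."""
    return ((X - X_recon) ** 2).sum()
```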
Partial answers: These were our errors on all methods for the 2D dataset and the reconstructions obtained for Buggy PCA and Demeaned PCA. We have provided them so you can cross-check your solution.
Our implementation may have bugs, so if your answer does not tally, first double-check with your peers and then speak to the TA/Instructor.
Reconstruction Errors:
Buggy PCA: 0.886903
Demeaned PCA: 0.010006
Normalized PCA: 0.049472
DRO: 0.010006
DRLV: 0.010081
[Figure: scatter plots of the 2D dataset reconstructions for Buggy PCA (left) and Demeaned PCA (right); both axes run from 0 to 10.]
The blue circles are from the original dataset and the red crosses are the reconstructed points.