CSCI567 Homework #5

$30.00

Category:

Description

5/5 - (3 votes)

1 Clustering
In the lectures, we discussed k-means. Given a set of data points {xn}N
n=1, the method minimizes
the following distortion measure (or objective or clustering cost):
D = X
N
n=1
X
K
k=1
rnkkxn µkk2
2
where µk is the prototype of the k-th cluster. rnk is a binary indicator variable. If xn is assigned
to the cluster k, rnk is 1 otherwise rnk is 0. For each cluster, µk is the representative for all the
data points assigned to that cluster.
(a) In the lecture, we showed but did not prove that, µk is the mean of all such data points. That
is why the method has means in its name and we keep referring to µk as means, centroids,
etc. You will prove this rigorously next. Assuming all rnk are known (that is, you know the
assignments of all N data points), show that if µk is the mean of all data points assigned
to the cluster k, for any k, then the objective D is minimized. This justifies the iterative
procedure of k-means1.
(b) We now change the distortion measure to
D = X
N
n=1
X
K
k=1
rnkkxn µkk1
In other words, the measurement of “closeness” to the cluster prototype is now using L1 norm
(kzk1 = P
d |zd|).
Under this new cost function, show the µk that minimizes D can be interpreted as the
elementwise median of all data points assigned to the k-th cluster. (The elementwise median
of a set of vectors is defined as a vector whose d-th element is the median of all vectors’ d-th
elements.)
(c) Now assume that we apply a mapping (x) to map data points into feature space. Then, we
define the objective function of kernel K-means as
D˜ = X
N
n=1
X
K
k=1
rnkk(xn) µ˜kk2
2,
where µ˜k is the center of the cluster k in the feature space.
• Show that the D˜ can be represented in terms of only kernel K(xi, xj ) = (xi)T(xj).
• Describe and write down the equation of assigning a data point to its cluster. You
answer should only consist of kernel value K(xi, xj ).
• Write down the pseudo-code of the complete kernel K-means algorithm including initialization of cluster centers.
1More rigorously, one would also need to show that if all µk are known, then rnk can be computed by assigning
xn to the nearest µk. You are not required to do so.
1
2 Gaussian Mixture Model
Let our data be generated from a mixture of two univariate Gaussian distributions, where f(x|✓1)
is a Gaussian with mean µ1 = 0 and 2
1 = 1, and f(x|✓2) is a Gaussian with mean µ2 = 0 and
2
2 = 0.5. The only unknown parameter is the mixing parameter ↵ (which specifies the prior
probability of ✓1.). Now we observe a single sample x1, please write out the likelihood function of
x1 as a function of ↵, and determine the maximum likelihood estimation of ↵.
3 EM algorithm
Zero-inflated Poisson regression is used to model count data that has an excess of zero counts. For
example, the number of insurance claims within a population for a certain type of risk would be
zero-inflated by those people who have not taken out insurance against the risk and thus are unable
to claim. The observed data probability of observation i is:
p(xi) = (
⇡ + (1 ⇡)e if xi = 0
(1 ⇡)xi e
xi! if xi > 0.
Your task in this problem is to design an EM to estimate the parameter ⇡ and from observed
data {xi}N
i=1.
(a) Define a proper hidden variable zi for the observations (Hint: you only need hidden variables
for some observations) and use them to write down the complete likelihood function.
(b) Write down the update equations for both the E-Step and the M-step.
2
4 Programming
In this problem, you will implement three di↵erent clustering methods, K-means, Kernel K-means
and Gaussian Mixture Model. You will evaluate the performance of your method on two synthetic
datasets.
4.1 Data
You are provided with two datasets, hw5 blob.csv and hw5 circle.csv. Both datasets have two
dimensions.
4.2 Implement k-means
As we studied in the class, k-means tries to minimize the following distortion measure (or objective
function):
J = X
N
n=1
X
K
k=1
rnk||xn µk||2
2
where rnk is an indicator variable:
rnk = 1 if and only if xn belongs to cluster k
and (µ1,…, µk) are the cluster centers with the same dimension of data points.
(a) Implement k-means using random initialization for cluster centers. The algorithm should run
until none of the cluster assignments are changed. Run the algorithm for di↵erent values of
K 2 {2, 3, 5}, and plot the clustering assignments by di↵erent colors and markers. (you need
to report 6 plots, 3 for each dataset.)
(b) The k-means algorithm fails to separate the two circles in the hw5 circle.csv dataset. Please
explain why this happens.
4.3 Implement kernel k-means
Implement kernel k-means you derived in Problem 2 and evaluate it on the hw5 circle.csv dataset.
You should choose a kernel that can separate the two circles.
(a) Write down the choice of your kernel.
(b) Implement kernel k-means using random initialization for cluster centers (randomly pick
data points as centers of the clusters). The algorithm should run until none of the cluster
assignments are changed. Run the algorithm for K = 2, and plot the clustering assignments
by di↵erent colors and markers. (you need to report 1 plot)
4.4 Implement Gaussian Mixture Model
In this problem, you need to implement the EM algorithm to fit a Gaussian Mixture model on the
hw5 blob.csv dataset.
3
(a) Run 5 times of your EM algorithm with number of components K = 3, and plot the log
likelihood of the data over iterations of EM for each of runs. (The x-axis is the number of
iterations, and the y-axis is the log likelihood of the data given current model parameters.
Please plot all five curves in the same figure)
(b) For the best run in terms of log likelihood, (1) Plot the most likely cluster assignment of each
data point indicated by di↵erent colors and markers. (2) Report the mean and co-variance
matrix of all the three Gaussian components.
5 Submission Instructions
Submission Instruction: You need to provide the followings:
• Provide your all your answers that are required for grading in hardcopy. The graders
are not required to check your answers in the softcopy. The papers need to be stapled
and submitted into the locker#19 on the first floor of PHE building.
• Provide your answers to problems in *.pdf file, named as
CSCI567 hw5 fall16.pdf. You need to submit the homework in both hardcopy (at the
collection locker #19 at the PHE building 1st floor by 5:00pm of the deadline date)
and electronic version as *.pdf file on Blackboard. If you choose handwriting instead of
typing all the answers, you will get 40% points deducted.
• Submit ALL the code and report via Blackboard. The only acceptable language is
MATLAB or Python2.7. For your program, you MUST include the main script called
CSCI567 hw5 fall16.m or CSCI567 hw5 fall16.py in the root of your folder. After
running this main file, your program should be able to generate all of the results needed
for this programming assignment, either as plots or console outputs. You can have
multiple files (i.e your sub-functions), however, the only requirement is that once we
unzip your folder and execute your main file, your program should execute correctly.
Please double-check your program before submitting. You should only submit one
*.zip file. No other formats are allowed except *.zip file. Also, please name it as
[lastname] [firstname] hw2 fall16.zip.
Collaboration You may collaborate. However, collaboration has to be limited to discussion only
and you need to write your own solution and submit separately. You also need to list with
whom you have discussed.
4