## Description

1 Clustering

In the lectures, we discussed k-means. Given a set of data points {xn}N

n=1, the method minimizes

the following distortion measure (or objective or clustering cost):

D = X

N

n=1

X

K

k=1

rnkkxn µkk2

2

where µk is the prototype of the k-th cluster. rnk is a binary indicator variable. If xn is assigned

to the cluster k, rnk is 1 otherwise rnk is 0. For each cluster, µk is the representative for all the

data points assigned to that cluster.

(a) In the lecture, we showed but did not prove that, µk is the mean of all such data points. That

is why the method has means in its name and we keep referring to µk as means, centroids,

etc. You will prove this rigorously next. Assuming all rnk are known (that is, you know the

assignments of all N data points), show that if µk is the mean of all data points assigned

to the cluster k, for any k, then the objective D is minimized. This justifies the iterative

procedure of k-means1.

(b) We now change the distortion measure to

D = X

N

n=1

X

K

k=1

rnkkxn µkk1

In other words, the measurement of “closeness” to the cluster prototype is now using L1 norm

(kzk1 = P

d |zd|).

Under this new cost function, show the µk that minimizes D can be interpreted as the

elementwise median of all data points assigned to the k-th cluster. (The elementwise median

of a set of vectors is defined as a vector whose d-th element is the median of all vectors’ d-th

elements.)

(c) Now assume that we apply a mapping (x) to map data points into feature space. Then, we

define the objective function of kernel K-means as

D˜ = X

N

n=1

X

K

k=1

rnkk(xn) µ˜kk2

2,

where µ˜k is the center of the cluster k in the feature space.

• Show that the D˜ can be represented in terms of only kernel K(xi, xj ) = (xi)T(xj).

• Describe and write down the equation of assigning a data point to its cluster. You

answer should only consist of kernel value K(xi, xj ).

• Write down the pseudo-code of the complete kernel K-means algorithm including initialization of cluster centers.

1More rigorously, one would also need to show that if all µk are known, then rnk can be computed by assigning

xn to the nearest µk. You are not required to do so.

1

2 Gaussian Mixture Model

Let our data be generated from a mixture of two univariate Gaussian distributions, where f(x|✓1)

is a Gaussian with mean µ1 = 0 and 2

1 = 1, and f(x|✓2) is a Gaussian with mean µ2 = 0 and

2

2 = 0.5. The only unknown parameter is the mixing parameter ↵ (which specifies the prior

probability of ✓1.). Now we observe a single sample x1, please write out the likelihood function of

x1 as a function of ↵, and determine the maximum likelihood estimation of ↵.

3 EM algorithm

Zero-inflated Poisson regression is used to model count data that has an excess of zero counts. For

example, the number of insurance claims within a population for a certain type of risk would be

zero-inflated by those people who have not taken out insurance against the risk and thus are unable

to claim. The observed data probability of observation i is:

p(xi) = (

⇡ + (1 ⇡)e if xi = 0

(1 ⇡)xi e

xi! if xi > 0.

Your task in this problem is to design an EM to estimate the parameter ⇡ and from observed

data {xi}N

i=1.

(a) Define a proper hidden variable zi for the observations (Hint: you only need hidden variables

for some observations) and use them to write down the complete likelihood function.

(b) Write down the update equations for both the E-Step and the M-step.

2

4 Programming

In this problem, you will implement three di↵erent clustering methods, K-means, Kernel K-means

and Gaussian Mixture Model. You will evaluate the performance of your method on two synthetic

datasets.

4.1 Data

You are provided with two datasets, hw5 blob.csv and hw5 circle.csv. Both datasets have two

dimensions.

4.2 Implement k-means

As we studied in the class, k-means tries to minimize the following distortion measure (or objective

function):

J = X

N

n=1

X

K

k=1

rnk||xn µk||2

2

where rnk is an indicator variable:

rnk = 1 if and only if xn belongs to cluster k

and (µ1,…, µk) are the cluster centers with the same dimension of data points.

(a) Implement k-means using random initialization for cluster centers. The algorithm should run

until none of the cluster assignments are changed. Run the algorithm for di↵erent values of

K 2 {2, 3, 5}, and plot the clustering assignments by di↵erent colors and markers. (you need

to report 6 plots, 3 for each dataset.)

(b) The k-means algorithm fails to separate the two circles in the hw5 circle.csv dataset. Please

explain why this happens.

4.3 Implement kernel k-means

Implement kernel k-means you derived in Problem 2 and evaluate it on the hw5 circle.csv dataset.

You should choose a kernel that can separate the two circles.

(a) Write down the choice of your kernel.

(b) Implement kernel k-means using random initialization for cluster centers (randomly pick

data points as centers of the clusters). The algorithm should run until none of the cluster

assignments are changed. Run the algorithm for K = 2, and plot the clustering assignments

by di↵erent colors and markers. (you need to report 1 plot)

4.4 Implement Gaussian Mixture Model

In this problem, you need to implement the EM algorithm to fit a Gaussian Mixture model on the

hw5 blob.csv dataset.

3

(a) Run 5 times of your EM algorithm with number of components K = 3, and plot the log

likelihood of the data over iterations of EM for each of runs. (The x-axis is the number of

iterations, and the y-axis is the log likelihood of the data given current model parameters.

Please plot all five curves in the same figure)

(b) For the best run in terms of log likelihood, (1) Plot the most likely cluster assignment of each

data point indicated by di↵erent colors and markers. (2) Report the mean and co-variance

matrix of all the three Gaussian components.

5 Submission Instructions

Submission Instruction: You need to provide the followings:

• Provide your all your answers that are required for grading in hardcopy. The graders

are not required to check your answers in the softcopy. The papers need to be stapled

and submitted into the locker#19 on the first floor of PHE building.

• Provide your answers to problems in *.pdf file, named as

CSCI567 hw5 fall16.pdf. You need to submit the homework in both hardcopy (at the

collection locker #19 at the PHE building 1st floor by 5:00pm of the deadline date)

and electronic version as *.pdf file on Blackboard. If you choose handwriting instead of

typing all the answers, you will get 40% points deducted.

• Submit ALL the code and report via Blackboard. The only acceptable language is

MATLAB or Python2.7. For your program, you MUST include the main script called

CSCI567 hw5 fall16.m or CSCI567 hw5 fall16.py in the root of your folder. After

running this main file, your program should be able to generate all of the results needed

for this programming assignment, either as plots or console outputs. You can have

multiple files (i.e your sub-functions), however, the only requirement is that once we

unzip your folder and execute your main file, your program should execute correctly.

Please double-check your program before submitting. You should only submit one

*.zip file. No other formats are allowed except *.zip file. Also, please name it as

[lastname] [firstname] hw2 fall16.zip.

Collaboration You may collaborate. However, collaboration has to be limited to discussion only

and you need to write your own solution and submit separately. You also need to list with

whom you have discussed.

4