- Home
- Uncategorized
- CMSC25400/STAT 27725 Homework 1

$30.00

Category: Uncategorized

Description

5/5 - (2 votes)

1. (30 points) As we saw in class, k-means clustering minimizes the average square distance distortion

Javg2 =

∑

k

j=1

∑

x∈Cj

d(x,mj )

2

, (1)

where d(x, x

′

) = ∥x−x

′∥, and Cj is the set of points belonging to cluster j. Another distortion function

that we mentioned is the intra-cluster sum of squared distances,

JIC =

∑

k

j=1

1

|Cj |

∑

x∈Cj

∑

x′∈Cj

d(x, x

′

)

2

.

(a) Given that in k-means, mj =

1

|Cj |

∑

x∈Cj

x, show that JIC = 2Javg2 .

(b) Let γi ∈ {1, 2, . . . , k} be the cluster that the i’th datapoint is assigned to, and assume that there

are n points in total, x1, x2, . . . , xn. Then (1) can be written as

Javg2 (γ1, . . . , γn,m1, . . . ,mk) = ∑n

i=1

d(xi

,mγi

)

2

. (2)

Recall that k-means clustering alternates the following two steps:

1. Update the cluster assignments:

γi ← argmin

j∈{1,2,…,k}

d(xi

,mj ) i = 1, 2, . . . , n.

2. Update the centroids:

mj ←

1

|Cj |

∑

i : γi=j

xi j = 1, 2, . . . , k.

Show that the first of these steps minimizes (2) as a function of γ1, . . . , γn, while holding m1, . . . ,mk

constant, while the second step minimizes it as a function of m1, . . . ,mk, while holding γ1, . . . , γn

constant. The notation “i : γi =j” should be read as “all i for which γi =j”.

(c) Prove that as k-means progresses, the distortion decreases monotonically iteration by iteration.

(d) Give an upper bound on the maximum number of iterations required for full convergence of the

algorithm, i.e., the point where neither the centroids, nor the cluster assignments change anymore.

1

2. (30 points) Implement the k-means algorithm in a language of your choice (MATLAB, Python, or R

are recommended), initializing the cluster centers randomly, as explained in the slides. The algorithm

should terminate when the cluster assignments (and hence, the centroids) don’t change anymore.

Note: if you use a relatively high level language like MATLAB or Python, your code can use linear

algebra primitives, such as matrix/vector multiplication, eigendecomposition, etc. directly. However,

you are expected to write you own k–means and k–means++ functions from scratch. Please don’t

submit code consisting of a single call to some pre-defined “kmeans” function.

(a) The toy dataset toydata.txt contains 500 points in R

2

coming from 3 well separated clusters.

Test your code on this data and plot the final result as a 2D plot in which each point’s color or

symbol indicates its cluster assignment. Note that because of the random initialization, different

runs may produce different results, and in some cases the algorithm might not correctly identify

the three clusters. Plot the value of the distortion function as a function of iteration number for

20 sepearate runs of the algorithm on the same plot. Comment on the plot.

(b) Now implement the k–means++ algorithm discussed in class and repeat part (a) using its result

as intialization (except for the 2D plot). Comment on the convergence behavior of k–means++

vs. the original algorithm.

3. (up to 20 points extra credit) You likely found that on the “toydata” dataset, most of the time

even vanilla k–means clustering produces acceptable solutions. Create a dataset of your own for which

(on average over many runs) k–means++ improves the performance of k–means by at least a factor of

10 in terms of the distortion function value of the final clustering. Submit the code that you used to

generate the dataset.

2