CS5487 Programming Assignment 2 Clustering


In this programming assignment you will implement and test several clustering algorithms on both
synthetic and real data. Given a set of points X = {x1, · · · , xn}, where xi ∈ R^d, the goal is to
assign a cluster label to each point, yi ∈ {1, · · · , K}, where K is the number of clusters. You will
consider three clustering algorithms:
1. K-means algorithm – The K-means algorithm calculates the cluster centers µj using the
current assignment to each cluster (K is assumed to be known). In each iteration, K-means
performs the following two steps:

Cluster Assignment:
\[
z_{ij} = \begin{cases} 1, & j = \arg\min_{k \in \{1,\dots,K\}} \|x_i - \mu_k\|^2 \\ 0, & \text{otherwise.} \end{cases} \tag{1}
\]

Estimate Center:
\[
\mu_j = \frac{\sum_{i=1}^{n} z_{ij} x_i}{\sum_{i=1}^{n} z_{ij}}. \tag{2}
\]

The cluster label for point xi is the label of the closest cluster center, yi = argmax_j zij.
2. EM algorithm for Gaussian mixture models (EM-GMM) – The EM algorithm calculates a
maximum likelihood estimate of a GMM with K components, with parameters {πj, µj, Σj},
j = 1, . . . , K. K is also assumed to be known. The E- and M-steps are:

E-step:
\[
\hat{z}_{ij} = p(z_i = j \mid x_i) = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}. \tag{3}
\]

M-step:
\[
\hat{N}_j = \sum_{i=1}^{n} \hat{z}_{ij}, \qquad \hat{\pi}_j = \frac{\hat{N}_j}{n}, \qquad \hat{\mu}_j = \frac{1}{\hat{N}_j} \sum_{i=1}^{n} \hat{z}_{ij} x_i, \tag{4}
\]
\[
\hat{\Sigma}_j = \frac{1}{\hat{N}_j} \sum_{i=1}^{n} \hat{z}_{ij} (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T. \tag{5}
\]

After convergence, the cluster label for xi is the component with the largest posterior probability,
yi = argmax_j ẑij. The document “gmm-tips”, available on the course website, contains
some useful tips on implementing EM for GMMs.
3. Mean-shift algorithm – The mean-shift algorithm uses gradient ascent with an adaptive
step-size to find local peaks in a kernel density estimate of X. We will use a Gaussian kernel
with bandwidth h. Given an initial point x̂^(0) (where the superscript denotes the iteration
number), the mean-shift update rule is:

\[
\hat{x}^{(t+1)} = \frac{\sum_{i=1}^{n} x_i \, \mathcal{N}(x_i \mid \hat{x}^{(t)}, h^2 I)}{\sum_{i=1}^{n} \mathcal{N}(x_i \mid \hat{x}^{(t)}, h^2 I)}. \tag{6}
\]

To obtain the clustering, the mean-shift algorithm is run using each data point xi as an initial
point. After convergence, the points that converged to the same local peak are given the same
cluster label. In this case, the number of clusters K is not fixed (and depends on the selected
bandwidth h).
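
For concreteness, here is a minimal sketch of the update rules (1)-(6) in Python/NumPy (the environment used by the assignment's Python support files). The function names, the row-wise data layout (one point per row of X), and the use of scipy.stats for the Gaussian density are illustrative choices, not part of the provided code; initialization, convergence checks, and (for mean-shift) merging converged points into clusters are left to you.

import numpy as np
from scipy.stats import multivariate_normal

# --- K-means updates, equations (1)-(2) ---
def kmeans_step(X, mu):
    """X: (n, d) data, mu: (K, d) current centers. Returns assignments and new centers."""
    # equation (1): assign each point to its closest center
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
    Z = np.eye(mu.shape[0])[dists.argmin(axis=1)]                 # one-hot z_ij
    # equation (2): each center becomes the mean of its assigned points
    # (empty clusters are not handled in this sketch)
    mu_new = (Z.T @ X) / Z.sum(axis=0)[:, None]
    return Z, mu_new

# --- EM-GMM updates, equations (3)-(5) ---
def em_gmm_step(X, pi, mu, Sigma):
    """One E-step and M-step; see the gmm-tips document for numerical issues."""
    n, K = X.shape[0], len(pi)
    # equation (3): posterior responsibilities z_ij
    resp = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                            for j in range(K)])
    resp /= resp.sum(axis=1, keepdims=True)
    # equations (4)-(5): re-estimate the mixture parameters
    Nj = resp.sum(axis=0)
    pi_new = Nj / n
    mu_new = (resp.T @ X) / Nj[:, None]
    Sigma_new = np.stack([(resp[:, j, None] * (X - mu_new[j])).T @ (X - mu_new[j]) / Nj[j]
                          for j in range(K)])
    return resp, pi_new, mu_new, Sigma_new

# --- Mean-shift update, equation (6) ---
def meanshift_step(X, x_hat, h):
    """One mean-shift step from the current point x_hat with bandwidth h."""
    # Gaussian kernel N(x_i | x_hat, h^2 I); its normalizing constant cancels in the ratio
    w = np.exp(-((X - x_hat) ** 2).sum(axis=1) / (2 * h ** 2))
    return (w[:, None] * X).sum(axis=0) / w.sum()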
Problem 1 Clustering synthetic data
In the first problem, you will test these clustering methods on synthetic data, and examine how each
method performs on different configurations of data. The zip file PA2-cluster-data.zip contains
the synthetic data files. Inside the MATLAB data file cluster_data.mat (or cluster_data_*.txt
if you are not using MATLAB) are three datasets (points and ground-truth labels). The file
contains the following variables:
• dataA_X – dataset A points. Each column is a data point, xi ∈ R^2.
• dataA_Y – dataset A true labels, {yi}.
• dataB_X – dataset B points.
• dataB_Y – dataset B true labels.
• dataC_X – dataset C points.
• dataC_Y – dataset C true labels.
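For example, a minimal way to load these variables in Python (the file path and the underscore spelling of the variable names are assumptions based on the list above) is:

import scipy.io

# load the synthetic data (variable names assumed as listed above)
data = scipy.io.loadmat('cluster_data.mat')
X = data['dataA_X']            # 2 x n matrix: each column is a point xi in R^2
Y = data['dataA_Y'].ravel()    # ground-truth labels, used only for evaluation
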
The three datasets are shown in the following figure, where the color represents the ground-truth
label.
[Figure: scatter plots of dataA, dataB, and dataC in the (x1, x2) plane, with both axes ranging from −15 to 15; color indicates the ground-truth label.]
The goal is to discover the clusters in each dataset using the data points X.
(a) Implement the three clustering algorithms. You will reuse these algorithms in the next problem,
so try to make them as general as possible.
(b) Run the algorithms on the three synthetic datasets. Qualitatively, how does each clustering
algorithm perform on each dataset? Comment on the advantages and limitations of each
algorithm, in terms of the configuration of the data.
(c) How sensitive is mean-shift to the bandwidth parameter h?
. . . . . . . . .
Problem 2 A real-world clustering problem – image segmentation
In this problem we will consider a real-world clustering problem, image segmentation, where the
goal is to find homogeneous regions in the image, which hopefully belong to objects or object parts.
Below is one of the test images and an example segmentation using K-means (in the right image,
each segment is colored based on the average color within the segment).
[Figure: three panels showing the original image, the K-means segments, and the segments colored by the average color within each segment.]
To segment the image, we first extract feature vectors from each pixel in the image (or a subset of
pixels along a regular grid). In particular, we form a window centered around each pixel and extract
a four-dimensional feature vector, x = [u, v, cy, cx]^T, where (u, v) are the average chrominance values
(color without brightness) in the window, and (cx, cy) is the pixel location (center of the window
in x-y coordinates). The feature values for the example image are shown below (each subplot
represents one feature dimension over the entire image).
[Figure: four subplots showing each feature dimension (x1, x2, x3, x4) over the entire example image.]
Next, a clustering algorithm is used to group the feature vectors, and a cluster label is assigned to
each corresponding pixel, forming the segmentation.
The zip file PA2-cluster-images.zip contains the images and MATLAB code for extracting the
features and creating the segmentation image from the cluster labels. The following MATLAB
functions are provided:
• getfeatures.m – MATLAB function for extracting features from an image.
• labels2segm.m – MATLAB function for generating a segmentation image from the cluster
labels.
• colorsegm.m – MATLAB function to create a nicely colored segmentation image.
As an example, the following code was used to generate the above segmentation images:
% read the image and view it
img = imread('images/12003.jpg');
subplot(1,3,1); imagesc(img); axis image;
% extract features (stepsize = 7)
[X, L] = getfeatures(img, 7);
XX = [X(1:2,:) ; X(3:4,:)/10]; % downscale the coordinate features (see part (b))
% run kmeans (this is the MATLAB version; you have to write your own!)
[Y, C] = kmeans(XX', 2);
% make a segmentation image from the labels
segm = labels2segm(Y, L);
subplot(1,3,2); imagesc(segm); axis image;
% color the segmentation image
csegm = colorsegm(segm, img);
subplot(1,3,3); imagesc(csegm); axis image
Documentation for each function can be viewed in MATLAB using “help getfeatures”, etc.
For Python users, the file PA2-cluster-python.zip contains Python versions of the above
demo code and helper functions for extracting features. This Python code requires the numpy,
scipy, matplotlib, and Image modules. Alternatively, the file PA2-cluster-imfeatures-txt.zip
contains text files of the extracted image features for a step size of 7.
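
For Python users, a rough equivalent of the MATLAB demo might look like the sketch below. It assumes the helper functions in PA2-cluster-python.zip mirror the MATLAB names (getfeatures, labels2segm, colorsegm) and live in a module here called pa2; both the module name and the exact interfaces are assumptions, and the kmeans2 call is only a stand-in for your own implementation.

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from scipy.cluster.vq import kmeans2                   # stand-in clustering only
from pa2 import getfeatures, labels2segm, colorsegm    # module name is an assumption

# read the image and view it
img = np.array(Image.open('images/12003.jpg'))
plt.subplot(1, 3, 1); plt.imshow(img)

# extract features (stepsize = 7) and downscale the coordinate features (see part (b))
X, L = getfeatures(img, 7)
XX = np.vstack([X[0:2, :], X[2:4, :] / 10])

# stand-in clustering call; replace with your own K-means
_, Y = kmeans2(XX.T, 2)
# adjust the label base if the helpers expect MATLAB-style 1-based labels
segm = labels2segm(Y + 1, L)
plt.subplot(1, 3, 2); plt.imshow(segm)

csegm = colorsegm(segm, img)
plt.subplot(1, 3, 3); plt.imshow(csegm)
plt.show()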
(a) Use the three clustering algorithms to segment a few of the provided images. Qualitatively,
which algorithm gives better results? How do the results change with different K and h?
Which is less sensitive to changes in the parameters? Comment on any interesting properties
or limitations observed about the clustering algorithms.
The feature vector x contains two types of features that are on different scales; the chrominance
values (u, v) range from 0-255, while the pixel locations (cx, cy) range from 0-512. The EM-GMM
clustering algorithm can adapt to the different scales of each feature by changing the covariance
matrix. On the other hand, K-means and mean-shift assume a fixed isotropic covariance matrix.
We can modify these two algorithms to allow different scaling of the features:
• For K-means, the distance to a cluster center x′ is modified to apply a weighting between the
feature types,
\[
d(x, x') = \left\| \begin{bmatrix} u \\ v \end{bmatrix} - \begin{bmatrix} u' \\ v' \end{bmatrix} \right\|^2 + \lambda \left\| \begin{bmatrix} c_x \\ c_y \end{bmatrix} - \begin{bmatrix} c'_x \\ c'_y \end{bmatrix} \right\|^2, \tag{7}
\]
where λ ≥ 0 controls the weighting.
• For mean-shift, the kernel is modified to use separate bandwidths on each feature type,
\[
k(x, x') = \frac{1}{(2\pi)^2 h_p^2 h_c^2} \exp\left( -\frac{1}{2h_c^2} \left\| \begin{bmatrix} u \\ v \end{bmatrix} - \begin{bmatrix} u' \\ v' \end{bmatrix} \right\|^2 - \frac{1}{2h_p^2} \left\| \begin{bmatrix} c_x \\ c_y \end{bmatrix} - \begin{bmatrix} c'_x \\ c'_y \end{bmatrix} \right\|^2 \right), \tag{8}
\]
where hc is the bandwidth for the color features, and hp is the bandwidth for the pixel
locations.
(b) Modify your K-means and mean-shift implementations to allow different feature scaling. Hint:
changing the distance in (7) or kernel in (8) is equivalent to scaling each dimension in the feature
vector x by an appropriate amount. Rerun the segmentation experiment. Do the segmentation
results improve?
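
Following the hint, one way to realize (7) (sketched below in Python; the same idea with factors 1/hc and 1/hp, and h = 1, realizes (8)) is to rescale the feature matrix and then run the unmodified algorithm:

import numpy as np

def scale_features(X, lam):
    """Rescale a 4 x n feature matrix [u; v; cy; cx] so that the plain squared
    Euclidean distance between scaled points equals the weighted distance (7)."""
    Xs = np.array(X, dtype=float)          # copy, so the original features are untouched
    Xs[2:4, :] *= np.sqrt(lam)             # weight the coordinate features by sqrt(lambda)
    return Xs

With this scaling, the demo code's X(3:4,:)/10 corresponds to λ = 1/100.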
. . . . . . . . .
Problem 3 Quantitative Evaluation of Clustering (Optional)
In this optional problem, you will quantitatively evaluate your results from Problems 1 and 2.
Consider a set S = {s1, · · · , sn} with n elements, and two partitions of S into subsets, Y =
{Y1, · · · , YR} and Z = {Z1, · · · , ZC}, where the Yi are the subsets of Y, and similarly for the Zj and Z.
The Rand index can be used to quantitatively measure the amount of agreement between
the two clusterings of the set S. Intuitively, the Rand index is the probability of pair-wise
agreement between Y and Z, i.e., the probability that any two items are treated consistently
by the two partitions (placed in the same cluster by both, or in different clusters by both).
Formally, the Rand index is calculated as follows. Define the following counts:
• a – the number of pairs in S that are in the same set in Y and in the same set in Z,
• b – the number of pairs in S that are in different sets in Y and in different sets in Z,
• c – the number of pairs in S that are in the same set in Y and in different sets in Z,
• d – the number of pairs in S that are in different sets in Y and in the same set in Z,
then the Rand index is the fraction of pairwise agreements,
\[
\text{Rand} = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}}. \tag{9}
\]
The Rand index can be efficiently calculated from a contingency table:

                           Partition Z
                    Z1     Z2    ...    ZC   | Sums
  Partition Y  Y1   n11    n12   ...    n1C  |  r1
               Y2   n21    n22   ...    n2C  |  r2
               ...  ...    ...   ...    ...  |  ...
               YR   nR1    nR2   ...    nRC  |  rR
             Sums   c1     c2    ...    cC   |  n
Each entry nij is the number of items in S that are common to Yi and Zj , and cj and ri are the
sums over the jth column and ith row, respectively. Using the contingency table, the number of
agreements (the numerator in (9)) can be calculated as
\[
a + b = \binom{n}{2} + \sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij}^2 - \frac{1}{2}\left( \sum_{i=1}^{R} r_i^2 + \sum_{j=1}^{C} c_j^2 \right). \tag{10}
\]