## Description

## 1 Conceptual Questions (10 points)

a. (L3, 2’) Given the same parameters, does DBSCAN always output the same cluster assignment or not? Briefly explain.

b. (L3, 2’) With an improper choice of initial cluster centers, k-Means can give us a very bad clustering output. Could you propose a method to alleviate the issue?

c. (L1, 2’) Between k-Medoids (PAM) and k-Means, which is more efficient? Briefly explain.

d. (L1, 2’) OPTICS is superior to DBSCAN because we do not have to set the parameter ε. True or False? Briefly explain.

e. (L1, 2’) Is it true that lazy learners (e.g., kNN) are more efficient than non-lazy learners (e.g., decision trees)?

## 2 k-Nearest Neighbors (10 points)

You will use a k-nearest-neighbor classifier to predict labels for new data points, and investigate under what situations kNN works well. The set of labeled data points is given in Table 1.

Purpose

• Understand how kNN works and its pros and cons.

Requirements

• No need to use distance-weighted labels of nearest neighbors.

• Show your calculations for questions asking for a number.

| id | x   | y   | label |
|----|-----|-----|-------|
| 1  | 0   | 0   | +1    |
| 2  | 0   | 1   | +1    |
| 3  | 1   | 0   | -1    |
| 4  | 1   | 1   | -1    |
| 5  | 1.5 | 0.5 | +1    |
| 6  | 2   | 0.5 | -1    |

Table 1: Data for kNN
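As a quick illustration of the unweighted majority-vote rule required above, the Table 1 data can be classified as follows (the helper name and query points are our own, not part of the assignment):

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Labeled points from Table 1: ((x, y), label)
points = [((0, 0), +1), ((0, 1), +1), ((1, 0), -1),
          ((1, 1), -1), ((1.5, 0.5), +1), ((2, 0.5), -1)]

def knn_predict(query, k):
    """Predict by unweighted majority vote over the k nearest neighbors."""
    neighbors = sorted(points, key=lambda p: dist(p[0], query))[:k]
    vote = sum(label for _, label in neighbors)
    return +1 if vote > 0 else -1

print(knn_predict((0.2, 0.8), k=1))  # 1 (nearest labeled point is id 2 at (0, 1))
print(knn_predict((2.0, 0.4), k=1))  # -1 (nearest labeled point is id 6)
```

Ties in the vote would need a tie-breaking rule; with k odd and two labels, as here, no tie can occur.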

a. (L2, 2’) Plot the given labeled data in the x-y plane, and highlight the regions where data will be classified as ‘+1’ by a 1NN classifier (i.e., kNN for k = 1).

b. (L2, 2’) Suppose we use a 3NN classifier based on 100 (labeled) data points uniformly distributed in a d-dimensional unit cube, i.e., [0, 1]^d. For d = 2 (i.e., the unit square), what is the minimum size of a d-dimensional cube centered at q for it to cover at least 3 neighbors of q in expectation? Show your steps. (Note that there is no need to consider the special case in which the query point q lies near the boundary of the unit cube.)

c. (L2, 2’) Calculate the minimum cube size asked in (b) for d = 100.
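As a numeric sanity check for (b) and (c): under the uniform-density assumption, the expected number of the n points inside an axis-aligned sub-cube of side s is n·s^d, so the smallest side covering k points in expectation is s = (k/n)^(1/d). A small sketch (interpreting "size" as side length; the function name is ours):

```python
def min_side(k, n, d):
    """Smallest cube side s with n * s**d >= k expected points inside,
    assuming n points uniform on the unit cube [0, 1]^d."""
    return (k / n) ** (1 / d)

print(min_side(3, 100, 2))    # ≈ 0.173
print(min_side(3, 100, 100))  # ≈ 0.966
```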

d. (L3, 2’) According to your calculations in (b) and (c), do you see any problem in applying a kNN classifier to data in high-dimensional spaces (e.g., d = 100)? Briefly explain. (Hint: consider the underlying principle of why kNN works generally.)

e. (L3, 2’) What are the pros and cons of using a large k for the kNN classifier?

## 3 Perceptrons (12 points)

You will design neurons and neural networks to implement some specified functions.

Purpose

• Understand how neurons and neural networks approximate functions.

• See the different modeling capacities of perceptrons and multi-layer neural networks (multilayer perceptrons).

Requirements

• Show the neurons and neural networks using diagrams.

a. (L2, 2’) Given x1, x2 ∈ {0, 1}, design a neuron that implements the logical operation AND. That is, the neuron takes x1 and x2 as input and computes y = x1 AND x2 as output (y ∈ {0, 1}). (Hint: use the activation function threshold(a) = 1 if a > 0 and 0 otherwise.)
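For reference, the hinted activation can be coded directly. The weights below are one possible (illustrative) choice, not the only valid design, and your answer should still be drawn as a diagram:

```python
def threshold(a):
    """Activation from the hint: 1 if a > 0, else 0."""
    return 1 if a > 0 else 0

def neuron(inputs, weights, bias):
    """A perceptron: weighted sum plus bias, then the threshold activation."""
    return threshold(sum(w * x for w, x in zip(weights, inputs)) + bias)

# One possible choice for AND: fire only when x1 + x2 exceeds 1.5,
# i.e., only when both inputs are 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, neuron((x1, x2), weights=(1, 1), bias=-1.5))
```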

b. (L2, 2’) Design a neuron that outputs y = x1 OR x2.

c. (L2, 2’) Design a neuron that takes a single input x ∈ {0, 1} and outputs y = NOT x.

d. (L3, 4’) Design a neural network that takes x1, x2 ∈ {0, 1} as input and computes y = x1 XOR x2 as output (x1 XOR x2 = 1 if and only if x1 ≠ x2). (Hint: you can reuse the neurons you designed in (a)–(c).)

e. (L3, 2’) Is it possible to implement XOR using a single-layer neural network (without any hidden layer)? Explain your answer.

## 4 Hierarchical Agglomerative Clustering and B-Cubed Evaluation (8 points)

Purpose

• Understand how AGNES and B-Cubed work.

Requirements

• In sub-question a, only draw the dendrogram.

• In sub-question b, only list the members of each cluster.

• In sub-question c, show detailed calculations of B-Cubed precision and recall.

Suppose we have 13 data points as listed and plotted in Figure 1. The ground truth (the correct clustering) is also provided.

| Point | x | y | Ground-truth cluster |
|-------|---|---|----------------------|
| P1    | 1 | 3 | C1                   |
| P2    | 1 | 2 | C1                   |
| P3    | 2 | 1 | C1                   |
| P4    | 2 | 2 | C1                   |
| P5    | 2 | 3 | C1                   |
| P6    | 3 | 2 | C1                   |
| P7    | 4 | 3 | C1                   |
| P8    | 6 | 3 | C2                   |
| P9    | 4 | 5 | C2                   |
| P10   | 5 | 4 | C2                   |
| P11   | 5 | 5 | C2                   |
| P12   | 6 | 4 | C2                   |
| P13   | 6 | 5 | C2                   |

Figure 1: Data and ground truth

a. (L2, 4’) Draw the dendrogram using AGNES. Please use single link and Euclidean distance as the dissimilarity measure.
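As a reminder of the dissimilarity measure used here, single link takes the smallest pairwise Euclidean distance between the two clusters. A minimal sketch (the example clusters are illustrative, not the dendrogram's actual merge sequence):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def single_link(cluster_a, cluster_b):
    """Single-link dissimilarity: the smallest Euclidean distance between
    any point of cluster_a and any point of cluster_b."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical intermediate clusters built from Figure 1 points:
print(single_link([(1, 3), (1, 2)], [(2, 1), (2, 2)]))  # 1.0 (P2 to P4)
```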

b. (L2, 2’) If we want to cluster the dataset into 3 groups based on the dendrogram, what are the members of each of the 3 groups?

c. (L2, 2’) Based on the given ground truth, what are the B-Cubed precision and recall of the output?

## 5 K-Means (8 points)

Purpose

• Understand how k-Means works.

Requirements

• In sub-questions a and b, for each iteration of k-Means:

– Annotate the data points in Figure 1 to show which points belong to which clusters, e.g., draw a red circle at each point belonging to cluster 1, and green and blue circles for clusters 2 and 3. (The file for the figure is provided in data.zip.)

– Plot the mean of each cluster with the same color you use for data points in this cluster, but in a different shape to differentiate them from data points.

– Show the coordinates of the cluster centers in each iteration.

– Do not include scanned pictures. You might use annotation or image-processing tools such as Mac Preview or Microsoft Paint to annotate the file.

– Do not show numerical distances in each iteration.

• In sub-question c, any reasonable method with brief explanation is acceptable.

We will use the same data as question 4 (Figure 1).

a. (L2, 4’) Perform k-means using Euclidean distance with k = 2 and the initial cluster centers P1 and P2.

b. (L2, 2’) Perform k-means using Euclidean distance with k = 2 and the initial cluster centers P2 and P12.

c. (L3, 2’) As you can see in sub-questions a and b, the choice of initial cluster centers can affect the speed of the clustering process. Assuming k = 2 and 2-dimensional data, suggest a general and reasonable way to select initial cluster centers to speed up the clustering process. Briefly justify your choice.
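The per-iteration procedure described in the requirements (assign each point to its nearest center, then recompute the means) can be sketched as follows, on toy data rather than the Figure 1 points:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans_step(points, centers):
    """One Lloyd iteration: assign each point to its nearest center
    (Euclidean), then recompute each center as its cluster's mean."""
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        clusters[i].append(p)
    new_centers = [
        tuple(sum(coord) / len(coord) for coord in zip(*pts)) if pts else centers[i]
        for i, pts in enumerate(clusters)
    ]
    return clusters, new_centers

# Toy data with two obvious groups:
pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, centers = kmeans_step(pts, [(0, 0), (5, 5)])
print(centers)  # [(0.0, 0.5), (5.0, 5.5)]
```

Iterating `kmeans_step` until the centers stop moving gives the full algorithm; for sub-questions a and b you record the assignment and the centers at each such iteration.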

## 6 Machine Problem (50 points)

Purposes

• Get a deeper understanding of, and working experience with, the DBSCAN algorithm and B-Cubed evaluation through implementation and result visualization.

• Get hands-on experience with cluster analysis using Weka.

General Requirements

• This is the second and last MP (not a mini one) we have in this course. It consists of several programming tasks, which usually take more time than written assignments, so please start early.

• This is an individual assignment, which means it is OK to discuss the methods with your classmates and TAs, but it is not OK to work together or share code.

• Libraries or programs for cluster analysis algorithms can be found online, but you are prohibited from using these resources directly. This means you cannot include external libraries or modify existing programs, since the purpose of this MP is to help you go through DBSCAN step by step.

• You can use Java/C++/Python as your programming language. No other languages are allowed.

• Each sub-question contains detailed requirements. Please note that most of them require you to not only submit code but also write discussions, report results, or draw graphs. If you submit code without answering those questions, you will receive 0 points.

• Put all your code in a separate folder named NetId_assign5_codes. Do not use sub-folders inside this folder. All of your code should compile successfully before submission. Do not include files other than the code you write. Put a single readme.txt file in the code folder to briefly describe your code and how to run it.

• Put the results you generate in another folder named NetId_assign5_results.

• Compress the folders NetId_assign5_codes and NetId_assign5_results into a single zip file, and name it NetId_assign5.zip. Submit both this zip file and the Answer Document NetId_assign5_answer.pdf through Compass2g. Please do not zip the PDF file!

• If you copy source code from other students or external sources, you will receive a serious penalty. We will run plagiarism detection software. Please do not cheat.

Description of Data

• You can find data-hw5.zip on the course website. It contains three files: data.txt, data.arff, and truth.txt.

• The description of data.txt is as follows:

– The first line contains an integer n as the number of data points.

– Each of the n following lines corresponds to a 2-dimensional data point and contains two floating-point numbers, the coordinates of the point, separated by a comma.

• File data.arff is a Weka-friendly version of data.txt.

• File truth.txt is the ground truth of clustering for the data in data.txt. It is hand-labeled by domain experts. The description of truth.txt is as follows:

– The first line contains an integer n as the number of data points.

– The second line contains an integer m as the number of clusters.

– Each of the n following lines corresponds to a 2-dimensional data point in data.txt and contains three numbers: the first two are the coordinates of the point, and the last is an integer c ∈ {1 … m} indicating that the point belongs to cluster c, or c = 0 if the point is an outlier. The three numbers are separated by commas. Please note that the name/index of each cluster is unimportant; it is just a notation to indicate which points belong to the same cluster.

Step 0: (0’) Normalizing Data.

The first step you need to do is preprocess the data. In this assignment, you should min-max normalize each dimension of the data. The formula is as follows:

x_normalized = (x − min_X) / (max_X − min_X)

We also provide you with the min-max-normalized data in file data-normalized-hw5.zip from the course website.
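A minimal sketch of the formula, assuming each dimension is normalized independently and that max_X > min_X:

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] via (x - min) / (max - min).
    Assumes the values are not all equal (max > min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# Apply it to each coordinate of the data separately, e.g. all x values:
print(min_max_normalize([1.0, 2.0, 4.0]))
```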

Step 1: (23’, L3) Implementing DBSCAN.

In this step, you need to implement DBSCAN:

• Name: dbscan.java, or dbscan.py, or dbscan.cpp

• Input

– Dataset file, e.g., /home/mike/mike_assign5/mike_assign5_data/data.txt

– Output file, e.g., /home/mike/mike_assign5/mike_assign5_results/step1.txt

– Parameter MinPts, e.g., 25

– Parameter ε, e.g., 0.065

• The structure of output file:

– The first line contains MinPts

– The second line contains ε

– The third line contains the number of clusters

– Each of the n following lines corresponds to a 2-dimensional data point in data.txt and contains three numbers: the first two are the coordinates of the point, and the last is an integer c ∈ {1 … m} if the point belongs to cluster c, or c = 0 if the point is an outlier. The three numbers are separated by commas.
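The output structure above can be produced, for example, like this (the file name and sample assignments are illustrative only):

```python
def write_output(path, min_pts, eps, assignments):
    """Write the step-1 format: MinPts, eps, number of clusters, then one
    'x,y,c' line per point, where c = 0 marks an outlier."""
    num_clusters = len({c for _, _, c in assignments if c != 0})
    with open(path, "w") as f:
        f.write(f"{min_pts}\n{eps}\n{num_clusters}\n")
        for x, y, c in assignments:
            f.write(f"{x},{y},{c}\n")

# Hypothetical assignments: two clustered points and one outlier.
write_output("step1.txt", 25, 0.065,
             [(0.1, 0.2, 1), (0.3, 0.4, 1), (0.9, 0.9, 0)])
```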

As it is hard for humans to interpret the output, you need to visualize it. In particular, draw a scatter plot of all the data points, and color the points according to the clustering. For example, all outlier points are black, all the points belonging to cluster 1 are red, all the points belonging to cluster 2 are blue, etc. You can create the plot using MATLAB, Excel, or any tool you like. You do not need to include the code for creating the graph.

Requirements:

• Put your DBSCAN source code in folder NetId_assign5_codes

• Run your program on the given data file with MinPts = 25 and ε = 0.065 so that we can check the correctness of your code by looking at the output file. Name the output file step1.txt, and put it in folder NetId_assign5_results.

• Include in your Answer Document two scatter plots, one for your output and one for the ground truth.

• Question to ponder A: How long does it take your machine to run DBSCAN on the given data? What is the time complexity of DBSCAN in the worst-case scenario? Can you describe a worst-case scenario?

• Question to ponder B: Usually, a good clustering algorithm should put similar points in the same cluster and dissimilar points in different clusters. Based on that intuition, what do you think of the clustering result? Given the same MinPts, should we increase or decrease ε to get more intuitive clustering results?

Step 2: (10’, L3) Parameter tuning for DBSCAN.

Given MinPts = 25, find two new values of ε such that the numbers of clusters in the corresponding results are different from each other and from that of the ε in Step 1.

Requirements:

• Run DBSCAN two times with the two ε described above, and create two output files with a format similar to step1.txt. Name them step2a.txt and step2b.txt, and put them in folder NetId_assign5_results

• Create a table in the Answer Document to report the number of clusters and the number of outliers corresponding to each of the two new ε and the ε in Step 1 (your table should have three rows, excluding the header).

• Draw two scatter plots corresponding to the two new ε. Include them in the Answer Document.

• Question to ponder C: Which one of the three ε is the best? Could you suggest a general way to tune the parameters MinPts and ε for DBSCAN?

Step 3: (7’, L3) Implementing B-Cubed evaluation.

In this step, you need to implement B-Cubed evaluation:

• Name: bcubed.java, or bcubed.py, or bcubed.cpp

• Input: two parameters:

– Clustering output file, e.g., /home/mike/mike_assign5/mike_assign5_results/step1.txt

– Ground truth file, e.g., /home/mike/mike_assign5/mike_assign5_data/truth.txt

• Output: two lines on stdout

– The first line contains the precision

– The second line contains the recall
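For reference, the per-point definition of B-Cubed can be sketched as follows: for each point, precision is the fraction of its output cluster that shares its ground-truth cluster, recall swaps the roles, and both are averaged over all points. This sketch uses plain label lists and ignores the assignment's outlier convention (c = 0), which your implementation must still handle:

```python
def bcubed(output, truth):
    """B-Cubed precision and recall averaged over all points.
    output[i] and truth[i] are the cluster ids of point i."""
    n = len(output)
    precision = recall = 0.0
    for i in range(n):
        # Indices sharing i's output cluster / ground-truth cluster (incl. i).
        same_out = {j for j in range(n) if output[j] == output[i]}
        same_true = {j for j in range(n) if truth[j] == truth[i]}
        overlap = len(same_out & same_true)
        precision += overlap / len(same_out)
        recall += overlap / len(same_true)
    return precision / n, recall / n

p, r = bcubed(output=[1, 1, 2, 2], truth=[1, 1, 1, 2])
print(p, r)  # 0.75 and 2/3
```

This O(n²) loop is fine at this data size; precomputing cluster membership lists would make it linear in practice.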

Requirements:

• Put your B-Cubed source code in folder NetId_assign5_codes

• Run B-Cubed with the outputs from Steps 1 and 2. Create a table to report the precision and recall corresponding to each of the outputs.

• Question to ponder D: Can you suggest a way to combine B-Cubed precision and recall into a single measure so that it would be easier to compare different results?

Step 4: (10’, L2) Doing cluster analysis with Weka.

In class, we demonstrated how to do this. You can watch the online lecture video to review the process. We also provide the procedure below.

(a) Open Weka.

(b) Choose Explorer.

(c) Click on the Preprocess tab.

(d) Click on Open file, choose data.arff, click on Choose, and wait until Weka finishes loading the data.

(e) Click on the Cluster tab. Here you can find common algorithms for cluster analysis, such as OPTICS, DBSCAN, etc.

Requirements:

• Run SimpleKMeans with different numbers of clusters. Report the best number of clusters, and include the plot showing the corresponding clustering result. You can generate the plot by right-clicking on the entry in the Result list and choosing Visualize cluster assignments.

• Run DBSCAN with the best MinPts and ε from Step 2. Show the plot for the corresponding clustering result.

• Question to ponder E: Compare the results from k-Means and DBSCAN. Is k-Means or DBSCAN more suitable for the provided dataset? Why?