## Description

1 Theory

I: Parameterized Cluster Distance 10 points

Consider a dataset clustered into K clusters, where the i

th cluster Ci consists of ni data points

(1 ≤ i ≤ K). Further, assume that there is some notion of a distance measure between two clusters

Ci and Cj , denoted by di,j . Generally, if Ci and Cj are merged to form a new cluster Ci∪j , then its

distance to some other cluster Ck may not have a simple relation to di,k and dk,j . However, consider

the equation, expressing it as a parameterized linear combination of the pre-merge weights:

dk,i∪j = αidi,k + αjdk,j + βdi,j + γ |di,k − dk,j |

Show that if β = 0, then there exist real values of αi

, αj , and γ such that1

dk,i∪j = min(di,k, dk,j ) , and

dk,i∪j = max(di,k, dk,j )

II: Error Probability for k-NN 10 points

Consider the special case where k = 1. Further, assume that we have two classes C1 and C2 with

equal priors, i.e., P(C1) = P(C2) = 0.5. Finally, you are given that the data is obtained from the

two distributions

P(x|C1) = (

2x if 0 ≤ x ≤ 1

0 otherwise

P(x|C2) = (

2(1 − x) if 0 ≤ x ≤ 1

0 otherwise

(a) What is the discriminant function (i.e., the function that decides whether a test data point

belongs to C1 or C2) for this classification?

(b) What is the classification error? [Hint: You need to find the total probability of errors.]

III: Simpler Computation in k-NN 10 points

Computing distances in high dimensions may sometimes be prohibively expensive. An often-used

technique is to do some distance-related computations in lower dimensions as preliminary filtering.

1Of course, both will never be simultaneously true, so you have to find different parameter values for each.

1

CSE 353 Homework III Submission Deadline: May 12, 2017 (Sunday)

(a) Given x = {x1, . . . , xd} and y = {y1, . . . , yd}, two vectors in a d-dimensional space, use the

following inequality called Jensen’s inequality

f(E(z) ≤ E(f(z)) for all convex functions f

to show that

”

1

√

d

X

d

i=1

xi −

1

√

d

X

d

i=1

yi

#2

≤

X

d

i=1

(xi − yi)

2

(b) Explain what the above inequality means in terms of distance computations, and discuss how

this property can be used to reduce the computational complexity of the process of finding

the nearest neighbor of a test data point. This discussion need not be a formal proof.

IV: Linear Separability 5 points

Suppose you have a dataset where the decision boundary is the d-dimensional hyperellipse given by

X

d

i=1

x

2

i

a

2

i

= 1

Find the transformation that maps the input space into the feature space such that the positive

and negative examples become linearly separable in the feature space.

2 Programming & Experiments

In this third and final assignment of the semester, you are not required to write your own code from

scratch for any algorithm. Instead, you should use two machine learning libraries to use support

vector machines (SVM). The goal is to explore SVM for a specific learning task, and learn a model

that does not overfit. Your model will be tested on a small dataset that will not be provided to

you (we are effectively splitting a part of the original dataset, and keeping it for separate testing).

2.1 Dataset

The training data sample is available for download (link provided on our course web page). It is a

single .csv file, 22.4 MB. Please right-click and download. Otherwise your browser may

crash while trying to open the file in preview.

2.2 The learning task

I have never played DOTA 2, but I know people who know people who have. It is a game between

two teams of 5 players each. Each team chooses 10 “heroes” out of a set of 113. The data provided

is a single .csv file where each line corresponds to a single game, and consists of

Index 0: Team won the game (1 or -1)

Index 1: Cluster ID (related to location)

Index 2: Game mode

Index 3: Game type (e.g. ”ranked”)

CSE 353 Homework III Submission Deadline: May 12, 2017 (Sunday)

Index 4 – end: Each element is an indicator for a hero. A value of 1 indicates that a player

from team ’1’ played as that hero, and a value of -1 indicated that a player from the other

team (i.e., the ’-1’ team) played as that hero (a hero can be selected by only one player each

game). So, each row has five ’1’ and five ’-1’ values. All other values are 0, indicating that no

player from either team chose that specific hero2

.

Your task is to build a model that can distinguish between the winning and losing teams, as specified

by index-0. This consists of the following:

V: LibSVM 30 points

Read up on how to install and run LibSVM here: https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

Convert the given dataset into LibSVM’s format, and perform 5-fold cross validation to observe the

average accuracy across the folds. Your results are going to statistically insignificant if the accuracy

is fluctuating more than 10%, so try to play around with the various parameters and repeat the

experiments until you observe reasonably steady results. LibSVM allows you to save the model

as a file, so once you obtain the best model, you should retain that file. This model file must be

submitted along with your code.

For this first set of experiments, you may NOT use any other library. LibSVM is very well known,

and there are certainly other libraries that provide functions to convert a dataset into LibSVM’s

format. This conversion must be your own code.

VI: k-means clustering 25 points

The second set of experiments requires you to either use Weka 33

(if you want to use Java) or

scikit-learn4

(if you want to use Python). This time, you have to use k-means clustering to

create two clusters. That is, perform unsupervised learning instead of a supervised classification.

You will have to write some wrapper code to use these libraries, but there is no need to implement

the clustering algorithm on your own. Once the clustering is done, estimate its performance by

computing the following:

(a) Percentage split of team ‘1’ wins across two clusters (e.g., x% in cluster-1 and y% in cluster-2

(and similarly, also for team ‘-1’).

VII: Final Report 10 points

Your final report must contain the following details in some easy-to-understand manner:

• Instructions to run your LibSVM code.

• The set of parameter values you used to obtain your SVM model.

• Your SVM model’s performance in 5-fold cross validation. This must mention the accuracy

for each fold separately, as well as the final accuracy (i.e., the average across all 5 folds).

• A short paragraph describing any time-related issues you ran into while trying our different

parameters with LibSVM.

• Instructions to run your k-means code.

• The clustering performance, as described in Sec. 2.2.

• The set of parameter values you used to obtain the above performance.

2

https://github.com/kronusme/dota2-api/tree/master/data has the details about the clusters, game modes,

game types, and the heroes. They are for human understanding/interest only, and have no relevance to the algorithm.

3

https://www.cs.waikato.ac.nz/~ml/weka/

4

https://scikit-learn.org/stable/

CSE 353 Homework III Submission Deadline: May 12, 2017 (Sunday)

3 What to submit?

A single .zip file containing the following:

1. Your code that creates the input data for LibSVM to run.

2. The model file provided by LibSVM.

3. Your Python (or Java) code to run k-means clustering using scikit-learn (or Weka).

4. Your final report (at most 2 pages using 11 pt font and 1.5 line spacing) as a PDF document.

5. Your solutions to the “theory” component of this assignment as another PDF document.