# CS-677 Assignment: SVM & Clustering Assignment 6

## Description

In this assignment, you will implement k-means clustering and
use it to construct a multi-label classifier to determine the variety of wheat. For the dataset, we use the "seeds" dataset from the
UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/seeds
Dataset Description: From the website: "… The examined group comprised kernels belonging to three different
varieties of wheat: Kama, Rosa and Canadian, 70 elements
each, randomly selected for the experiment…"
There are 7 continuous features F = {f1, …, f7} and a
class label L (Kama: 1, Rosa: 2, Canadian: 3).
1. f1: area A
2. f2: perimeter P
3. f3: compactness C = 4πA/P²
4. f4: length of kernel
5. f5: width of kernel
6. f6: asymmetry coefficient
7. f7: length of kernel groove
BU MET CS-677: Data Science With Python, v.2.0 CS-677 Assignment: SVM & Clustering
8. L: class (Kama: 1, Rosa: 2, Canadian: 3)
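The compactness feature f3 is derived from area and perimeter, so it can be checked against the other two columns. The sketch below is only an illustration of the formula C = 4πA/P²; the function name `compactness` and the test shapes are assumptions, not part of the dataset or the assignment.

```python
import math


def compactness(area: float, perimeter: float) -> float:
    """Return C = 4*pi*A / P**2, which equals 1 for a perfect circle."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4.0 * math.pi * area / perimeter ** 2


# Sanity check on shapes whose answer is known in closed form.
r = 2.0  # arbitrary radius chosen for illustration
circle_c = compactness(math.pi * r ** 2, 2.0 * math.pi * r)  # circle: C = 1
square_c = compactness(1.0, 4.0)  # unit square: A = 1, P = 4, so C = pi / 4
```

Any shape other than a circle gives C below 1, so values of f3 close to 1 indicate rounder kernels.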
For the first question, you will choose 2 class labels as follows.
Take the last digit of your BU ID and divide it by 3. Choose the
following 2 classes depending on the remainder R:
1. R = 0: class L = 1 (negative) and L = 2 (positive)
2. R = 1: class L = 2 (negative) and L = 3 (positive)
3. R = 2: class L = 1 (negative) and L = 3 (positive)
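The remainder rule above can be written as a small lookup. This is a minimal sketch; the function name `choose_classes` and the example digit 7 are hypothetical and only illustrate the mapping.

```python
def choose_classes(last_digit: int) -> tuple[int, int]:
    """Map the last ID digit to a (negative, positive) pair of class labels."""
    if not 0 <= last_digit <= 9:
        raise ValueError("last_digit must be a single digit 0-9")
    # Keys are the remainder R; values are (negative class, positive class).
    pairs = {0: (1, 2), 1: (2, 3), 2: (1, 3)}
    return pairs[last_digit % 3]


# Hypothetical example: a last digit of 7 gives R = 7 % 3 = 1.
negative_class, positive_class = choose_classes(7)
```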
Question 1: Take the subset of the dataset containing your
two class labels. You will use random 50/50 splits for training
and testing data.
1. implement a linear kernel SVM. What is your accuracy and
confusion matrix?
2. implement a Gaussian kernel SVM. What is your accuracy
and confusion matrix?
3. implement a polynomial kernel SVM of degree 3. What is
your accuracy and confusion matrix?
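One possible way to set up the three kernels for Question 1 is sketched below, assuming scikit-learn is the library in use (the assignment does not name one). The data here is a synthetic stand-in generated with NumPy, not the seeds dataset, and the 0/1 labels stand for the negative and positive classes chosen earlier.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the two-class subset (NOT the real seeds data):
# 70 points per class, 7 features, drawn around two different means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(70, 7)),
               rng.normal(2.0, 1.0, size=(70, 7))])
y = np.array([0] * 70 + [1] * 70)  # 0 = negative class, 1 = positive class

# Random 50/50 split; stratify keeps both classes balanced in each half.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

models = {
    "linear SVM": SVC(kernel="linear"),
    "Gaussian SVM": SVC(kernel="rbf"),  # "rbf" is scikit-learn's Gaussian kernel
    "polynomial SVM": SVC(kernel="poly", degree=3),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # With labels=[0, 1] the matrix is laid out as [[TN, FP], [FN, TP]].
    results[name] = (accuracy_score(y_test, y_pred),
                     confusion_matrix(y_test, y_pred, labels=[0, 1]))
```

On the real data, `X` and `y` would be replaced by the seven feature columns and the two selected class labels. Because SVMs are sensitive to feature scale, standardizing the features first may change the results.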
Question 2: Pick any classifier for supervised learning
(e.g. kNN, logistic regression, Naive Bayes, etc).
1. apply this classifier to your dataset. What is your accuracy
and confusion matrix?
2. summarize your findings in a table below and discuss your
results
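
| Model | TP | FP | TN | FN | accuracy | TPR | TNR |
|---|---|---|---|---|---|---|---|
| linear SVM | | | | | | | |
| Gaussian SVM | | | | | | | |
| polynomial SVM | | | | | | | |
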
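The last three columns of the table follow directly from the four confusion-matrix counts. The sketch below shows the arithmetic; the function name `summarize` and the example counts are hypothetical and are not results from the seeds data.

```python
def summarize(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute accuracy, TPR (sensitivity) and TNR (specificity) from counts."""
    total = tp + fp + tn + fn
    if total == 0 or tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return {
        "accuracy": (tp + tn) / total,
        "TPR": tp / (tp + fn),  # fraction of true positives that were found
        "TNR": tn / (tn + fp),  # fraction of true negatives that were found
    }


# Hypothetical counts for illustration only (35 positives, 35 negatives).
row = summarize(tp=30, fp=5, tn=30, fn=5)
```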
Question 3: Take the original dataset with all 3 class labels.
1. for k = 1, 2, …, 8 use k-means clustering with random
initialization and defaults. Compute and plot distortion vs.
k. Use the "knee" method to find the best k.
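A sketch of the distortion computation, assuming scikit-learn's `KMeans` is the intended tool and taking "distortion" to mean its `inertia_` attribute (the sum of squared distances from each point to its nearest centroid). The data is again a synthetic three-blob stand-in, not the seeds dataset, and `n_init=10` is set explicitly because the library default differs between versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in (NOT the seeds data): three blobs of 70 points, 7 features.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 7, [4.0] * 7, [8.0] * 7])
X = np.vstack([rng.normal(c, 1.0, size=(70, 7)) for c in centers])

ks = list(range(1, 9))
distortions = []
for k in ks:
    km = KMeans(n_clusters=k, init="random", n_init=10, random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)  # sum of squared distances to centroids
```

Plotting `distortions` against `ks` (for example with matplotlib) gives the curve used for the "knee" method: the best k is where the curve stops dropping sharply and flattens out.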
2. re-run your clustering with the best k clusters. Pick two
features fi and fj at random (using Python, of course) and
plot your datapoints (a different color for each class, plus the
centroids) using fi and fj as axes. Examine your plot. Are
there any interesting patterns?
3. for each cluster, assign a cluster label based on the majority
class of items. For example, if cluster Ci contains 45% of
class 1 ("Kama" wheat), 35% of class 2 ("Rosa" wheat) and
20% of class 3 ("Canadian" wheat), then this cluster Ci is
assigned label 1. For each cluster, print out its centroid and
assigned label.
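The majority-class rule can be sketched with a counter per cluster. The function name `majority_labels` and the toy inputs are hypothetical; on real data, `cluster_ids` would be the cluster index assigned to each point and `class_labels` the true class L of each point.

```python
from collections import Counter


def majority_labels(cluster_ids, class_labels):
    """Return {cluster id: most frequent true class among its members}."""
    counts = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts.setdefault(cluster, Counter())[label] += 1
    # most_common(1) returns the top (label, count) pair; ties are broken
    # by whichever label was seen first, which is an arbitrary choice.
    return {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}


# Toy example: cluster 0 is mostly class 1, cluster 1 is mostly class 2.
toy_clusters = [0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_classes = [1, 1, 1, 2, 3, 2, 2, 3, 2]
assigned = majority_labels(toy_clusters, toy_classes)
```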
4. consider the following multi-label classifier. Take the largest
cluster with label 1, the largest with label 2 and the largest
with label 3. Let us call these
clusters A, B and C. For each of these clusters, you know
their means (centroids): µ(A), µ(B) and µ(C). We now
consider the following procedure (conceptually analogous
to nearest neighbor with k = 1): for every point x in your
dataset, assign a label based on the label of the nearest
(using Euclidean distance) centroid of A, B or C. In other
words, if x is closest to the centroid of cluster A, you assign it
label 1. If x is closest to the centroid of cluster B, you assign it
label 2. Finally, if x is closest to the centroid of cluster C, you
assign it label 3. What is the overall accuracy of this new
classifier when applied to the complete dataset?
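The nearest-centroid procedure can be sketched in a few lines. The function names and the 2-D toy centroids below are hypothetical; on real data the centroids would be the 7-dimensional means µ(A), µ(B) and µ(C) keyed by their assigned labels.

```python
import math


def nearest_centroid_label(x, centroids):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


def accuracy(points, true_labels, centroids):
    """Fraction of points whose nearest-centroid label matches the true label."""
    hits = sum(nearest_centroid_label(x, centroids) == y
               for x, y in zip(points, true_labels))
    return hits / len(points)


# Toy 2-D example: label -> centroid (stand-ins for mu(A), mu(B), mu(C)).
toy_centroids = {1: (0.0, 0.0), 2: (5.0, 0.0), 3: (0.0, 5.0)}
toy_points = [(1.0, 1.0), (4.0, 1.0), (1.0, 4.0), (3.0, 0.5)]
toy_truth = [1, 2, 3, 1]  # the last point is closer to centroid 2, so it is missed
toy_accuracy = accuracy(toy_points, toy_truth, toy_centroids)
```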
5. take this new classifier and consider the same two labels that
you used for SVM. What is your accuracy and confusion