DDA3020 Homework 4


1 Written Problems (50 points)

1. (10 points) Derive the M-step updates for maximum likelihood estimation of a GMM (Gaussian
Mixture Model).

2. (20 points) Consider a set of binary samples indexed by i = 1, …, m+ for the positive class and
j = 1, …, m− for the negative class.
• Introduce the calculation principles of precision, recall, ROC, and AUC respectively.
• When the distribution of positive and negative samples is uneven, which is more
affected, the ROC curve or the PR curve? Please explain mathematically.

3. (20 points) Consider the following 10 data points: X = {(1, 0, 2, −3, −2), (0, 1, −3, −2, −3),
(1, 2, 1, 3, −2), (−1, 1, 2, 3, −1), (1, 0, 1, −1, 1), (2, 3, −1, 1, −2), (−2, 3, −3, 2, 3), (−2, −2, 2, 3, −2),
(−2, −2, 1, −3, −3), (−3, 2, 0, −1, −2)}. Compute the unit-length principal components of X
and choose two of them for PCA, then calculate the projection of each data point onto these two principal components. You may use Python or MATLAB to obtain the eigenvectors and eigenvalues.
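
The problem allows Python or MATLAB for the eigendecomposition. Below is a minimal numpy sketch of one way to carry it out; the use of numpy and of np.cov's 1/(n−1) normalization are our own choices, not requirements of the problem.

```python
import numpy as np

# The 10 data points from the problem (one row per sample, 5 features each).
X = np.array([
    [ 1,  0,  2, -3, -2],
    [ 0,  1, -3, -2, -3],
    [ 1,  2,  1,  3, -2],
    [-1,  1,  2,  3, -1],
    [ 1,  0,  1, -1,  1],
    [ 2,  3, -1,  1, -2],
    [-2,  3, -3,  2,  3],
    [-2, -2,  2,  3, -2],
    [-2, -2,  1, -3, -3],
    [-3,  2,  0, -1, -2],
], dtype=float)

# Center the data, form the covariance matrix, and eigendecompose it.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are unit-length

# Sort by decreasing eigenvalue and keep the top two principal components.
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]                # 5 x 2 projection matrix

# Project each (centered) data point onto the two chosen components.
Z = Xc @ W                               # 10 x 2 matrix of projections
print(eigvals[order])
print(Z)
```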

2 Programming (50 points + 10 points bonus)

Task Description:
In this programming task, you are required to
1. Implement 2 clustering algorithms from scratch (i.e., no third-party or off-the-shelf packages or libraries are allowed): (1) K-means; (2) Gaussian Mixture Models (implement
GMMs based on the EM algorithm discussed in class). A minimal K-means skeleton is sketched after this list for reference.

2. Implement 3 evaluation metrics from scratch (i.e., without calling off-the-shelf packages) to evaluate the performance of the above clustering algorithms: the Silhouette Coefficient, the Rand
Index, and Normalized Mutual Information (NMI). The Silhouette Coefficient and the Rand
Index have been introduced in the course; you are additionally required to implement a new metric,
Normalized Mutual Information, which is introduced briefly later in this document.

3. Complete Task 2.1 and Task 2.2, then save and submit your work according to the
requirements in the Submission Format section below.

4. (6 points bonus) Students who successfully implement accelerated K-means with the
triangle inequality (refer to https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf)
can receive a bonus of 6 points (please place it in a separate file; the total
score will not exceed the overall score of this assignment).

5. (4 points bonus) We encourage conducting additional experimental analyses of the models,
such as testing the sensitivity of each algorithm to clustering initialization (e.g., running a
clustering algorithm with random initialization multiple times and calculating the variance
of the evaluation metrics). Students who provide comprehensive analyses in a well-organized,
well-presented report can be awarded an additional bonus of 4 points.
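
As a reference point for item 1, here is a minimal from-scratch K-means sketch using only numpy. The random initialization, Euclidean distance, fixed iteration cap, and function name are our own choices; the GMM EM loop follows the same assign-then-update pattern, with soft responsibilities in place of hard assignments.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means with random initialization; a minimal sketch only.

    X: (n_samples, n_features) array; k: number of clusters.
    Returns an array of cluster labels of length n_samples.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels
```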

Task 2.1
Implement K-means and GMMs, as well as the 3 evaluation metrics: Silhouette Coefficient, Rand
Index, and NMI.

Task 2.2
Plot a figure that illustrates the values of the Silhouette Coefficient and the Rand Index for different
values of k.

Choose the optimal number of clusters k based on the Silhouette Coefficient and the Rand Index. (Please
note that for this assignment we do not consider the more complex scenario of combining them to find
a single optimal k; in other words, we expect you to apply the Silhouette Coefficient and the Rand Index
separately, determining the optimal cluster number k independently for each metric.)

Then, calculate and output the NMI value at the k selected by each of the two metrics (Silhouette
Coefficient and Rand Index).
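
A minimal sketch of how Task 2.2 could be organized is shown below. The names kmeans, silhouette_coefficient, and rand_index stand in for your own Task 2.1 implementations, and X_vowel / y_vowel for the loaded Vowel features and labels; all of these names, as well as the candidate range k = 2, …, 6, are assumptions for illustration only.

```python
import matplotlib.pyplot as plt

# Assumed names: kmeans, silhouette_coefficient, rand_index are your own
# Task 2.1 implementations; X_vowel, y_vowel are the loaded Vowel data.
ks = list(range(2, 7))                       # at least 5 candidate values of k
sil_scores, ri_scores = [], []
for k in ks:
    labels = kmeans(X_vowel, k)              # cluster assignments for this k
    sil_scores.append(silhouette_coefficient(X_vowel, labels))
    ri_scores.append(rand_index(y_vowel, labels))

# Plot both metrics against k in one figure.
plt.plot(ks, sil_scores, marker="o", label="Silhouette Coefficient")
plt.plot(ks, ri_scores, marker="s", label="Rand Index")
plt.xlabel("number of clusters k")
plt.ylabel("metric value")
plt.legend()
plt.show()

# Choose the optimal k independently for each metric.
best_k_sil = ks[sil_scores.index(max(sil_scores))]
best_k_ri = ks[ri_scores.index(max(ri_scores))]
print("best k (Silhouette):", best_k_sil, " best k (Rand Index):", best_k_ri)
```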

Introduction to Normalized Mutual Information (NMI): Normalized Mutual Information
(NMI) is a metric commonly used in information theory to quantify the similarity or dependence
between two sets of data, often denoted as X and Y. It is particularly useful in clustering and
classification tasks to assess the quality of partitioning or grouping of data points.

The formula for calculating NMI involves three key components: the Mutual Information I(X; Y),
the Entropy of X, H(X), and the Entropy of Y, H(Y). The NMI is computed using the following formula:

$$\mathrm{NMI}(X, Y) = \frac{I(X; Y)}{\sqrt{H(X) \cdot H(Y)}},$$

where I(X; Y) is the Mutual Information between X and Y,

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \cdot \log \frac{P(x, y)}{P(x) \cdot P(y)},$$

and H(X) and H(Y) are the Entropies of X and Y respectively,

$$H(X) = -\sum_{x \in X} P(x) \cdot \log\big(P(x)\big).$$
In this experiment, you are required to implement NMI and use it to evaluate the results produced by
your implemented clustering algorithms. Specifically, given the true class label set Y and the
label set X generated by a clustering algorithm, the NMI formula allows you to assess the
performance of the clustering algorithm on the data.

Datasets
We will be working with 2 datasets in this assignment:
1. seeds: 210 instances, 7 features, and 3 classes.
2. Vowel: 990 instances, 10 features, and 11 classes.

Submission Format:
1. For Task 2.1, you are required to submit a code file (either a Python script or a Jupyter
Notebook). The code should include implementations of K-means and GMMs, as
well as implementations of the Silhouette Coefficient, Rand Index, and NMI. The
execution of the code should output the value of the Silhouette Coefficient for clustering numbers
k = 2 and k = 3, as well as the Rand Index and NMI for clustering numbers k ranging from 1 to
3, on the 2 datasets.

2. For Task 2.2, you need to submit a Jupyter Notebook file. The code should encompass the
procedures outlined in task 2.2. The experiment is to be conducted solely on the Vowel
dataset. You are free to define the range for selecting the k values, but please ensure that it
includes at least 5 values for k.

3. For students seeking bonus points, you can keep the bonus-related scripts (accelerated K-means
with the triangle inequality and additional experimental analyses) in a separate Jupyter
Notebook or .py file.

Please name your code files A4_yourstudentID_2.1 and A4_yourstudentID_2.2
(and A4_yourstudentID_bonus for those who complete the bonus tasks), where yourstudentID is
your student ID number. For example, if your ID is 230056789 and you submit 2 Jupyter Notebooks for
Task 2.1 and Task 2.2, then you should submit A4_230056789_2.1.ipynb and A4_230056789_2.2.ipynb.
And again, please note that readable code is essential. If your code lacks comments and is
excessively intricate, making it difficult to read, you may lose points.