STATS/CSE 780 Homework Assignment 3



Find a dataset that is suitable for cluster analysis using the methods covered in class. Possible
places to search include 1) Google Dataset Search, 2) Kaggle Datasets, and 3) the UCI Machine
Learning Repository.

Do not use datasets that were used in class, that appear in the textbooks used in this course,
that ship with R or Python packages, or that were collected for your own research and are not
publicly available.

The dataset must have at least five variables.

1. Briefly describe your chosen dataset and clearly explain where it was sourced.

2. Carry out a thorough cluster analysis of your chosen dataset using:
a. agglomerative or divisive hierarchical clustering;
b. k-means clustering; and
c. k-means or hierarchical clustering after principal component analysis.
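
As a rough illustration of part (b), the sketch below standardizes a made-up five-variable table and runs a basic k-means (Lloyd's algorithm) for several values of k, reporting the total within-cluster sum of squares (WSS) that an elbow plot would use. It is written in plain Python purely for illustration; in a report, R's built-in kmeans would be the usual choice. The names standardize, farthest_point_init, kmeans, and toy are hypothetical, and the deterministic farthest-point start is an assumption made for reproducibility, not something the assignment prescribes.

```python
import math


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Population standard deviation; a constant column falls back to 1.0.
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(rows, k):
    """Assumed deterministic start: the first row, then repeatedly the row
    farthest from every center chosen so far."""
    centers = [list(rows[0])]
    while len(centers) < k:
        nxt = max(rows, key=lambda r: min(sq_dist(r, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans(rows, k, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centers, total within-cluster SS)."""
    centers = farthest_point_init(rows, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centers[j]))
                      for r in rows]
        if new_labels == labels:  # no row changed cluster: converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
    return labels, centers, wss


# Made-up data: six observations, five variables, two obvious groups.
toy = [
    [1.0, 2.0, 1.5, 0.5, 1.0],
    [1.2, 1.8, 1.4, 0.6, 1.1],
    [0.9, 2.1, 1.6, 0.4, 0.9],
    [8.0, 9.0, 8.5, 7.5, 8.0],
    [8.2, 8.8, 8.4, 7.6, 8.1],
    [7.9, 9.1, 8.6, 7.4, 7.9],
]
scaled = standardize(toy)
results = {k: kmeans(scaled, k) for k in (1, 2, 3)}
for k, (labels, _, wss) in results.items():
    print(k, round(wss, 3), labels)
```

A large drop in WSS from k = 1 to k = 2 followed by little further improvement is the pattern the elbow heuristic looks for. On real data, several random starts are normally used instead of a single deterministic one.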

3. Your report must include a comparison of the clustering results obtained using these methods.
Provide a clear and concise description of the results. Clearly state what conclusions can be drawn
from your analysis in the context of your chosen dataset.
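
One possible way to quantify such a comparison is the adjusted Rand index, which measures how well two partitions of the same observations agree, regardless of how the clusters are numbered. The assignment does not prescribe a measure, so this is only one option. The sketch below is a minimal plain-Python version; the function name adjusted_rand_index and the three label vectors are hypothetical and made up for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same observations."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:  # fewer than two observations: nothing to compare
        return 1.0
    # Pairs placed together in both partitions (contingency-table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:  # e.g. both partitions are a single cluster
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Made-up label vectors standing in for the three analyses.
hier_labels = [0, 0, 0, 1, 1, 1]    # e.g. from cutting a dendrogram
kmeans_labels = [1, 1, 1, 0, 0, 0]  # same partition, different cluster numbers
pca_labels = [0, 0, 1, 1, 1, 1]     # one observation assigned differently

ari_same = adjusted_rand_index(hier_labels, kmeans_labels)
ari_diff = adjusted_rand_index(hier_labels, pca_labels)
print(ari_same, ari_diff)
```

A value of 1 means the two partitions are identical up to relabeling, while values near 0 indicate agreement no better than chance. A cross-tabulation of the two label vectors is a useful companion, since it shows which clusters disagree.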

Grading scheme
The grading scheme for all questions is given below. Points for each item are shown in brackets.
1. Source of the dataset [1]
Describe your dataset (data types, summaries, outliers, missing-value
analysis, etc.) [3]
State the problem to be addressed, or explain why the dataset is suitable
for clustering [2]
Describe any data or statistical transformation, or other preprocessing,
applied for cluster analysis and principal component analysis [2]

2. a. Apply hierarchical clustering, describe how the number of clusters was
chosen, and evaluate the hierarchical clustering [3]
b. Apply k-means clustering, describe how the number of clusters was chosen,
and evaluate the k-means clustering [3]
c. Apply PCA, choose the number of PCs (and say why), apply k-means
or hierarchical clustering to the PCs, describe how the number of
clusters was chosen, and evaluate the clustering results [5]

3. At least two comparisons of the clustering results obtained using these
methods [2]
References: The reference list starts on a new page; references are
appropriate and listed in the report [2]
Supplementary material: Starts on a new page; code is readable and stays
within the margins; the R code and output for the questions are
presented [3]
The maximum score for this assignment is 26 points, which will be converted to a percentage out of 100.