## Description

## Question

Find a dataset that is suitable for cluster analysis using the methods covered in class. Some sites for dataset search are (1) Google Dataset Search, (2) Kaggle Datasets, and (3) the UCI Machine Learning Repository.

Do not use datasets that have been used in class, that were collected for your own research (and are not publicly available), that appear in the textbooks used in this course, or that ship with R or Python packages.

The dataset must have at least five variables.

1. Briefly describe your chosen dataset and clearly explain where it was sourced.

2. Carry out a thorough cluster analysis of your chosen dataset using:

a. agglomerative or divisive hierarchical clustering;

b. k-means clustering; and

c. k-means or hierarchical clustering after principal component analysis.

3. Your report must include a comparison of the clustering results obtained using these methods. Provide a clear and concise description of the results. Clearly state what conclusions can be drawn from your analysis in the context of your chosen dataset.
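
As an illustration of the k-means part of task 2, here is a minimal sketch under stated assumptions, not a prescribed solution. It standardizes the variables, runs plain Lloyd's k-means, and records the total within-cluster sum of squares for several values of k (the quantity usually plotted for an elbow-style choice of k). The toy data, the helper names (`standardize`, `init_centers`, `kmeans`), and the deterministic farthest-point initialization are all assumptions made for this example. It is written in Python with the standard library only; in R the corresponding base functions would be `scale()` and `kmeans()`, with `hclust()` and `prcomp()` for the hierarchical and PCA steps.

```python
from statistics import mean, pstdev

# Hypothetical toy data: 6 observations x 5 variables (not a real dataset).
rows = [
    [1.0, 2.0, 1.5, 0.5, 1.0],
    [1.2, 1.8, 1.4, 0.7, 0.9],
    [0.8, 2.2, 1.6, 0.4, 1.1],
    [8.0, 9.0, 7.5, 6.0, 8.2],
    [8.3, 8.7, 7.9, 6.2, 7.9],
    [7.9, 9.2, 7.6, 5.8, 8.1],
]

def standardize(data):
    """Z-score each column so large-scale variables do not dominate distances."""
    cols = list(zip(*data))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in data]

def sq_dist(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centers(data, k):
    """Assumed initialization: start at the first row, then repeatedly add
    the observation farthest from the centers chosen so far."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda r: min(sq_dist(r, c) for c in centers)))
    return [list(c) for c in centers]

def kmeans(data, k, n_iter=100):
    """Lloyd's algorithm; returns (labels, centers, total within-cluster SS)."""
    centers = init_centers(data, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in data]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [mean(c) for c in zip(*members)]
    wss = sum(sq_dist(r, centers[lab]) for r, lab in zip(data, labels))
    return labels, centers, wss

z = standardize(rows)
wss_by_k = {k: kmeans(z, k)[2] for k in (1, 2, 3)}  # values for an elbow plot
labels, centers, wss = kmeans(z, 2)
print(labels)
print(wss_by_k)
```

A sharp drop in the within-cluster sum of squares followed by a flat tail is one common heuristic for choosing k. The same standardized matrix can serve as input to the hierarchical and PCA-based analyses so that the three sets of results are compared on a consistent footing.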

## Grading scheme

The grading scheme for all questions is given below.

| Item | Criterion | Points |
| --- | --- | --- |
| 1 | Source of the dataset | 1 |
| 1 | Describe your dataset (data types, summaries, outliers, missing value analysis, etc.) | 3 |
| 1 | State the problem to be addressed or explain why the dataset is suitable for clustering | 2 |
| 1 | Any data/statistical transformation or any preprocessing for cluster analysis and principal component analysis | 2 |
| 2a | Apply hierarchical clustering, describe how the number of clusters was chosen, evaluate the hierarchical clustering | 3 |
| 2b | Apply k-means clustering, describe how the number of clusters was chosen, evaluate the k-means clustering | 3 |
| 2c | Apply PCA, choose the number of PCs (and say why), apply k-means or hierarchical clustering on the PCs, describe how the number of clusters was chosen, evaluate the clustering results | 5 |
| 3 | At least two comparisons of the clustering results obtained using these methods | 2 |
| References | Reference list starts on a new page; references are appropriate and listed in the report | 2 |
| Supplementary material | Supplementary material starts on a new page; code is readable; all code is within the margins; the R code and the outputs for the questions are presented | 3 |

The maximum score for this assignment is 26 points, which will be converted to a percentage out of 100%.