Category: DSCC 465

1) [20 points] Download the dataset called ‘country_information.xlsx’ that can be found under

the ‘Data’ tab on BlackBoard. Do the following:

a. [10 points] Provide a summary of what the dataset is about (around 100 words) by

checking the variable names.

b. [10 points] Excluding the ‘country’ column, apply 0-1 normalization on the numeric

columns. Save the resulting dataset as:

‘country_information_normalized.xlsx’ [Note: Do not forget to add the ‘country’ column

to the normalized dataset. For normalization, you can use a package.]
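
As a rough illustration of part b, 0-1 (min-max) normalization maps each numeric column to (x - min) / (max - min). The sketch below uses plain pandas arithmetic on a small made-up table; the helper name `normalize_01` and the column names `gdp` and `life_expec` are hypothetical and are not taken from the actual dataset. A package such as sklearn's `MinMaxScaler` would be an equally valid choice.

```python
import pandas as pd

def normalize_01(df, id_col="country"):
    """Rescale every numeric column except id_col to the [0, 1] range."""
    numeric_cols = df.drop(columns=[id_col]).select_dtypes(include="number").columns
    out = df.copy()
    col_min = df[numeric_cols].min()
    col_range = df[numeric_cols].max() - col_min
    # Guard against constant columns (max == min), which would divide by zero.
    col_range = col_range.replace(0, 1)
    out[numeric_cols] = (df[numeric_cols] - col_min) / col_range
    return out

# Toy stand-in for the real file; the column names here are hypothetical.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gdp": [10.0, 20.0, 30.0],
    "life_expec": [60.0, 70.0, 80.0],
})
normalized = normalize_01(toy)  # the 'country' column is carried along unchanged
```

With the real data, the same helper could presumably be wrapped between `pd.read_excel('country_information.xlsx')` and `normalized.to_excel('country_information_normalized.xlsx', index=False)`.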

2) [20 points] Code the kmeans++ algorithm from scratch. For more information about the

individual steps of the algorithm, please check here:

https://en.wikipedia.org/wiki/K-means%2B%2B.

As input, your algorithm should take a numpy matrix or a pandas dataframe and a k value

that denotes the expected number of clusters. The output needs to be the labels associated

with feature vectors coming from your dataset.

Note: You are welcome to use pre-packaged algorithms to calculate distances and means. If

you need to pick a point randomly, please do the following:

i. Import the random package of Python.

ii. Set seed to 265 by running the following line: random.seed(265) [This should be

done at the very beginning of your code file, after importing the packages.]

iii. Run the following line: random.randrange(0,len(name_of_your_dataset),1).

Use the resulting number as the index number for the data point that should be

randomly picked in different stages of the kmeans++ algorithm.
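
The steps described in Q2 can be sketched roughly as follows. This is one possible reading of the algorithm, not a reference solution: the names `kmeans_pp` and `max_iter` are hypothetical, and using `random.choices` for the distance-weighted draw is an assumption, since the note above only prescribes `random.randrange` for picking an index.

```python
import random
import numpy as np

random.seed(265)  # set once, right after the imports, as the note requires

def kmeans_pp(data, k, max_iter=100):
    """Return one cluster label per row of data (numpy array or DataFrame)."""
    X = np.asarray(data, dtype=float)
    n = len(X)

    # Step 1: pick the first center uniformly at random with the prescribed call.
    centers = [X[random.randrange(0, n, 1)]]

    # Step 2: draw each further center with probability proportional to the
    # squared distance from a point to its nearest already-chosen center.
    while len(centers) < k:
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # ASSUMPTION: random.choices performs the weighted draw; it raises
        # ValueError if every weight is zero (fewer distinct points than k).
        idx = random.choices(range(n), weights=d2.tolist(), k=1)[0]
        centers.append(X[idx])
    centers = np.array(centers)

    # Step 3: standard k-means (Lloyd) iterations starting from those centers.
    labels = np.full(n, -1)
    for _ in range(max_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

# Toy check on two well-separated groups of points (made-up data).
toy = np.array([[0, 0], [0, 0.1], [0.1, 0], [5, 5], [5, 5.1], [5.1, 5]])
toy_labels = kmeans_pp(toy, 2)
```

Because the seed is fixed before the first random call, repeated runs of the whole script should produce the same initialization.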

For the remainder of the analysis, use the ‘country_information_normalized.xlsx’ dataset you

created in Q1.

3) [20 points] Now, we will test the code we have written in Q2 and apply dimension reduction.

Specifically, do the following:

a. [10 points] Set the random seed to 265 again (to guarantee the same initialization).

Set k = 6. Run your kmeans++ code on the ‘country_information_normalized.xlsx’ dataset

by excluding the ‘country’ column.

Record the labels. Attach the labels to your dataset as a new column named

kmeans_label.

b. [10 points] Excluding the ‘country’ and ‘kmeans_label’ columns, run dimension reduction

(specifically PCA) on your dataset by using sklearn’s PCA function: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.

Spring 2022: Int. to Statistical Machine Learning University of Rochester

[Note: Set n_components = 2 and random_state = 265. Leave the other parameters

at their defaults.] Add the new variables to your dataset as pca_dim_1 and

pca_dim_2.
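
A minimal sketch of part b, assuming a dataframe that already holds the ‘country’ and ‘kmeans_label’ columns. The feature columns `f1` to `f4`, the random toy values, and the placeholder labels are all made up for illustration; only the `PCA` settings and the output column names come from the instructions above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy stand-in for the normalized dataset (hypothetical columns and values).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((8, 4)), columns=["f1", "f2", "f3", "f4"])
df.insert(0, "country", [f"country_{i}" for i in range(8)])
df["kmeans_label"] = [0, 1, 0, 1, 2, 2, 0, 1]  # placeholder labels

# Run PCA on the feature columns only.
features = df.drop(columns=["country", "kmeans_label"])
pca = PCA(n_components=2, random_state=265)  # all other parameters left at defaults
components = pca.fit_transform(features)

# Attach the two projected coordinates under the required names.
df["pca_dim_1"] = components[:, 0]
df["pca_dim_2"] = components[:, 1]
```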

For the next question, use the attached ‘visualization_code.py’ file.

4) [20 points] Now, let’s visualize the results, using the clustering labels to color the data points

and presenting each cluster in a convex hull. Run the code provided to you in the

‘visualization_code.py’ file. Change the name of the dataset where it says […]. Add the visual

to your .pdf submission.

Note: For this exercise, you will need to find and import the packages that the code

requires. The resulting plot should look (somewhat) similar to what is below (but you

will have k = 6).
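
The contents of ‘visualization_code.py’ are not reproduced here, so the following is only a guess at the kind of computation involved. It assumes `scipy.spatial.ConvexHull` is one of the packages to explore; the helper name `hull_outline` and the toy points are hypothetical.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_outline(points):
    """Return the convex hull vertices of a 2-D point set, counterclockwise."""
    pts = np.asarray(points, dtype=float)
    hull = ConvexHull(pts)  # needs at least 3 non-collinear points
    return pts[hull.vertices]

# Four corners of a unit square plus one interior point (made-up data).
square_plus_center = [[0, 0], [1, 0], [1, 1], [0, 1], [0.5, 0.5]]
outline = hull_outline(square_plus_center)  # the interior point is dropped
```

For one cluster's (pca_dim_1, pca_dim_2) points, such an outline could then be shaded with matplotlib, for example `plt.fill(outline[:, 0], outline[:, 1], alpha=0.2)`.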

5) [20 points] Interpret the results (in around 300 words) by answering the following:

a. [5 points] Which countries seem to be similar? Why do you think these countries are

clustered together?

b. [5 points] If you run the kmeans++ algorithm more than once, do you think the results

will change?

c. [5 points] (Subjectively speaking) Do you think this is an accurate clustering of the

countries? Would the results change greatly if we had different social/economic

variables?

d. [5 points] Do you think PCA may have affected the results at all? In other words, if we had

a different number of principal components, would our visual interpretation be different?
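
One way to reason about part d is to check how much of the total variance the first two components retain. The sketch below uses random toy data rather than the country dataset and relies on the `explained_variance_ratio_` attribute of sklearn's fitted `PCA`; the variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 20 samples, 5 features (stand-in for the normalized columns).
rng = np.random.default_rng(0)
X = rng.random((20, 5))

# Fit with all components kept to see how the variance is distributed.
full_pca = PCA(random_state=265).fit(X)
ratios = full_pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Share of the total variance that a 2-D plot would retain.
share_in_two_dims = cumulative[1]
```

If `share_in_two_dims` is well below 1, clusters that are separated in the full feature space may overlap in the two-dimensional plot.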
