Week 11 Lab (Hierarchical Clustering) COSC 3337


About The Data
We’ll be using the Credit Card Dataset from Kaggle for this lab, but feel free to follow along with your own dataset. We will develop
a customer segmentation to help define a marketing strategy. The sample dataset summarizes the usage behavior of about 9,000 active
credit card holders during the last 6 months. The file is at a customer level with 18 behavioral variables:
CUST_ID : Identification of credit card holder (categorical)
BALANCE : Balance amount left in their account to make purchases
BALANCE_FREQUENCY : How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not
frequently updated)
PURCHASES : Amount of purchases made from account
ONEOFF_PURCHASES : Maximum purchase amount done in one‑go
INSTALLMENTS_PURCHASES : Amount of purchases done in installments
CASH_ADVANCE : Cash in advance given by the user
PURCHASES_FREQUENCY : How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased,
0 = not frequently purchased)
ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are happening in one‑go (1 = frequently purchased, 0 = not
frequently purchased)
PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 =
not frequently done)
CASH_ADVANCE_FREQUENCY : How frequently the cash advance is being paid
CASH_ADVANCE_TRX : Number of transactions made with “cash in advance”
PURCHASES_TRX : Number of purchase transactions made
CREDIT_LIMIT : Credit card limit for user
PAYMENTS : Amount of payments done by user
MINIMUM_PAYMENTS : Minimum amount of payments made by user
PRC_FULL_PAYMENT : Percent of full payment paid by user
TENURE : Tenure of credit card service for user
About Hierarchical Clustering
Hierarchical clustering is a method which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall
into two types:
Agglomerative : This is a “bottom‑up” approach: each observation starts in its own cluster, and pairs of clusters are merged
as one moves up the hierarchy.
Divisive : This is a “top‑down” approach: all observations start in one cluster, and splits are performed recursively as one
moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner, and the results of hierarchical clustering are usually
presented in a dendrogram.
Agglomerative Clustering
The algorithm goes as follows:
Compute the proximity matrix
Let each data point be a cluster
Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster (or K clusters) remains
For example, say we have six data points {a, b, c, d, e, f}.
In the initial step, we consider all six data points as individual clusters, as shown in the image below.
The first step is to determine which elements to merge into a cluster. Usually, we want to take the two closest elements, according
to the chosen distance. We construct a distance matrix at this stage, where the number in the i‑th row, j‑th column is the distance
between the i‑th and j‑th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and
the distances updated.
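To make the merging process concrete, here is a minimal sketch (separate from the lab code) that builds the proximity matrix for six made‑up 2D points labeled a through f and then lets SciPy perform the merges. The coordinates are invented purely for illustration.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# Made-up 2D coordinates for the six points a..f (illustrative only)
labels = ['a', 'b', 'c', 'd', 'e', 'f']
points = np.array([[1.0, 1.0],   # a
                   [1.5, 1.0],   # b
                   [5.0, 5.0],   # c
                   [5.5, 5.0],   # d
                   [9.0, 1.0],   # e
                   [9.0, 1.5]])  # f

# Pairwise Euclidean distances, shown as a 6x6 proximity matrix
dist_matrix = pd.DataFrame(squareform(pdist(points, metric='euclidean')),
                           index=labels, columns=labels)
print(dist_matrix.round(2))

# Agglomerative clustering on the same points; each merge step is one
# row of the linkage matrix Z (clusters merged, distance, new cluster size)
Z = linkage(points, method='single')
dendrogram(Z, labels=labels)
plt.show()

The smallest off‑diagonal entry in the printed matrix identifies the first pair to be merged, and each subsequent merge corresponds to one row of the linkage matrix Z.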
Computation of Proximity/Distance Matrix
To calculate the proximity between two clusters, we need to define the distance between them. Usually the distance is one of the
following:
The maximum distance between elements of each cluster (also called complete‑linkage clustering)
The minimum distance between elements of each cluster (also called single‑linkage clustering)
The mean distance between elements of each cluster (also called average‑linkage clustering)
The sum of all intra‑cluster variance (the criterion minimized by Ward linkage, which we’ll use later with method=’ward’)
Note: Euclidean, Manhattan, Mahalanobis, etc. distance formulas can be used when calculating distances for each of the above.
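To see how the linkage choice affects the result, here is a small sketch (separate from the lab code) that runs sklearn’s AgglomerativeClustering with each of the four criteria on a synthetic blob dataset; the dataset and parameter values are made up for demonstration.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Small synthetic dataset (three blobs) just to compare linkage criteria
X, _ = make_blobs(n_samples=60, centers=3, random_state=42)

# 'complete' = maximum distance, 'single' = minimum distance,
# 'average' = mean distance, 'ward' = minimizes within-cluster variance
for link in ['complete', 'single', 'average', 'ward']:
    cluster_labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(link, np.bincount(cluster_labels))

On well‑separated blobs the four criteria tend to agree; on noisier data, single linkage in particular is prone to chaining long, thin clusters together.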
Implementation
Similarly to the K-Means lab, we’ll skip the data exploration portion and jump to implementation, but you’re welcome to explore
this data, or your own if working with a different dataset.
Let’s first load the data into a pandas DataFrame. MINIMUM_PAYMENTS and CREDIT_LIMIT are missing a few values, so we’ll go
ahead and fill them with their respective column means. We’ll also go ahead and drop CUST_ID while we’re at it, since it provides
no valuable information.
Checking to see that there are no more missing values:
BALANCE                             0
BALANCE_FREQUENCY                   0
PURCHASES                           0
ONEOFF_PURCHASES                    0
INSTALLMENTS_PURCHASES              0
CASH_ADVANCE                        0
PURCHASES_FREQUENCY                 0
ONEOFF_PURCHASES_FREQUENCY          0
PURCHASES_INSTALLMENTS_FREQUENCY    0
CASH_ADVANCE_FREQUENCY              0
CASH_ADVANCE_TRX                    0
PURCHASES_TRX                       0
CREDIT_LIMIT                        0
PAYMENTS                            0
MINIMUM_PAYMENTS                    0
PRC_FULL_PAYMENT                    0
TENURE                              0
dtype: int64
We’ll now standardize and normalize the data as follows:
Next (optional, but recommended), we will reduce the dimensions of the data using PCA:
         P1        P2
0 -0.489826 -0.679678
1 -0.518791  0.545010
2  0.330885  0.268980
3 -0.482374 -0.092114
4 -0.563289 -0.481914
Next, we’ll use a dendrogram to visualize the linkage of the reduced data (X_principal) using method=’ward’.
Note: The cell below might take a while to run.
We can also determine the optimal number of clusters using silhouette score:
We’ll go ahead and build and visualize a clustering model for n_clusters=3:
AgglomerativeClustering(n_clusters=3)
Congrats! Now try repeating the lab steps on your own data for practice.
        BALANCE  BALANCE_FREQUENCY  PURCHASES  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE  ...
0     40.900749           0.818182      95.40              0.00                    95.4      0.000000  ...
1   3202.467416           0.909091       0.00              0.00                     0.0   6442.945483  ...
2   2495.148862           1.000000     773.17            773.17                     0.0      0.000000  ...
3   1666.670542           0.636364    1499.00           1499.00                     0.0    205.788017  ...
4    817.714335           1.000000      16.00             16.00                     0.0      0.000000  ...
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')
In [3]:
cc_df = pd.read_csv('cc.csv')
cc_df.drop('CUST_ID', axis=1, inplace=True)
cc_df['MINIMUM_PAYMENTS'].fillna(value=cc_df['MINIMUM_PAYMENTS'].mean(), inplace=True)
cc_df['CREDIT_LIMIT'].fillna(value=cc_df['CREDIT_LIMIT'].mean(), inplace=True)
cc_df.head()

In [5]:
cc_df.isnull().sum()
In [6]:
from sklearn.preprocessing import StandardScaler, normalize
scaler = StandardScaler()
scaled_df = scaler.fit_transform(cc_df)

normalized_df = normalize(scaled_df)

# Converting the scaled data back into a pandas DataFrame
normalized_df = pd.DataFrame(data=normalized_df)
In [10]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_principal = pca.fit_transform(normalized_df)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']

X_principal.head()
In [11]:
import scipy.cluster.hierarchy as shc

plt.title('visualising the data')
Dendrogram = shc.dendrogram(shc.linkage(X_principal, method='ward'))
In [12]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# where we'll save scores for later plotting
silhouette_scores = []

# testing different cluster values in range [2,8)
for n_cluster in range(2, 8):
    silhouette_scores.append(silhouette_score(X_principal,
        AgglomerativeClustering(n_clusters=n_cluster).fit_predict(X_principal)))

# Creating bar graph to compare the results. You can use a line plot if you prefer (similar to K Means lab)
plt.bar(x=range(2, 8), height=silhouette_scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()
In [13]:
# creating and fitting model
agg = AgglomerativeClustering(n_clusters=3)
agg.fit(X_principal)
In [14]:
# Visualizing the clustering
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=AgglomerativeClustering(n_clusters=3).fit_predict(X_principal))
plt.show()