Week 7 Lab (KNN and Naive Bayes) COSC 3337


About The Data

In this lab you will learn how to use sklearn to build a machine learning model using the k-Nearest Neighbors algorithm to predict whether the patients in the "Pima Indians Diabetes Dataset" have diabetes or not. The dataset that we'll be using for this task comes from kaggle.com and contains the following attributes:

- Pregnancies: number of times pregnant
- Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: diastolic blood pressure (mm Hg)
- SkinThickness: triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: body mass index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: diabetes pedigree function
- Age: age in years
- Outcome: class variable (0 or 1)

Exploratory Data Analysis

Let's begin by importing some necessary libraries that we'll be using to explore the data. Our first step is to load the data into a pandas DataFrame:

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1

From here, it's always a good step to use describe() and info() to get a better sense of the data and see if we have any missing values.
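For reference, a minimal sketch of these first steps (it mirrors cells In [1] through In [4] in the full listing at the end of this page, and assumes the file is saved locally as diabetes.csv):

import pandas as pd

# load the dataset into a DataFrame and preview the first few rows
diabetes_df = pd.read_csv('diabetes.csv')
print(diabetes_df.head())

# summary of column dtypes and non-null counts
diabetes_df.info()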
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Looking at the info summary, we can see that there are 768 entries in the DataFrame and 768 non-null entries in each feature/column. Thus, there are no missing values, but there is something strange when we look at the describe summary below:

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000

For certain columns, does a value of zero make sense? For example, if an individual had a glucose or blood pressure level of 0, they'd probably be dead, so it's likely that the true values were excluded from the data for some reason. Therefore, we'll consider the following columns to have missing values wherever there's an invalid zero value: Glucose, BloodPressure, SkinThickness, Insulin, and BMI.

Let's go ahead and replace our invalid zero values with NaN, since they're technically missing values. We'll make a copy of our diabetes_df and modify the zeros in the copy, just in case we need to refer back to the original. We can make copies of DataFrames using .copy(deep=True). There's also a very convenient function, .replace(x, y), that will replace all x values with the y value specified.

Before choosing how to impute these missing values, let's take a look at their distributions. Since SkinThickness, Insulin, and BMI look skewed, we'll replace their missing values with the median instead of the mean. Glucose and BloodPressure should be fine if we stick with the mean for imputing. Recall that the mean can be affected by outliers.
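A minimal sketch of this cleaning step (it mirrors cells In [6] through In [8] in the full listing, rewritten as loops; the mean/median choices are the ones discussed above):

import numpy as np

# work on a deep copy so the original DataFrame stays untouched
diabetes_df_copy = diabetes_df.copy(deep=True)

# treat invalid zeros in these columns as missing values
cols_with_invalid_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_df_copy[cols_with_invalid_zeros] = diabetes_df_copy[cols_with_invalid_zeros].replace(0, np.nan)

# mean for the roughly symmetric columns, median for the skewed ones
for col in ['Glucose', 'BloodPressure']:
    diabetes_df_copy[col] = diabetes_df_copy[col].fillna(diabetes_df_copy[col].mean())
for col in ['SkinThickness', 'Insulin', 'BMI']:
    diabetes_df_copy[col] = diabetes_df_copy[col].fillna(diabetes_df_copy[col].median())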
Let's first create a heatmap and see if there are any correlations in our dataset.

Interpretation: no significant case of multicollinearity is observed.

Let's also check out a few scatterplots of our data.

Interpretation: BMI seems to increase slightly as blood pressure increases. However, the majority of the data is centered and clustered at around a blood pressure of 50-95 and a BMI of 20-45, with some outliers scattered around the main cluster. There's a very subtle increase in diabetes pedigree function as glucose increases. The majority of the data falls between a 75 and 175 glucose level. We also have some outliers with a very high diabetes pedigree function, and again the zero outliers, which were replaced in the cleaned copy (diabetes_df_copy).

Note: Don't worry if you can't replicate the plot to the right. You should have learned about Q-Q plots in MATH 3339. In case anyone needs these types of plots, or a certain statistical test with p-values for their project, statsmodels is a great place to find these.

Shapiro-Wilk:
w:0.969902515411377, p-value:1.7774986343921384e-11

Kolmogorov-Smirnov:
d:0.969902515411377, p-value:0.0

Skewness of the data:
0.531677628850459

Interpretation: the distribution of glucose is unimodal and appears to be roughly bell shaped, but it's certainly not a near-perfect normal distribution. The provided Q-Q plot, Shapiro-Wilk, and Kolmogorov-Smirnov tests reject the null hypothesis that the data is normally distributed at the .05 significance level. We can also see, both from the graph and from the skewness score (which should be about zero for normally distributed data), that the data has a slight right skew. The distribution peaks at around 120, with most of the data between 100 and 140.

How does the glucose distribution of people with diabetes vary from those without?

Interpretation: the majority of people in class 0 lie between 93 and 125, whereas the majority of people in class 1 lie between 119 and 167. With that said, this attribute could serve as a good indicator of whether someone is diabetic, since those in class 1 tend toward the higher end compared to class 0.

I encourage you to go ahead and explore the dataset some more to see if you can find some more interesting points, but I'll jump to the pre-processing now since the main goal of this lab is KNN.

Pre-Processing

The most important step here is to standardize our data. Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. If this is not taken into account, any variables that are on a large scale will have a much larger effect on the distance between observations, and hence on the KNN classifier, than variables that are on a small scale. If you recall from MATH 3339, the data is rescaled so that each feature has mean $\mu = 0$ and standard deviation $\sigma = 1$, which is done through this formula:

$$z = \frac{x - \mu}{\sigma}$$

But lucky for us, sklearn can do all of this for us with StandardScaler.
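A sketch of that scaling step (mirroring cell In [13] in the full listing at the end of this page):

from sklearn.preprocessing import StandardScaler
import pandas as pd

# features (everything except the target) and target
X = diabetes_df_copy.drop('Outcome', axis=1)
y = diabetes_df_copy['Outcome']

# fit the scaler and transform every feature to mean 0, std 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# wrap the scaled array back into a labeled DataFrame
X = pd.DataFrame(data=X_scaled, columns=X.columns)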
Taking a look at the data again, we see that it is now scaled:

   Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age
0     0.639947  0.865108      -0.033518       0.670643 -0.181541  0.166619                  0.468492  1.425995
1    -0.844885 -1.206162      -0.529859      -0.012301 -0.181541 -0.852200                 -0.365061 -0.190672
2     1.233880  2.015813      -0.695306      -0.012301 -0.181541 -1.332500                  0.604397 -0.105584
3    -0.844885 -1.074652      -0.529859      -0.695245 -0.540642 -0.633881                 -0.920763 -1.041549
4    -1.141852  0.503458      -2.680669       0.670643  0.316566  1.549303                  5.484909 -0.020496

Creating our Model

We're now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be done using sklearn's train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and the test_size you'd like (generally a test size of around 0.3 is good enough). It will then return a tuple of X_train, X_test, y_train, y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.

The above graph shows that the data is biased towards data points having an outcome value of 0 (diabetes not actually present). The number of non-diabetics is almost twice the number of diabetic patients. This is where an additional parameter, stratify, can come in handy. Stratified sampling aims to split a data set so that each split is similar with respect to something. In a classification setting, it is often used to ensure that the train and test sets have approximately the same percentage of samples of each target class as the complete set.

Recall from lecture that KNN requires us to find some optimal k value. We'll do this by plotting different k values on the x-axis and the model score for that k value on the y-axis, as sketched below. Note: you can also plot the error on the y-axis, which is quite common as well.
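A sketch of the stratified split and the k scan (mirroring cells In [16], In [17], and In [20] in the full listing):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import seaborn as sns

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# score the model on both splits for k = 1..14
train_scores, test_scores = [], []
for k in range(1, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

# plot score vs. k to pick the best k
sns.lineplot(x=list(range(1, 15)), y=train_scores, marker='*', label='Train Score')
sns.lineplot(x=list(range(1, 15)), y=test_scores, marker='o', label='Test Score')
plt.xlabel('K')
plt.ylabel('Score')
plt.title('K vs. Score')
plt.show()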
The best result seems to be captured at k = 11, so 11 will be used for the final model. At this value, our train and test scores don't vary significantly.

0.7532467532467533

Note: You should also take cross validation into account when considering different models. A separate exercise will be created covering different cross validation techniques.

Not bad, but it could be better. See if you can mess with the data and improve on this score. Lastly, let's print out a confusion matrix and classification report of our results.

              precision    recall  f1-score   support

           0       0.79      0.85      0.82       150
           1       0.67      0.58      0.62        81

    accuracy                           0.75       231
   macro avg       0.73      0.71      0.72       231
weighted avg       0.75      0.75      0.75       231

[[127  23]
 [ 34  47]]

Great job! You now know how to use KNeighborsClassifier in sklearn. Try using this on your own dataset and refer back to this lecture if you get stuck.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')

In [3]:
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()

In [4]:
diabetes_df.info()

In [5]:
diabetes_df.describe()

In [6]:
diabetes_df_copy = diabetes_df.copy(deep=True)
diabetes_df_copy['Glucose'] = diabetes_df_copy['Glucose'].replace(0, np.nan)
diabetes_df_copy['BloodPressure'] = diabetes_df_copy['BloodPressure'].replace(0, np.nan)
diabetes_df_copy['SkinThickness'] = diabetes_df_copy['SkinThickness'].replace(0, np.nan)
diabetes_df_copy['Insulin'] = diabetes_df_copy['Insulin'].replace(0, np.nan)
diabetes_df_copy['BMI'] = diabetes_df_copy['BMI'].replace(0, np.nan)

In [7]:
diabetes_df_copy[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].hist(figsize=(20, 10))
plt.show()

In [8]:
diabetes_df_copy['Glucose'].fillna(diabetes_df_copy['Glucose'].mean(), inplace=True)
diabetes_df_copy['BloodPressure'].fillna(diabetes_df_copy['BloodPressure'].mean(), inplace=True)
diabetes_df_copy['SkinThickness'].fillna(diabetes_df_copy['SkinThickness'].median(), inplace=True)
diabetes_df_copy['Insulin'].fillna(diabetes_df_copy['Insulin'].median(), inplace=True)
diabetes_df_copy['BMI'].fillna(diabetes_df_copy['BMI'].median(), inplace=True)

In [9]:
sns.heatmap(diabetes_df_copy.corr(), annot=True)
plt.title('Correlation Matrix')
plt.show()

In [10]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
# the alpha parameter adjusts point transparency: points with more overlap appear darker
sns.scatterplot(x='BloodPressure', y='BMI', data=diabetes_df_copy, alpha=0.3, ax=axes[0])
axes[0].set_title('BloodPressure VS. BMI')
sns.scatterplot(x='Glucose', y='DiabetesPedigreeFunction', data=diabetes_df_copy, alpha=0.3, ax=axes[1])
axes[1].set_title('Glucose VS. DPF')
plt.show()

In [11]:
import statsmodels.api as sm
import scipy
import pylab
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
sns.histplot(diabetes_df_copy['Glucose'], ax=axes[0])
axes[0].set_title('Glucose Distribution')
sm.qqplot(diabetes_df_copy['Glucose'], line='s', ax=axes[1])
axes[1].set_title('Glucose Q-Q Plot')
pylab.show()
w, p_val = scipy.stats.shapiro(diabetes_df_copy['Glucose'])
print('Shapiro-Wilk: \nw:{}, p-value:{}\n'.format(w, p_val))
d, p_val = scipy.stats.kstest(diabetes_df_copy['Glucose'], 'norm')
print('Kolmogorov-Smirnov: \nd:{}, p-value:{}\n'.format(d, p_val))  # bug fix: print d (the KS statistic), not w
print('Skewness of the data: \n{}\n'.format(scipy.stats.skew(diabetes_df_copy['Glucose'])))

In [12]:
class_zero = diabetes_df_copy[(diabetes_df_copy['Outcome'] == 0)]
class_one = diabetes_df_copy[(diabetes_df_copy['Outcome'] == 1)]
plt.hist(x=class_zero['Glucose'], label='class 0', alpha=0.5)
plt.hist(x=class_one['Glucose'], label='class 1', alpha=0.5)
plt.legend()
plt.title('Glucose Distribution')
plt.show()

In [13]:
from sklearn.preprocessing import StandardScaler
# all columns except 'Outcome'
X = diabetes_df_copy.drop('Outcome', axis=1)
y = diabetes_df_copy['Outcome']
# create our scaler object
scaler = StandardScaler()
# use our scaler object to transform/scale our data and save it into X_scaled
X_scaled = scaler.fit_transform(X)
# reassign X to a new DataFrame using the X_scaled values
X = pd.DataFrame(data=X_scaled, columns=X.columns)

In [14]:
X.head()

In [15]:
sns.countplot(x=diabetes_df_copy['Outcome'])
plt.show()

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

In [17]:
from sklearn.neighbors import KNeighborsClassifier
# will append scores here for plotting later
test_scores = []
train_scores = []
# testing k values from 1-14
for i in range(1, 15):
    # create a model with k=i
    knn = KNeighborsClassifier(i)
    # train the model
    knn.fit(X_train, y_train)
    # append scores
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

In [20]:
sns.lineplot(x=range(1, 15), y=train_scores, marker='*', label='Train Score')
sns.lineplot(x=range(1, 15), y=test_scores, marker='o', label='Test Score')
plt.title('K vs. Score')
plt.xlabel('K')
plt.ylabel('Score')
plt.show()

In [21]:
knn = KNeighborsClassifier(11)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

In [22]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Week 7 Lab (Naive Bayes) COSC 3337
Dr. Rizk

About The Data

We'll be using the Adult Dataset from kaggle for this lab, but feel free to follow along with your own dataset. The dataset contains the following attributes:

- age
- workclass
- fnlwgt
- education
- education_num
- marital_status
- occupation
- relationship
- race
- sex
- capital_gain
- capital_loss
- hours_per_week
- native_country
- income

Our goal is to predict whether income exceeds $50K/yr based on census data.

Exploratory Data Analysis

Let's begin by importing some necessary libraries that we'll be using to explore the data. Our first step is to load the data into a pandas DataFrame. For some reason, this dataset did not come with a header / column names, so we will specify that when loading the data and manually add the column names ourselves.
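A sketch of that loading step (mirroring cell In [3] in the full listing at the end of this page, and assuming the file is saved locally as adult.csv):

import pandas as pd

# the raw file has no header row, so pass header=None and name the columns ourselves
adult_df = pd.read_csv('adult.csv', header=None)
adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                    'marital_status', 'occupation', 'relationship', 'race', 'sex',
                    'capital_gain', 'capital_loss', 'hours_per_week',
                    'native_country', 'income']
adult_df.head()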
Calling .info(), we can see that there are no missing values in our dataset, since there are 32561 entries in total and 32561 non-null entries in every column.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64
 11  capital_loss    32561 non-null  int64
 12  hours_per_week  32561 non-null  int64
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

When working with a lot of variables, it's usually a good idea to keep track of your categorical and numerical columns in separate arrays, so that we can easily index our DataFrame by those arrays when we only want to work with one kind of column. For example, when calculating correlations we only want to work with the numerical columns, or else we'd get an error.

Now we can easily explore just the categoricals or just the numericals at a time. Let's begin exploring the categorical variables first.

   workclass         education  marital_status      occupation         relationship   race   sex     native_country  income
0  State-gov         Bachelors  Never-married       Adm-clerical       Not-in-family  White  Male    United-States   <=50K
1  Self-emp-not-inc  Bachelors  Married-civ-spouse  Exec-managerial    Husband        White  Male    United-States   <=50K
2  Private           HS-grad    Divorced            Handlers-cleaners  Not-in-family  White  Male    United-States   <=50K
3  Private           11th       Married-civ-spouse  Handlers-cleaners  Husband        Black  Male    United-States   <=50K
4  Private           Bachelors  Married-civ-spouse  Prof-specialty     Wife           Black  Female  Cuba            <=50K

Does one sex tend to earn more than the other in this dataset?

Interpretation: the majority of our dataset consists of people earning <=50K, and in both income categories men outnumber women, with men making up an even larger share of the >50K group.

What's the most common education people in our dataset have?

Interpretation: high school, some college, and bachelor's degrees seem to be the most common in our dataset.

Let's see how many counts of each race we have in this dataset.

Interpretation: our dataset mostly consists of people from the White race category. Thus, inferences based on race from this dataset could be biased, since we do not have enough data from the other race categories.

What sort of occupations do we have in our dataset, and which are the most common?

Interpretation: Prof-specialty, Craft-repair, and Exec-managerial are the top 3 occupations in our dataset. Also, there's a '?' signifying unknown. We'll have to make sure to replace those question marks with null/NaN values, since these should really be missing values. If you take a look, you'll see that workclass and native_country also have '?' values, so we'll replace those with NaN as follows. Note: there was a small space in front of the question mark, so make sure to include that if you're using the same dataset.
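A sketch of that replacement (mirroring cell In [11] in the full listing, rewritten as a loop; note the leading space in ' ?'):

import numpy as np

# the dataset encodes unknowns as ' ?' (with a leading space); treat them as missing
for col in ['workclass', 'occupation', 'native_country']:
    adult_df[col] = adult_df[col].replace(' ?', np.nan)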
After running the cell above, we can see that we have the following missing values, which we'll have to take care of.

Let's now briefly explore the numerical variables.

   age  fnlwgt  education_num  capital_gain  capital_loss  hours_per_week
0   39   77516             13          2174             0              40
1   50   83311             13             0             0              13
2   38  215646              9             0             0              40
3   53  234721              7             0             0              40
4   28  338409             13             0             0              40

Let's check if there are any ' ?' missing values in any of the numerical columns like we had in the categoricals. We can do this by looping through every variable in the numericals list and printing a note if that column contains a ' ?'. Great, there are no missing values to take care of here; we'll just have to take care of the categorical missing values later.

What do the distributions of our numerical variables look like?

I encourage you to go ahead and explore the dataset some more to see if you can find some more interesting points, but I'll jump to the pre-processing now, since you should be comfortable exploring datasets by now and the main goal of this lab is to learn how to create and evaluate a Naive Bayes model in sklearn.

Pre-Processing

We'll first take care of the missing categorical values. One option is to replace the missing values with the most frequent value (the mode), which we'll do below. In general, options for dealing with missing categorical values include:

- Remove observations with missing values, if we are dealing with a large dataset and the number of records containing missing values is small.
- Remove the variable/column, if it is not significant.
- Develop a model to predict the missing values (KNN, for example).
- Replace the missing values with the most frequent value in that column.

Our next step is to encode these categories. Since our categories don't really have any type of order to preserve, we'll use one-hot encoding / get_dummies. Refer back to lab 5 if you're having trouble using dummy variables. Let's then map all of our variables onto the same scale, following the same steps as in the KNN lab. The only difference from the KNN lab is that here we're using RobustScaler, which scales features using statistics that are robust to outliers. These steps are sketched below.
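A sketch of these pre-processing steps (mirroring cells In [16] through In [18] and In [26] in the full listing; it assumes the categoricals and numericals lists defined in cell In [5]):

import pandas as pd
from sklearn.preprocessing import RobustScaler

# impute missing categoricals with the most frequent value (mode)
for col in ['workclass', 'occupation', 'native_country']:
    adult_df[col] = adult_df[col].fillna(adult_df[col].mode()[0])

# one-hot encode the categorical columns (drop_first avoids redundant dummies)
adult_df = pd.get_dummies(data=adult_df, columns=categoricals, drop_first=True)

# features/target split; the dummy column 'income_ >50K' is our target
X = adult_df.drop('income_ >50K', axis=1)
y = adult_df['income_ >50K']

# scale only the numerical columns with the outlier-robust scaler
scaler = RobustScaler()
X[numericals] = scaler.fit_transform(X[numericals])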
Creating our Model

We're now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be done using sklearn's train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and the test_size you'd like (generally a test size of around 0.3 is good enough). It will then return a tuple of X_train, X_test, y_train, y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.

GaussianNB()

Model Evaluation

Now that we've finished training, we can make predictions off of the test data and evaluate our model's performance using the corresponding test data labels (y_test). Check the accuracy score:

Model accuracy score: 0.8228

Compare the train set and test set accuracy:

Training set score: 0.8241
Test set score: 0.8228

The training set accuracy score is 0.8241 while the test set accuracy is 0.8228. These two values are quite comparable, so there is no sign of overfitting.

Confusion matrix results (taking income >50K, i.e. class 1, as the positive class):

Confusion matrix
 [[6299 1090]
 [ 641 1739]]

True Positives(TP) = 1739
True Negatives(TN) = 6299
False Positives(FP) = 1090
False Negatives(FN) = 641

A classification report is another way to evaluate classification model performance. It displays the precision, recall, f1, and support scores for the model. Let's print these as well.

              precision    recall  f1-score   support

           0       0.91      0.85      0.88      7389
           1       0.61      0.73      0.67      2380

    accuracy                           0.82      9769
   macro avg       0.76      0.79      0.77      9769
weighted avg       0.84      0.82      0.83      9769

Let's also perform k-fold cross validation (10-fold below). We can do this using cross_val_score(model, X_train, y_train, cv=k, scoring=...), as sketched below.
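A sketch of that cross-validation step (mirroring cells In [29] and In [37] in the full listing):

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# fit a Gaussian Naive Bayes classifier on the training split
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# 10-fold cross validation on the training data, scored by accuracy
scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='accuracy')
print('Cross-validation scores: {}'.format(scores))
print('Average cross-validation score: {:.4f}'.format(scores.mean()))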
Cross-validation scores: [0.82587719 0.82763158 0.82272927 0.81263712 0.83501536 0.82053532
 0.82404563 0.83457657 0.81000439 0.82316806]

Average cross-validation score: 0.8236

Interpretation: using the mean cross-validation score, we can conclude that we expect the model to be around 82.36% accurate on average. Looking at all 10 scores produced by the 10-fold cross-validation, we can also see that there is relatively little variance in accuracy between folds, so the model's performance does not depend much on the particular folds used for training.

Great job! You now know how to use a Naive Bayes model in sklearn. Try using this on your own dataset and refer back to this lecture if you get stuck.

For reference, here are the first five rows of the raw adult_df, the one-hot encoded DataFrame, and the scaled feature matrix (columns truncated to fit):

   age  workclass          fnlwgt  education  education_num  marital_status      occupation         relationship   race   sex     capital_gain  ...
0   39  State-gov           77516  Bachelors             13  Never-married       Adm-clerical       Not-in-family  White  Male            2174  ...
1   50  Self-emp-not-inc    83311  Bachelors             13  Married-civ-spouse  Exec-managerial    Husband        White  Male               0  ...
2   38  Private            215646  HS-grad                9  Divorced            Handlers-cleaners  Not-in-family  White  Male               0  ...
3   53  Private            234721  11th                   7  Married-civ-spouse  Handlers-cleaners  Husband        Black  Male               0  ...
4   28  Private            338409  Bachelors             13  Married-civ-spouse  Prof-specialty     Wife           Black  Female             0  ...

   age  fnlwgt  education_num  capital_gain  capital_loss  hours_per_week  workclass_Local-gov  workclass_Never-worked  workclass_Private  workclass_Self-emp-inc  ...
0   39   77516             13          2174             0              40                    0                       0                  0                       0  ...
1   50   83311             13             0             0              13                    0                       0                  0                       0  ...
2   38  215646              9             0             0              40                    0                       0                  1                       0  ...
3   53  234721              7             0             0              40                    0                       0                  1                       0  ...
4   28  338409             13             0             0              40                    0                       0                  1                       0  ...

5 rows × 98 columns

    age    fnlwgt  education_num  capital_gain  capital_loss  hours_per_week  workclass_Local-gov  workclass_Never-worked  workclass_Private  workclass_Self-emp-inc
0  0.10 -0.845803       1.000000        2174.0           0.0             0.0                    0                       0                  0                       0
1  0.65 -0.797197       1.000000           0.0           0.0            -5.4                    0                       0                  0                       0
2  0.05  0.312773      -0.333333           0.0           0.0             0.0                    0                       0                  1                       0
3  0.80  0.472766      -1.000000           0.0           0.0             0.0                    0                       0                  1                       0
4 -0.45  1.342456       1.000000           0.0           0.0             0.0                    0                       0                  1                       0

5 rows × 97 columns
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')

In [3]:
adult_df = pd.read_csv('adult.csv', header=None)
adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                    'marital_status', 'occupation', 'relationship', 'race', 'sex',
                    'capital_gain', 'capital_loss', 'hours_per_week',
                    'native_country', 'income']
adult_df.head()

In [4]:
adult_df.info()

In [5]:
categoricals = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
                'native_country', 'income']
numericals = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']

In [6]:
adult_df[categoricals].head()

In [7]:
sns.countplot(x=adult_df['income'], hue='sex', data=adult_df)
plt.show()

In [8]:
# order= is an optional parameter, which is just sorting the bars in this case
sns.countplot(x=adult_df['education'], order=adult_df['education'].value_counts().index)
plt.xticks(rotation=45)
plt.show()

In [9]:
sns.countplot(x=adult_df['race'], data=adult_df)
plt.show()

In [10]:
sns.countplot(x=adult_df['occupation'], data=adult_df, order=adult_df['occupation'].value_counts().index)
plt.xticks(rotation=45)
plt.show()

In [11]:
adult_df['workclass'] = adult_df['workclass'].replace(' ?', np.nan)
adult_df['occupation'] = adult_df['occupation'].replace(' ?', np.nan)
adult_df['native_country'] = adult_df['native_country'].replace(' ?', np.nan)

In [12]:
sns.barplot(x=adult_df.columns, y=adult_df.isnull().sum().values)
plt.xticks(rotation=45)
plt.show()

In [13]:
adult_df[numericals].head()

In [14]:
for variable in numericals:
    if not adult_df[adult_df[variable] == ' ?'].empty:
        print(f'{variable} contains missing values ( ?)')

In [15]:
adult_df[numericals].hist(figsize=(20, 10))
plt.show()

In [16]:
adult_df['workclass'].fillna(adult_df['workclass'].mode()[0], inplace=True)
adult_df['occupation'].fillna(adult_df['occupation'].mode()[0], inplace=True)
adult_df['native_country'].fillna(adult_df['native_country'].mode()[0], inplace=True)

In [17]:
adult_df = pd.get_dummies(data=adult_df, columns=categoricals, drop_first=True)

In [18]:
adult_df.head()

In [26]:
from sklearn.preprocessing import RobustScaler
# all columns except our target column for X
X = adult_df.drop('income_ >50K', axis=1)
y = adult_df['income_ >50K']
# create our scaler object
scaler = RobustScaler()
# use our scaler object to transform/scale our data and save it into X_scaled;
# we only need to transform the numerical data
X_scaled = scaler.fit_transform(X[numericals])
# reassign X[numericals] to the transformed numerical data
X[numericals] = X_scaled

In [27]:
X.head()

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [29]:
from sklearn.naive_bayes import GaussianNB
# instantiate the model to train a Gaussian Naive Bayes classifier
gnb = GaussianNB()
# fit the model
gnb.fit(X_train, y_train)

In [30]:
y_pred = gnb.predict(X_test)

In [31]:
from sklearn.metrics import accuracy_score
print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

In [32]:
y_pred_train = gnb.predict(X_train)
print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

In [33]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n\n', cm)
# with class 1 (income >50K) as the positive class:
print('\nTrue Positives(TP) = ', cm[1,1])
print('\nTrue Negatives(TN) = ', cm[0,0])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])

In [34]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

In [37]:
from sklearn.model_selection import cross_val_score
# applying 10-fold cross validation
scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='accuracy')
print('Cross-validation scores: {}'.format(scores))
# compute the average cross-validation score
print('\nAverage cross-validation score: {:.4f}'.format(scores.mean()))