Description
About The Data

In this lab you will learn how to use sklearn to build a machine learning model using the k-Nearest Neighbors algorithm to predict whether the patients in the "Pima Indians Diabetes Dataset" have diabetes or not. The dataset that we'll be using for this task comes from kaggle.com and contains the following attributes:

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (in years)
Outcome: Class variable (0 or 1)

Exploratory Data Analysis

Let's begin by importing some necessary libraries that we'll be using to explore the data. Our first step is to load the data into a pandas DataFrame.

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1

From here, it's always a good step to use describe() and info() to get a better sense of the data and see if we have any missing values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #  Column                    Non-Null Count  Dtype
 0  Pregnancies               768 non-null    int64
 1  Glucose                   768 non-null    int64
 2  BloodPressure             768 non-null    int64
 3  SkinThickness             768 non-null    int64
 4  Insulin                   768 non-null    int64
 5  BMI                       768 non-null    float64
 6  DiabetesPedigreeFunction  768 non-null    float64
 7  Age                       768 non-null    int64
 8  Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Looking at the info summary, we can see that there are 768 entries in the DataFrame, and 768 non-null entries in each feature/column. Thus, there are no missing values, but there is something strange when we look at the describe summary below.

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000

For certain columns, does a value of zero make sense? For example, if an individual had a glucose or blood pressure level of 0, they'd probably be dead, so it's likely that the true values were excluded from the data for some reason. Therefore, we'll consider the following columns to have missing values where there's an invalid zero value: Glucose, BloodPressure, SkinThickness, Insulin, BMI.

Let's go ahead and replace our invalid zero values with NaN, since they're technically missing values. We'll make a copy of our diabetes_df and modify the zeros in the copy, just in case we need to refer back to the original. We can make copies of DataFrames using .copy(deep=True). There's also a very convenient function we can call, .replace(x, y), that will replace all x values with the y value specified.

Before choosing how to impute these missing values, let's take a look at their distributions. Since SkinThickness, Insulin, and BMI look skewed, we'll go ahead and replace their missing values with the median instead of the mean. Glucose and BloodPressure should be fine if we stick with the mean for imputing. Recall that the mean can be affected by outliers.
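For reference, here is a compact sketch of the zero-replacement and imputation steps just described, equivalent to cells In [6] and In [8] at the end of this lab (it assumes diabetes.csv is in the working directory, as elsewhere in the lab):

import numpy as np
import pandas as pd

diabetes_df = pd.read_csv('diabetes.csv')

# columns where a zero is physiologically implausible, i.e. really a missing value
invalid_zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# work on a deep copy so the original DataFrame stays untouched
diabetes_df_copy = diabetes_df.copy(deep=True)
diabetes_df_copy[invalid_zero_cols] = diabetes_df_copy[invalid_zero_cols].replace(0, np.nan)

# mean for the roughly symmetric columns, median for the skewed ones
for col in ['Glucose', 'BloodPressure']:
    diabetes_df_copy[col] = diabetes_df_copy[col].fillna(diabetes_df_copy[col].mean())
for col in ['SkinThickness', 'Insulin', 'BMI']:
    diabetes_df_copy[col] = diabetes_df_copy[col].fillna(diabetes_df_copy[col].median())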
Let's first create a heatmap and see if there are any correlations in our dataset.

Interpretation: No significant case of multicollinearity is observed.

Let's also check out a few scatterplots of our data.

Interpretation: BMI seems to increase slightly as blood pressure increases. However, the majority of the data is centered and clustered at around a blood pressure of 50-95 and a BMI of 20-45. We've also got some outliers scattered around the main cluster. There's a very subtle increase in diabetes pedigree function as glucose increases. The majority of the data tends to fall between a 75 and 175 glucose level. We also have some outliers with a very high diabetes pedigree function, and again the zero outliers, which were removed in the no_zeros_df.

Note: Don't worry if you can't replicate the plot to the right. You should have learned about QQ plots in MATH 3339. In case anyone needs these types of plots or a certain statistical test with p-values for their project, statsmodels is a great place to find these.

Shapiro-Wilk:
w: 0.969902515411377, p-value: 1.7774986343921384e-11

Kolmogorov-Smirnov:
d: 0.969902515411377, p-value: 0.0

Skewness of the data:
0.531677628850459

Interpretation: The distribution of glucose is unimodal and appears to be roughly bell shaped, but it's certainly not a near-perfect normal distribution. The provided Q-Q plot, Shapiro-Wilk, and Kolmogorov-Smirnov tests seem to reject the null hypothesis that the data come from a normal distribution at the .05 significance level. We can also see, both from the graph and from the skewness score above (which should be about zero for normally distributed data), that the data has a slight right skew. The distribution peaks at around 120, with most of the data between 100 and 140.

How does the glucose distribution of people with diabetes vary from those without?

Interpretation: The majority of people in class 0 lie between 93 and 125, whereas the majority of people in class 1 lie between 119 and 167. With that said, this attribute could serve as a good indicator of whether someone is diabetic, since those in class 1 tend to be on the higher end compared to class 0.

I encourage you to go ahead and explore the dataset some more to see if you can find some more interesting points, but I'll jump to the pre-processing now since the main goal of this lab is KNN.

Pre-Processing

The most important step here is to standardize our data. Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. If this is not taken into account, any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale. If you recall from MATH 3339, the data is rescaled so that the mean is 0 and the standard deviation is 1, using the formula z = (x - mu) / sigma. But lucky for us, sklearn can do all of this for us. Taking a look at the data again, we see that it is now scaled.

   Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age
0     0.639947  0.865108      -0.033518       0.670643 -0.181541  0.166619                  0.468492  1.425995
1    -0.844885 -1.206162      -0.529859      -0.012301 -0.181541 -0.852200                 -0.365061 -0.190672
2     1.233880  2.015813      -0.695306      -0.012301 -0.181541 -1.332500                  0.604397 -0.105584
3    -0.844885 -1.074652      -0.529859      -0.695245 -0.540642 -0.633881                 -0.920763 -1.041549
4    -1.141852  0.503458      -2.680669       0.670643  0.316566  1.549303                  5.484909 -0.020496
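If you'd like to convince yourself that StandardScaler applies exactly the z = (x - mu) / sigma formula above, here is a quick sketch using the first five glucose values from the head of the dataset:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

x = pd.Series([148.0, 85.0, 183.0, 89.0, 137.0], name='Glucose')

# manual z-score: subtract the mean, divide by the population standard deviation
z_manual = (x - x.mean()) / x.std(ddof=0)

# StandardScaler performs the same computation column by column
z_sklearn = StandardScaler().fit_transform(x.to_frame()).ravel()

print(np.allclose(z_manual.values, z_sklearn))  # True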
Creating our Model

We're now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be done using sklearn's train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and the test_size you'd like (generally a test size of around 0.3 is good enough). It will then return a tuple of X_train, X_test, y_train, y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.

The above graph shows that the data is biased towards data points having an outcome value of 0 (diabetes not actually present). The number of non-diabetics is almost twice the number of diabetic patients. This is where an additional parameter, stratify, can come in handy. Stratified sampling aims to split a data set so that each split is similar with respect to something. In a classification setting, it is often used to ensure that the train and test sets have approximately the same percentage of samples of each target class as the complete set.

Recall from lecture that KNN requires us to find some optimal k value. We'll do this by plotting different k values on the x-axis and the model score for that k value on the y-axis. Note: You can also plot the error on the y-axis, which is quite common as well.

The best result seems to be captured at k = 11, so 11 will be used for the final model. At this value our train and test scores don't vary significantly.

0.7532467532467533

Note: You should also take into account cross validation when considering different models. A separate exercise, however, will be created covering different cross validation techniques. Not bad, but could be better. See if you can mess with the data and improve on this score.

Lastly, let's print out a confusion matrix and classification report of our results.

              precision    recall  f1-score   support

           0       0.79      0.85      0.82       150
           1       0.67      0.58      0.62        81

    accuracy                           0.75       231
   macro avg       0.73      0.71      0.72       231
weighted avg       0.75      0.75      0.75       231

[[127  23]
 [ 34  47]]
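To see how the classification report and the confusion matrix line up, you can recover class 1's precision and recall by hand from the matrix above. Class 1 predictions occupy the second column and class 1 actuals the second row, so:

precision(1) = 47 / (23 + 47) ≈ 0.67
recall(1)    = 47 / (34 + 47) ≈ 0.58

which matches the 0.67 and 0.58 reported for class 1.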
Great job! You now know how to use KNeighborsClassifier in sklearn. Try using this on your own dataset and refer back to this lecture if you get stuck.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')

In [3]:
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()

Out[3]:

In [4]:
diabetes_df.info()

In [5]:
diabetes_df.describe()

Out[5]:

In [6]:
diabetes_df_copy = diabetes_df.copy(deep=True)
diabetes_df_copy['Glucose'] = diabetes_df_copy['Glucose'].replace(0, np.NaN)
diabetes_df_copy['BloodPressure'] = diabetes_df_copy['BloodPressure'].replace(0, np.NaN)
diabetes_df_copy['SkinThickness'] = diabetes_df_copy['SkinThickness'].replace(0, np.NaN)
diabetes_df_copy['Insulin'] = diabetes_df_copy['Insulin'].replace(0, np.NaN)
diabetes_df_copy['BMI'] = diabetes_df_copy['BMI'].replace(0, np.NaN)

In [7]:
diabetes_df_copy[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].hist(figsize=(20, 10))
plt.show()

In [8]:
diabetes_df_copy['Glucose'].fillna(diabetes_df_copy['Glucose'].mean(), inplace=True)
diabetes_df_copy['BloodPressure'].fillna(diabetes_df_copy['BloodPressure'].mean(), inplace=True)
diabetes_df_copy['SkinThickness'].fillna(diabetes_df_copy['SkinThickness'].median(), inplace=True)
diabetes_df_copy['Insulin'].fillna(diabetes_df_copy['Insulin'].median(), inplace=True)
diabetes_df_copy['BMI'].fillna(diabetes_df_copy['BMI'].median(), inplace=True)

In [9]:
sns.heatmap(diabetes_df_copy.corr(), annot=True)
plt.title('Correlation Matrix')
plt.show()

In [10]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
# the alpha parameter adjusts point transparency: points with more overlap appear darker
sns.scatterplot(x='BloodPressure', y='BMI', data=diabetes_df_copy, alpha=0.3, ax=axes[0])
axes[0].set_title('BloodPressure VS. BMI')
sns.scatterplot(x='Glucose', y='DiabetesPedigreeFunction', data=diabetes_df_copy, alpha=0.3, ax=axes[1])
axes[1].set_title('Glucose VS. DPF')
plt.show()
In [11]:
import statsmodels.api as sm
import scipy
import pylab

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
sns.histplot(diabetes_df_copy['Glucose'], ax=axes[0])
axes[0].set_title('Glucose Distribution')
sm.qqplot(diabetes_df_copy['Glucose'], line='s', ax=axes[1])
axes[1].set_title('Glucose QQ Plot')
pylab.show()

w, p_val = scipy.stats.shapiro(diabetes_df_copy['Glucose'])
print('Shapiro-Wilk: \nw:{}, p-value:{}\n'.format(w, p_val))
d, p_val = scipy.stats.kstest(diabetes_df_copy['Glucose'], 'norm')
# print the KS statistic d here, not w
print('Kolmogorov-Smirnov: \nd:{}, p-value:{}\n'.format(d, p_val))
print('Skewness of the data: \n{}\n'.format(scipy.stats.skew(diabetes_df_copy['Glucose'])))

In [12]:
class_zero = diabetes_df_copy[(diabetes_df_copy['Outcome'] == 0)]
class_one = diabetes_df_copy[(diabetes_df_copy['Outcome'] == 1)]
plt.hist(x=class_zero['Glucose'], label='class 0', alpha=0.5)
plt.hist(x=class_one['Glucose'], label='class 1', alpha=0.5)
plt.legend()
plt.title('Glucose Distribution')
plt.show()

In [13]:
from sklearn.preprocessing import StandardScaler

# all columns except 'Outcome'
X = diabetes_df_copy.drop('Outcome', axis=1)
y = diabetes_df_copy['Outcome']

# create our scaler object
scaler = StandardScaler()

# use our scaler object to transform/scale our data and save it into X_scaled
X_scaled = scaler.fit_transform(X)

# reassign X to a new DataFrame using the X_scaled values
X = pd.DataFrame(data=X_scaled, columns=X.columns)

In [14]:
X.head()

Out[14]:

In [15]:
sns.countplot(x=diabetes_df_copy['Outcome'])
plt.show()

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

In [17]:
from sklearn.neighbors import KNeighborsClassifier

# will append scores here for plotting later
test_scores = []
train_scores = []

# testing k values from 1-14
for i in range(1, 15):
    # create a model with k=i
    knn = KNeighborsClassifier(i)
    # train the model
    knn.fit(X_train, y_train)
    # append scores
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

In [20]:
sns.lineplot(x=range(1, 15), y=train_scores, marker='*', label='Train Score')
sns.lineplot(x=range(1, 15), y=test_scores, marker='o', label='Test Score')
plt.title('K vs. Score')
plt.xlabel('K')
plt.ylabel('Score')
plt.show()

In [21]:
knn = KNeighborsClassifier(11)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

Out[21]:

In [22]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
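As a follow-up to the cross-validation note earlier, here is one way you could pick k with cross-validation instead of a single train/test comparison. This is a minimal sketch (not the lab's official solution) using sklearn's GridSearchCV and the same X_train and y_train created in cell In [16]:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# search k = 1..14 with 5-fold cross-validation on the training set
param_grid = {'n_neighbors': range(1, 15)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)  # the k with the best mean cross-validated accuracy
print(grid.best_score_)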
Week 7 Lab (Naive Bayes)
COSC 3337 Dr. Rizk

About The Data

We'll be using the Adult Dataset from kaggle for this lab, but feel free to follow along with your own dataset. The dataset contains the following attributes:

age, workclass, fnlwgt, education, education_num, marital_status, occupation, relationship, race, sex, capital_gain, capital_loss, hours_per_week, native_country, income

Our goal is to predict whether income exceeds $50K/yr based on census data.

Exploratory Data Analysis

Let's begin by importing some necessary libraries that we'll be using to explore the data. Our first step is to load the data into a pandas DataFrame. For some reason, this dataset did not come with a header/column names, so we will specify that when loading the data and manually add the column names ourselves.

Calling .info(), we can see that there are no missing values in our dataset, since there are 32561 entries in total and 32561 non-null entries in every column.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64
 11  capital_loss    32561 non-null  int64
 12  hours_per_week  32561 non-null  int64
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

When working with a lot of variables, it's usually a good idea to keep track of your categorical and numerical columns in separate arrays. That way we can easily index our DataFrame by one of those arrays whenever we only want to work with, say, the numerical columns. For example, when calculating correlations we only want the numerical columns, or else we will get an error. Now we can easily explore just the categoricals or just the numericals at a time.

Let's begin exploring the categorical variables first.

          workclass  education      marital_status         occupation   relationship   race     sex native_country income
0         State-gov  Bachelors       Never-married       Adm-clerical  Not-in-family  White    Male  United-States  <=50K
1  Self-emp-not-inc  Bachelors  Married-civ-spouse    Exec-managerial        Husband  White    Male  United-States  <=50K
2           Private    HS-grad            Divorced  Handlers-cleaners  Not-in-family  White    Male  United-States  <=50K
3           Private       11th  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male  United-States  <=50K
4           Private  Bachelors  Married-civ-spouse     Prof-specialty           Wife  Black  Female           Cuba  <=50K

Does one sex tend to earn more than the other in this dataset?

Interpretation: The majority of our dataset consists of people earning <=50K, and in both income categories (<=50K and >50K) men outnumber women, with the gap especially pronounced in the >50K group.

What's the most common education people in our dataset have?

Interpretation: High school, some college, and bachelor's degrees seem to be the most common in our dataset.

Let's see how many counts of each race we have in this dataset.

Interpretation: Our dataset mostly consists of people from the White race category. Thus, inferences based on race from this dataset could be biased, since we do not have enough data from the other race categories.

What sort of occupations do we have in our dataset, and which are most common?

Interpretation: Prof-specialty, Craft-repair, and Exec-managerial are the top 3 occupations in our dataset. Also, there's a '?' signifying unknown. We'll have to make sure to replace those question marks with null/NaN values, since these should really be missing values.
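Before replacing anything, here is a quick sketch for finding every column that contains the ' ?' placeholder (note the leading space, which is discussed again below; adult_df is the DataFrame loaded above):

# count ' ?' placeholders in every text column
for col in adult_df.select_dtypes(include='object').columns:
    n = (adult_df[col] == ' ?').sum()
    if n > 0:
        print(col, n)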
If you take a look, you'll see that workclass and native_country also have '?' values, so we'll replace those with NaN as follows. Note: There was a small space in front of the question mark, so make sure to include that if you're using the same dataset.

After running the cell above, we can see that we have the following missing values, which we'll have to take care of.

Let's now briefly explore the numerical variables.

   age  fnlwgt  education_num  capital_gain  capital_loss  hours_per_week
0   39   77516             13          2174             0              40
1   50   83311             13             0             0              13
2   38  215646              9             0             0              40
3   53  234721              7             0             0              40
4   28  338409             13             0             0              40

Let's check if there are any ' ?' missing values in any of the numerical columns like we had in the categoricals. We can do this by looping through every variable in the numericals list and printing a note if that column contains a ' ?'. Great, there are no missing values to take care of here; we'll just have to take care of the categorical missing values later.

What do the distributions of our numerical variables look like?

I encourage you to go ahead and explore the dataset some more to see if you can find some more interesting points, but I'll jump to the pre-processing now, since you should be comfortable exploring datasets by now and the main goal of this lab is to learn how to create and evaluate a Naive Bayes model in sklearn.

Pre-Processing

We'll first take care of the missing categorical values. One option is to replace the missing values with the most frequent value (the mode), which we'll do below. However, options for dealing with missing categorical variables include:

Remove observations with missing values if we are dealing with a large dataset and the number of records containing missing values is small.
Remove the variable/column if it is not significant.
Develop a model to predict the missing values (KNN, for example).
Replace the missing values with the most frequent value in that column.

Our next step is to encode these categories. Since our categories don't really have any type of order to preserve, we'll use one-hot encoding / get_dummies. Refer back to lab 5 if you're having trouble using dummy variables, but we'll encode as follows:

Let's now map all of our variables onto the same scale. We'll follow the same steps as in the KNN lab. The only difference from the KNN lab is that here we're using RobustScaler, which scales features using statistics that are robust to outliers: it subtracts the median and divides by the interquartile range (IQR) rather than the mean and standard deviation, so a handful of extreme values can't dominate the scaling.
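Here is a small sketch illustrating what RobustScaler computes, using a made-up column with one extreme value (the numbers are invented purely for illustration):

import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# a tiny made-up column with one extreme outlier (99)
x = pd.DataFrame({'hours_per_week': [20.0, 40.0, 40.0, 45.0, 99.0]})

scaled = RobustScaler().fit_transform(x)

# RobustScaler computes (x - median) / IQR, so the outlier cannot drag
# the center and scale the way it would with mean and standard deviation
median = x['hours_per_week'].median()  # 40.0
iqr = x['hours_per_week'].quantile(0.75) - x['hours_per_week'].quantile(0.25)  # 45.0 - 40.0 = 5.0
print(np.allclose(scaled.ravel(), (x['hours_per_week'] - median) / iqr))  # True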
Creating our Model

We're now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be done using sklearn's train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and the test_size you'd like (generally a test size of around 0.3 is good enough). It will then return a tuple of X_train, X_test, y_train, y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.

GaussianNB()

Model Evaluation

Now that we've finished training, we can make predictions off of the test data and evaluate our model's performance using the corresponding test data labels (y_test).

Check the accuracy score:

Model accuracy score: 0.8228

Compare the train set and test set accuracy:

Training set score: 0.8241
Test set score: 0.8228

The training set accuracy score is 0.8241 while the test set accuracy is 0.8228. These two values are quite comparable, so there is no sign of overfitting.

Confusion matrix results:

Confusion matrix
[[6299 1090]
 [ 641 1739]]

With class 1 (income >50K) as the positive class:

True Negatives(TN) = 6299
True Positives(TP) = 1739
False Positives(FP) = 1090
False Negatives(FN) = 641
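As a quick sanity check, the accuracy score can be recovered directly from the confusion matrix: the correct predictions are the diagonal entries, so accuracy = (6299 + 1739) / 9769 ≈ 0.8228, matching the score reported above.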
The classification report is another way to evaluate classification model performance. It displays the precision, recall, f1-score, and support scores for the model. Let's print these as well.

              precision    recall  f1-score   support

           0       0.91      0.85      0.88      7389
           1       0.61      0.73      0.67      2380

    accuracy                           0.82      9769
   macro avg       0.76      0.79      0.77      9769
weighted avg       0.84      0.82      0.83      9769

Let's also perform k-fold cross validation (10-fold below). We can do this using cross_val_score(model, X_train, y_train, cv=k, scoring).

Cross-validation scores: [0.82587719 0.82763158 0.82272927 0.81263712 0.83501536 0.82053532 0.82404563 0.83457657 0.81000439 0.82316806]

Average cross-validation score: 0.8236

Interpretation: Using the mean cross-validation score, we can expect the model to be around 82.36% accurate on average. If we look at all 10 scores produced by the 10-fold cross-validation, we can also see that there is relatively small variance in the accuracy between folds, so we can conclude that the model is largely independent of the particular folds used for training.

Great job! You now know how to use a Naive Bayes model in sklearn. Try using this on your own dataset and refer back to this lecture if you get stuck.

   age         workclass  fnlwgt  education  education_num      marital_status         occupation   relationship   race     sex  capital_gain  …
0   39         State-gov   77516  Bachelors             13       Never-married       Adm-clerical  Not-in-family  White    Male          2174  …
1   50  Self-emp-not-inc   83311  Bachelors             13  Married-civ-spouse    Exec-managerial        Husband  White    Male             0  …
2   38           Private  215646    HS-grad              9            Divorced  Handlers-cleaners  Not-in-family  White    Male             0  …
3   53           Private  234721       11th              7  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male             0  …
4   28           Private  338409  Bachelors             13  Married-civ-spouse     Prof-specialty           Wife  Black  Female             0  …

   age  fnlwgt  education_num  capital_gain  capital_loss  hours_per_week  workclass_Local-gov  workclass_Never-worked  workclass_Private  workclass_Self-emp-inc  …
0   39   77516             13          2174             0              40                    0                       0                  0                       0  …
1   50   83311             13             0             0              13                    0                       0                  0                       0  …
2   38  215646              9             0             0              40                    0                       0                  1                       0  …
3   53  234721              7             0             0              40                    0                       0                  1                       0  …
4   28  338409             13             0             0              40                    0                       0                  1                       0  …

5 rows × 98 columns

    age    fnlwgt  education_num  capital_gain  capital_loss  hours_per_week  workclass_Local-gov  workclass_Never-worked  workclass_Private  workclass_Self-emp-inc  …
0  0.10 -0.845803       1.000000        2174.0           0.0             0.0                    0                       0                  0                       0  …
1  0.65 -0.797197       1.000000           0.0           0.0            -5.4                    0                       0                  0                       0  …
2  0.05  0.312773      -0.333333           0.0           0.0             0.0                    0                       0                  1                       0  …
3  0.80  0.472766      -1.000000           0.0           0.0             0.0                    0                       0                  1                       0  …
4 -0.45  1.342456       1.000000           0.0           0.0             0.0                    0                       0                  1                       0  …

5 rows × 97 columns

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')

In [3]:
adult_df = pd.read_csv('adult.csv', header=None)
adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation',
                    'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
                    'native_country', 'income']
adult_df.head()

Out[3]:

In [4]:
adult_df.info()

In [5]:
categoricals = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
                'native_country', 'income']
numericals = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']

In [6]:
adult_df[categoricals].head()

Out[6]:

In [7]:
sns.countplot(x=adult_df['income'], hue='sex', data=adult_df)
plt.show()

In [8]:
# order= is an optional parameter, which is just sorting the bars in this case
sns.countplot(x=adult_df['education'], order=adult_df['education'].value_counts().index)
plt.xticks(rotation=45)
plt.show()

In [9]:
sns.countplot(x=adult_df['race'], data=adult_df)
plt.show()

In [10]:
sns.countplot(x=adult_df['occupation'], data=adult_df, order=adult_df['occupation'].value_counts().index)
plt.xticks(rotation=45)
plt.show()

In [11]:
adult_df['workclass'] = adult_df['workclass'].replace(' ?', np.NaN)
adult_df['occupation'] = adult_df['occupation'].replace(' ?', np.NaN)
adult_df['native_country'] = adult_df['native_country'].replace(' ?', np.NaN)

In [12]:
sns.barplot(x=adult_df.columns, y=adult_df.isnull().sum().values)
plt.xticks(rotation=45)
plt.show()

In [13]:
adult_df[numericals].head()

Out[13]:

In [14]:
for variable in numericals:
    if not adult_df[adult_df[variable] == ' ?'].empty:
        print(f'{variable} contains missing values ( ?)')

In [15]:
adult_df[numericals].hist(figsize=(20, 10))
plt.show()

In [16]:
adult_df['workclass'].fillna(adult_df['workclass'].mode()[0], inplace=True)
adult_df['occupation'].fillna(adult_df['occupation'].mode()[0], inplace=True)
adult_df['native_country'].fillna(adult_df['native_country'].mode()[0], inplace=True)

In [17]:
adult_df = pd.get_dummies(data=adult_df, columns=categoricals, drop_first=True)

In [18]:
adult_df.head()

Out[18]:

In [26]:
from sklearn.preprocessing import RobustScaler

# all columns except our target column for X
X = adult_df.drop('income_ >50K', axis=1)
y = adult_df['income_ >50K']

# create our scaler object
scaler = RobustScaler()

# use our scaler object to transform/scale our data; only the numerical columns need scaling
X_scaled = scaler.fit_transform(X[numericals])

# reassign X[numericals] to the transformed numerical data
X[numericals] = X_scaled

In [27]:
X.head()

Out[27]:

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [29]:
from sklearn.naive_bayes import GaussianNB

# instantiate the model to train a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# fit the model
gnb.fit(X_train, y_train)

Out[29]:

In [30]:
y_pred = gnb.predict(X_test)

In [31]:
from sklearn.metrics import accuracy_score
print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

In [32]:
y_pred_train = gnb.predict(X_train)
print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

In [33]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n\n', cm)
# with class 1 (income >50K) as the positive class, cm[0,0] holds the
# true negatives and cm[1,1] the true positives
print('\nTrue Negatives(TN) = ', cm[0,0])
print('\nTrue Positives(TP) = ', cm[1,1])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])

In [34]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

In [37]:
from sklearn.model_selection import cross_val_score

# applying 10-fold cross validation
scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))

# compute the average cross-validation score
print('\nAverage cross-validation score: {:.4f}'.format(scores.mean()))