Week 5 Lab (Logistic Regression) COSC 3337

$30.00

Category: You will Instantly receive a download link for .zip solution file upon Payment || To Order Original Work Click Custom Order?

Description

5/5 - (11 votes)

About The Data Our goal for this lab is to construct a model that can take a certain set of features related to the Titanic and predict whether a person survived or not (0 or 1). Since we’re trying to predict a binary categorical variable (1 or 0), logistic regression seems like a good place to start from. The dataset that we’ll be using for this task comes from kaggle.com and contains the following attributes: PassengerId Survived (0 or 1) Pclass: Ticket class (1, 2, or 3 where 3 is the lowest class) Name Sex Age: Age in years SibSp: # of siblings / spouses aboard the Titanic Parch: # of parents / children aboard the Titanic Ticket: Ticket number Fare: Passenger fare Cabin: Cabin number Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) Note before starting: Please refer back to the matplotlib lab if you’re having trouble creating any graphs up to this point. You’re free to use any library to create your graphs, so don’t feel like you need to match this code 100%. Exploratory Data Analysis Let’s begin by importing some necessary libraries that we’ll be using to explore the data. Our first step is to load the data into a pandas DataFrame PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S From here, it’s always a good step to use describe() and info() to get a better sense of the data and see if we have any missing values. PassengerId Survived Pclass Age SibSp Parch Fare count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000 mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208 std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429 min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000 25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400 50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200 75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000 max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200 We can see that Age, Cabin, and Embarked contain missing values since this dataset contains 891 entries in total, and Age, Cabin, and Embarked only contain 714 non‑null entries, 204 non‑null entries, and 889 non‑null entries respectively. Thus, we will have to take care of these missing values. <class
’pandas.core.frame.DataFrame’> RangeIndex:
891
entries,
0
to
890 Data
columns
(total
12
columns): 
#


Column






Non­Null
Count

Dtype

 ­­­

­­­­­­






­­­­­­­­­­­­­­

­­­­­

 
0


PassengerId

891
non­null



int64

 
1


Survived




891
non­null



int64

 
2


Pclass






891
non­null



int64

 
3


Name








891
non­null



object
 
4


Sex









891
non­null



object
 
5


Age









714
non­null



float64 
6


SibSp







891
non­null



int64

 
7


Parch







891
non­null



int64

 
8


Ticket






891
non­null



object
 
9


Fare








891
non­null



float64 
10

Cabin







204
non­null



object
 
11

Embarked




889
non­null



object
 dtypes:
float64(2),
int64(5),
object(5) memory
usage:
83.7+
KB Note, we can also make a plot of our missing data if we’d prefer to visualize it. Here we use seaborn’s barplot sns.barplot(x, y) and pass our DataFrame’s columns as the x axis and the sum of all missing values in each column in the y axis. since Embarked only has 2 missing values, it’s very hard to see, but there’s a slight raise in the y axis under Embarked. Tip: If you’re ever confused how a chained line of code works in this course, just break it down into multiple steps. For example, say you didn’t know how the piece of code above ‘y=titanic_data.isnull().sum().values’ gives us all of the missing values. Well, let’s break it down. titanic_data.isnull() gives us back the original DataFrame (titanic_data), but with True and False values placed where there is a missing value. PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 False False False False False False False False False False True False 1 False False False False False False False False False False False False 2 False False False False False False False False False False True False 3 False False False False False False False False False False False False 4 False False False False False False False False False False True False … … … … … … … … … … … … … 886 False False False False False False False False False False True False 887 False False False False False False False False False False False False 888 False False False False False True False False False False True False 889 False False False False False False False False False False False False 890 False False False False False False False False False False True False 891 rows × 12 columns Then calling .sum() off of this gives us back a Series telling us how many true (missing values) were in each column. Recall that True is an alias for 1, which is why we can take the sum of True False columns. PassengerId





0 Survived








0 Pclass










0 Name












0 Sex













0 Age











177 SibSp











0 Parch











0 Ticket










0 Fare












0 Cabin









687 Embarked








2 dtype:
int64 Finally, if you remember from lab 3, calling .index on this will give us the index labels (left side), and .values will give us the missing value counts for each column (right side), which is the array that we passed in as y. array([

0,


0,


0,


0,


0,
177,


0,


0,


0,


0,
687,


2]) Keep this tip in mind when exploring other people’s notebooks on github or kaggle, since you’ll soon find out that it’s very common on kaggle for people to chain functions together, which can sometimes be hard to understand at first, but much easier to understand once you break it down into smaller chunks. Let’s continue on with our data exploration by next seeing how many people survived (1) and did not survive (0) in our dataset. To accomplish this, we can pass any column in our DataFrame into sns.countplot(x), which will list all of the unique values in that column along the x‑axis, and plot the total counts for each unique value along the y‑axis. So here we can see that majority of the people in our dataset did not survive (0). Did more men or females survive? Recall that hue parameter seaborn gives us access too. This will let us expand on the previous graph by also telling us how many from each value (0 or 1) were male and female. Interpretation: We can see that from those who did not survive (0), majority of them were male. How about from ticket class? Was the lower class less likely to survive? Interpretation: We can see that from those who did not survive (0), majority of them were from the lower class, 3. What did the Titanic age distribution look like? count



714.000000 mean





29.699118 std






14.526497 min







0.420000 25%






20.125000 50%






28.000000 75%






38.000000 max






80.000000 Name:
Age,
dtype:
float64 Interpretation: The average age on the Titanic seems to be ~30, with 75% of people onboard being 38 years of age or younger. What’s the most common number of siblings one had with them on the Titanic? Interpretation: Majority of those onboard had 0 siblings/spouses also onboard, with the 2nd most popular being having 1 sibling/spouse onboard (most likely that 1 person onboard was a spouse). What was the Fare distribution on the Titanic? How much did the average person pay? count



891.000000 mean





32.204208 std






49.693429 min







0.000000 25%







7.910400 50%






14.454200 75%






31.000000 max





512.329200 Name:
Fare,
dtype:
float64 Interpretation: The average person paid 32.204208, with 75% of people paying 31.000000 or less. One interesting note is that the min is 0. This could mean that there were people unaccounted for who managed to sneak in for free. Or someone who won a free ride or something. Data Preprocessing Let’s first take care of our missing values. Recall how much data was missing: For Age, our best bet would be to impute any missing values with the mean age. We can do this very quickly with pandas .apply(func). This will apply any function to every value along a column. If you’re not familiar with lambda functions, you can create a normal python function that accepts the age and mean_age, and returns the mean age if age is null, or the age itself if it’s not null. Then you can supply that function to .apply(func). So here we’re reassigning the titanic_data[‘Age’] column to titanic_data[‘Age’] after our function has been applied on it, which will essentially fill any missing age values with the mean age calculated. If we recreate our missing data plot, we can see that there are no longer any missing Age values. For Cabin, we have so much data missing (more missing than non‑null data) that performing any type of imputation seems like a bad idea since we don’t have much original data to work with. For this reason, we will just drop this column. I will go ahead and also drop the 2 missing Embarked rows while we’re at it, but you can choose to keep them if you’d like and impute them. Recalling .info(), we can see that there are no more missing values in this dataset. <class
’pandas.core.frame.DataFrame’> Int64Index:
889
entries,
0
to
890 Data
columns
(total
11
columns): 
#


Column






Non­Null
Count

Dtype

 ­­­

­­­­­­






­­­­­­­­­­­­­­

­­­­­

 
0


PassengerId

889
non­null



int64

 
1


Survived




889
non­null



int64

 
2


Pclass






889
non­null



int64

 
3


Name








889
non­null



object
 
4


Sex









889
non­null



object
 
5


Age









889
non­null



float64 
6


SibSp







889
non­null



int64

 
7


Parch







889
non­null



int64

 
8


Ticket






889
non­null



object
 
9


Fare








889
non­null



float64 
10

Embarked




889
non­null



object
 dtypes:
float64(2),
int64(5),
object(4) memory
usage:
83.3+
KB Our next step is to handle categorical variables since machine learning algorithms can only understand numbers. The variables to consider are Name, Sex, Ticket, and Embarked. We’ll use dummy variables for Sex and Embarked and drop Name and Ticket. You can choose to do some type of feature engineering on Name and Ticket and compare it with our model without these features if you wish. Recall that a dummy variable is a variable that takes the value 0 or 1 to indicate the absence or presence of some category. Pandas has a convenient function pd.get_dummies(data, columns) that will automatically assign dummy variables for us. For example, if we include Sex in columns, it will create 2 new columns (sex_male, sex_female) and place a 1 for the one that’s true, and 0 in the other. So if a specific observation is female, we will place a 1 in sex_female and 0 in sex_male. One important note is that you should always add an additional drop_first=True parameter when using get_dummies. This will drop one of the columns created in the dummy process, since keeping all of them will result in multicollinearity. Now that our data is in the correct form, we’re ready to begin building our model. PassengerId Survived Pclass Age SibSp Parch Fare Sex_male Embarked_Q Embarked_S 0 1 0 3 22.0 1 0 7.2500 1 0 1 1 2 1 1 38.0 1 0 71.2833 0 0 0 2 3 1 3 26.0 0 0 7.9250 0 0 1 3 4 1 1 35.0 1 0 53.1000 0 0 1 4 5 0 3 35.0 0 0 8.0500 1 0 1 Creating our Logistic Regression Model We’re now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be done using sklearn’s train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and the test_size you’d like (Generally a test size of around 0.3 is good enough). It will then return a tuple of X_train, X_test, y_train, y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model. We’ll now import sklearn’s LogisticRegression model and begin training it using the fit(train_data, train_data_labels) method. In a nutshell, fitting is equal to training. Then, after it is trained, the model can be used to make predictions, usually with a predict(test_data) method call. LogisticRegression(max_iter=1000) Model Evaluation Now that we’ve finished training, we can make predictions off of the test data and evaluate our model’s performance using the corresponding test data labels (y_test). Since we’re now dealing with classification, we’ll import sklearn’s classification_report and confusion_matrix to evaluate our model. Both of these take the true values and predictions as parameters. 













precision



recall

f1­score


support 










0






0.85





0.87





0.86






167 










1






0.77





0.75





0.76






100 



accuracy


























0.82






267 


macro
avg






0.81





0.81





0.81






267 weighted
avg






0.82





0.82





0.82






267 [[145

22] 
[
25

75]] Not bad ! We could certainly do better, but we’ll leave it up to you to mess around with the data some more and see what you can imporove on. You can also check out the actual kaggle competition with the full Titanic dataset and compete on there with your classmates. Kaggle competitions are great for testing your new datascience skills. In
[1]: import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns In
[2]: from matplotlib import rcParams rcParams[‘figure.figsize’] = 15, 5 sns.set_style(‘darkgrid’) In
[3]: titanic_data = pd.read_csv(‘titanic.csv’) titanic_data.head() Out[3]: In
[4]: titanic_data.describe() Out[4]: In
[5]: titanic_data.info() In
[6]: sns.barplot(x=titanic_data.columns, y=titanic_data.isnull().sum().values) plt.xticks(rotation=45) plt.show() In
[7]: titanic_data.isnull() Out[7]: In
[8]: titanic_data.isnull().sum() Out[8]: In
[9]: titanic_data.isnull().sum().values Out[9]: In
[10]: sns.countplot(x=titanic_data[‘Survived’]) plt.show() In
[11]: sns.countplot(x=titanic_data[‘Survived’], hue=’Sex’, data=titanic_data) plt.show() In
[12]: sns.countplot(x=titanic_data[‘Survived’], hue=’Pclass’, data=titanic_data) plt.show() In
[13]: sns.histplot(x=titanic_data[‘Age’].dropna()) plt.show() titanic_data[‘Age’].describe() Out[13]: In
[14]: sns.countplot(x=titanic_data[‘SibSp’]) plt.show() In
[15]: sns.histplot(x=titanic_data[‘Fare’]) plt.show() titanic_data[‘Fare’].describe() Out[15]: In
[16]: sns.barplot(x=titanic_data.columns, y=titanic_data.isnull().sum().values) plt.xticks(rotation=45) plt.show() In
[17]: mean_age = int(titanic_data[‘Age’].mean()) titanic_data[‘Age’] = titanic_data[‘Age’].apply(lambda age : mean_age if pd.isnull(age) else age) In
[18]: sns.barplot(x=titanic_data.columns, y=titanic_data.isnull().sum().values) plt.xticks(rotation=45) plt.show() In
[19]: titanic_data.drop(labels=[‘Cabin’], axis=1, inplace=True) titanic_data.dropna(inplace=True) In
[20]: titanic_data.info() In
[21]: titanic_data = pd.get_dummies(data=titanic_data, columns=[‘Sex’, ‘Embarked’], drop_first=True) titanic_data.drop(labels=[‘Name’,’Ticket’], axis=1, inplace=True) In
[22]: titanic_data.head() Out[22]: In
[23]: from sklearn.model_selection import train_test_split X = titanic_data[[‘PassengerId’, ‘Pclass’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Fare’,’Sex_male’, ‘Embarked_Q’, 























’Embarked_S’]] y = titanic_data[‘Survived’] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) In
[24]: from sklearn.linear_model import LogisticRegression logmodel = LogisticRegression(max_iter=1000) logmodel.fit(X_train,y_train) Out[24]: In
[25]: predictions = logmodel.predict(X_test) In
[27]: from sklearn.metrics import classification_report from sklearn.metrics import confusion_matrix print(classification_report(y_test,predictions)) print(confusion_matrix(y_test,predictions))