Week 5 Lab (Logistic Regression) COSC 3337

About The Data

Our goal for this lab is to construct a model that can take a certain set of features related to the Titanic and predict whether a person survived or not (0 or 1). Since we're trying to predict a binary categorical variable (1 or 0), logistic regression is a natural place to start. The dataset that we'll be using for this task comes from kaggle.com and contains the following attributes:

- PassengerId
- Survived (0 or 1)
- Pclass: Ticket class (1, 2, or 3, where 3 is the lowest class)
- Name
- Sex
- Age: Age in years
- SibSp: # of siblings / spouses aboard the Titanic
- Parch: # of parents / children aboard the Titanic
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Note before starting: Please refer back to the matplotlib lab if you're having trouble creating any graphs up to this point. You're free to use any library to create your graphs, so don't feel like you need to match this code 100%.
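For intuition before we dive in (this helper is just an illustration, not part of the lab code): logistic regression feeds a weighted sum of the features through the sigmoid function, which squashes any real-valued score into the (0, 1) range so it can be read as a survival probability.

import numpy as np

def sigmoid(z):
    """Map any real-valued score z into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# A strongly negative score maps near 0 (predict "did not survive"),
# a strongly positive score maps near 1 (predict "survived").
print(sigmoid(-4.0), sigmoid(0.0), sigmoid(4.0))  # ~0.018  0.5  ~0.982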
Exploratory Data Analysis

Let's begin by importing some necessary libraries that we'll be using to explore the data. Our first step is to load the data into a pandas DataFrame:

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

From here, it's always a good step to use describe() and info() to get a better sense of the data and see if we have any missing values.

       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200

We can see that Age, Cabin, and Embarked contain missing values: the dataset contains 891 entries in total, but Age, Cabin, and Embarked only contain 714, 204, and 889 non-null entries respectively. Thus, we will have to take care of these missing values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Note, we can also make a plot of our missing data if we'd prefer to visualize it. Here we use seaborn's barplot sns.barplot(x, y) and pass our DataFrame's columns as the x axis and the sum of all missing values in each column as the y axis. Since Embarked only has 2 missing values it's very hard to see, but there's a slight rise in the y axis under Embarked.

Tip: If you're ever confused how a chained line of code works in this course, just break it down into multiple steps. For example, say you didn't know how the piece of code above, y=titanic_data.isnull().sum().values, gives us all of the missing values. Well, let's break it down. titanic_data.isnull() gives us back the original DataFrame (titanic_data), but with True and False values marking where there is a missing value.

     PassengerId  Survived  Pclass  Name   Sex    Age    SibSp  Parch  Ticket  Fare   Cabin  Embarked
0    False        False     False   False  False  False  False  False  False   False  True   False
1    False        False     False   False  False  False  False  False  False   False  False  False
2    False        False     False   False  False  False  False  False  False   False  True   False
3    False        False     False   False  False  False  False  False  False   False  False  False
4    False        False     False   False  False  False  False  False  False   False  True   False
...  ...          ...       ...     ...    ...    ...    ...    ...    ...     ...    ...    ...
886  False        False     False   False  False  False  False  False  False   False  True   False
887  False        False     False   False  False  False  False  False  False   False  False  False
888  False        False     False   False  False  True   False  False  False   False  True   False
889  False        False     False   False  False  False  False  False  False   False  False  False
890  False        False     False   False  False  False  False  False  False   False  True   False

891 rows × 12 columns

Then calling .sum() off of this gives us back a Series telling us how many True values (missing values) were in each column. Recall that True counts as 1 in arithmetic, which is why we can take the sum of True/False columns.

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
Finally, if you remember from lab 3, calling .index on this will give us the index labels (left side), and .values will give us the missing value counts for each column (right side), which is the array that we passed in as y.

array([  0,   0,   0,   0,   0, 177,   0,   0,   0,   0, 687,   2])

Keep this tip in mind when exploring other people's notebooks on GitHub or Kaggle: you'll soon find that it's very common on Kaggle for people to chain functions together, which can be hard to understand at first, but much easier once you break it down into smaller chunks.
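To make that tip concrete, here is the same chained line unpacked into one step per statement (a sketch only; the intermediate variable names are made up for illustration):

missing_mask = titanic_data.isnull()   # DataFrame of True/False flags
missing_counts = missing_mask.sum()    # Series: one missing-value count per column
y_values = missing_counts.values       # plain NumPy array, what we passed in as y

# Equivalent to the original one-liner:
# y_values = titanic_data.isnull().sum().values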



Let's continue on with our data exploration by next seeing how many people survived (1) and did not survive (0) in our dataset. To accomplish this, we can pass any column in our DataFrame into sns.countplot(x), which will list all of the unique values in that column along the x-axis and plot the total counts for each unique value along the y-axis. Here we can see that the majority of the people in our dataset did not survive (0).

Did more men or women survive? Recall the hue parameter that seaborn gives us access to. This lets us expand on the previous graph by also telling us how many from each value (0 or 1) were male and female.

Interpretation: We can see that of those who did not survive (0), the majority were male.

How about ticket class? Was the lower class less likely to survive?

Interpretation: We can see that of those who did not survive (0), the majority were from the lowest class, 3.

What did the Titanic age distribution look like?

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
Interpretation: The average age on the Titanic seems to be ~30, with 75% of people onboard being 38 years of age or younger.

What's the most common number of siblings one had with them on the Titanic?

Interpretation: The majority of those onboard had 0 siblings/spouses also onboard, with the second most common being 1 sibling/spouse onboard (most likely a spouse).

What was the Fare distribution on the Titanic? How much did the average person pay?

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64
Interpretation: The average person paid 32.204208, with 75% of people paying 31.000000 or less. One interesting note is that the minimum fare is 0, which could mean a few people traveled for free, whether by sneaking aboard or by winning or being given a free ticket.

Data Preprocessing

Let's first take care of our missing values. Recall how much data was missing. For Age, our best bet is to impute any missing values with the mean age. We can do this very quickly with pandas .apply(func), which applies a function to every value along a column. Here we're reassigning the titanic_data['Age'] column to titanic_data['Age'] after our function has been applied to it, which fills any missing age values with the calculated mean age. If you're not familiar with lambda functions, you can instead write a normal Python function that returns the mean age if the age is null, or the age itself if it's not, and supply that function to .apply(func), as in the sketch below.
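A sketch of that named-function version (impute_age is a hypothetical name, and here it closes over mean_age rather than taking it as a second parameter):

import pandas as pd

mean_age = int(titanic_data['Age'].mean())

def impute_age(age):
    """Return the precomputed mean age for missing entries, else the age itself."""
    if pd.isnull(age):
        return mean_age
    return age

titanic_data['Age'] = titanic_data['Age'].apply(impute_age)

pandas also has a built-in for exactly this pattern: titanic_data['Age'].fillna(mean_age) produces the same result in a single call.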
If we recreate our missing data plot, we can see that there are no longer any missing Age values.

For Cabin, we have so much data missing (more missing than non-null data) that performing any type of imputation seems like a bad idea, since we don't have much original data to work with. For this reason, we will simply drop this column. I will also drop the 2 rows with missing Embarked values while we're at it, but you can choose to keep and impute them if you'd like.

Recalling .info(), we can see that there are no more missing values in this dataset.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  889 non-null    int64
 1   Survived     889 non-null    int64
 2   Pclass       889 non-null    int64
 3   Name         889 non-null    object
 4   Sex          889 non-null    object
 5   Age          889 non-null    float64
 6   SibSp        889 non-null    int64
 7   Parch        889 non-null    int64
 8   Ticket       889 non-null    object
 9   Fare         889 non-null    float64
 10  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB
Our next step is to handle categorical variables, since machine learning algorithms can only understand numbers. The variables to consider are Name, Sex, Ticket, and Embarked. We'll use dummy variables for Sex and Embarked and drop Name and Ticket. You can choose to do some type of feature engineering on Name and Ticket and compare the result against our model without these features if you wish.

Recall that a dummy variable is a variable that takes the value 0 or 1 to indicate the absence or presence of some category. Pandas has a convenient function pd.get_dummies(data, columns) that will automatically create dummy variables for us. For example, if we include Sex in columns, it will create 2 new columns (Sex_male, Sex_female) and place a 1 in the one that's true and a 0 in the other. So if a specific observation is female, we place a 1 in Sex_female and a 0 in Sex_male. One important note is that you should always add the drop_first=True parameter when using get_dummies. This drops one of the columns created in the dummy process, since keeping all of them would result in multicollinearity.
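To see what drop_first=True actually does, here is a toy example (a standalone sketch, independent of the Titanic data):

import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# Without drop_first: two redundant columns (Sex_female is always 1 - Sex_male).
print(pd.get_dummies(toy, columns=['Sex']))

# With drop_first: the first category is dropped, leaving only Sex_male;
# "female" is now encoded implicitly as Sex_male == 0.
print(pd.get_dummies(toy, columns=['Sex'], drop_first=True))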













Now that our data is in the correct form, we're ready to begin building our model.

   PassengerId  Survived  Pclass  Age   SibSp  Parch  Fare     Sex_male  Embarked_Q  Embarked_S
0  1            0         3       22.0  1      0      7.2500   1         0           1
1  2            1         1       38.0  1      0      71.2833  0         0           0
2  3            1         3       26.0  0      0      7.9250   0         0           1
3  4            1         1       35.0  1      0      53.1000  0         0           1
4  5            0         3       35.0  0      0      8.0500   1         0           1

Creating our Logistic Regression Model

We're now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be done using sklearn's train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and the test_size you'd like (generally a test size of around 0.3 is good enough). It then returns a tuple of X_train, X_test, y_train, and y_test sets. We will train our model on the training set and then use the test set to evaluate the model.
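One side note: train_test_split shuffles the rows randomly, so each run produces a different split and slightly different scores. The lab code below does not pin the shuffle, but if you want reproducible numbers you can pass the optional random_state parameter (the value 42 here is an arbitrary choice):

from sklearn.model_selection import train_test_split

# random_state pins the shuffle so the same rows land in train/test every run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)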





We'll now import sklearn's LogisticRegression model and begin training it using the fit(train_data, train_data_labels) method. In a nutshell, fitting is the same thing as training. Once trained, the model can be used to make predictions, usually with a predict(test_data) method call.

LogisticRegression(max_iter=1000)

Model Evaluation

Now that we've finished training, we can make predictions on the test data and evaluate our model's performance using the corresponding test data labels (y_test). Since we're now dealing with classification, we'll import sklearn's classification_report and confusion_matrix to evaluate our model. Both take the true values and the predictions as parameters.

              precision    recall  f1-score   support

           0       0.85      0.87      0.86       167
           1       0.77      0.75      0.76       100

    accuracy                           0.82       267
   macro avg       0.81      0.81      0.81       267
weighted avg       0.82      0.82      0.82       267

[[145  22]
 [ 25  75]]
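A quick note on reading that confusion matrix: for binary labels, scikit-learn puts the true class along the rows and the predicted class along the columns, so the layout is [[TN, FP], [FN, TP]]:

# [[145  22]    145 true negatives  (did not survive, predicted did not survive)
#  [ 25  75]]    22 false positives (did not survive, predicted survived)
#                25 false negatives (survived, predicted did not survive)
#                75 true positives  (survived, predicted survived)
# Accuracy = (145 + 75) / 267 ≈ 0.82, matching the report above.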

Not bad! We could certainly do better, but we'll leave it up to you to play around with the data some more and see what you can improve on. You can also check out the actual Kaggle competition with the full Titanic dataset and compete there with your classmates. Kaggle competitions are great for testing your new data science skills.
In [1]: import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn as sns

In [2]: from matplotlib import rcParams
        rcParams['figure.figsize'] = 15, 5
        sns.set_style('darkgrid')

In [3]: titanic_data = pd.read_csv('titanic.csv')
        titanic_data.head()

In [4]: titanic_data.describe()

In [5]: titanic_data.info()

In [6]: sns.barplot(x=titanic_data.columns, y=titanic_data.isnull().sum().values)
        plt.xticks(rotation=45)
        plt.show()

In [7]: titanic_data.isnull()

In [8]: titanic_data.isnull().sum()

In [9]: titanic_data.isnull().sum().values

In [10]: sns.countplot(x=titanic_data['Survived'])
         plt.show()

In [11]: sns.countplot(x=titanic_data['Survived'], hue='Sex', data=titanic_data)
         plt.show()

In [12]: sns.countplot(x=titanic_data['Survived'], hue='Pclass', data=titanic_data)
         plt.show()

In [13]: sns.histplot(x=titanic_data['Age'].dropna())
         plt.show()
         titanic_data['Age'].describe()

In [14]: sns.countplot(x=titanic_data['SibSp'])
         plt.show()

In [15]: sns.histplot(x=titanic_data['Fare'])
         plt.show()
         titanic_data['Fare'].describe()

In [16]: sns.barplot(x=titanic_data.columns, y=titanic_data.isnull().sum().values)
         plt.xticks(rotation=45)
         plt.show()

In [17]: mean_age = int(titanic_data['Age'].mean())
         titanic_data['Age'] = titanic_data['Age'].apply(lambda age: mean_age if pd.isnull(age) else age)

In [18]: sns.barplot(x=titanic_data.columns, y=titanic_data.isnull().sum().values)
         plt.xticks(rotation=45)
         plt.show()

In [19]: titanic_data.drop(labels=['Cabin'], axis=1, inplace=True)
         titanic_data.dropna(inplace=True)

In [20]: titanic_data.info()

In [21]: titanic_data = pd.get_dummies(data=titanic_data, columns=['Sex', 'Embarked'], drop_first=True)
         titanic_data.drop(labels=['Name', 'Ticket'], axis=1, inplace=True)

In [22]: titanic_data.head()

In [23]: from sklearn.model_selection import train_test_split
         X = titanic_data[['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
                           'Sex_male', 'Embarked_Q', 'Embarked_S']]
         y = titanic_data['Survived']
         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [24]: from sklearn.linear_model import LogisticRegression
         logmodel = LogisticRegression(max_iter=1000)
         logmodel.fit(X_train, y_train)

In [25]: predictions = logmodel.predict(X_test)

In [27]: from sklearn.metrics import classification_report
         from sklearn.metrics import confusion_matrix
         print(classification_report(y_test, predictions))
         print(confusion_matrix(y_test, predictions))