STA 4364 Midterm 2

Problem 1: (25 points) This problem will involve logistic regression on the dataset midterm data 2.csv.
The response column is response and all other columns are features.
(a) (5 points) Load the dataset. Remove any unnecessary columns. For any columns that have NA values,
fill in the NA values with the median over all non-missing entries in that column. Format all columns
with string entries as categorical variables. Make response a categorical variable. Split the dataset
into a training set (75% of observations) and validation set (25% of observations).
(b) (5 points) Make a model using all features. Narrow down your features to make a reduced model that
uses only the most relevant predictors.
(c) (5 points) Create an ROC curve for your full and reduced model on both the training and validation
sets (4 curves in all). Comment on the degree of overfitting for validation performance vs. training
performance and the adequacy of your reduced model compared to your full model.
(d) (5 points) Using your reduced model, predict P(response = 1|features) for the validation set.
Obtain predictions for the binary response by thresholding your predicted probabilities
P(response = 1|features) at two different values: 0.5 and 0.65. Calculate the overall prediction
accuracy for both thresholds. Calculate the False Negative Rate for both thresholds.
(e) (5 points) Make two altered copies of your validation set: one where feat.d is set to 1 for all rows, and
another where feat.d is set to 0 for all rows. All other columns should remain the same as your original
validation set. Using your reduced model, perform predictions for P(response = 1|features) for both
altered validation sets, and average the predicted probabilities across all validation observations (end
up with 2 average probabilities, one for each altered dataset). Finally, calculate the difference between
these average probabilities (either order for the subtraction is OK). How can you interpret the average
difference that you have found?
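
The calculations in parts (d) and (e) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the intended solution: the coefficients, intercept, the feature name x, and the toy rows are all made up, and in practice the probabilities would come from the fitted reduced model. The False Negative Rate is taken here as FN / (FN + TP), and a row is predicted as 1 when its probability is at or above the threshold.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(rows, coefs, intercept):
    """Return P(response = 1 | features) for each row (dict: feature -> value)."""
    return [
        sigmoid(intercept + sum(coefs[name] * row[name] for name in coefs))
        for row in rows
    ]


def accuracy_and_fnr(y_true, probs, threshold):
    """Overall accuracy and False Negative Rate at a given probability threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(1 for pred, y in zip(preds, y_true) if pred == y)
    false_neg = sum(1 for pred, y in zip(preds, y_true) if y == 1 and pred == 0)
    positives = sum(1 for y in y_true if y == 1)
    fnr = false_neg / positives if positives else float("nan")
    return correct / len(y_true), fnr


def average_probability_difference(rows, coefs, intercept, feature):
    """Mean predicted probability with `feature` set to 1 minus that with it set to 0."""
    rows_one = [{**row, feature: 1} for row in rows]   # altered copy, feature = 1
    rows_zero = [{**row, feature: 0} for row in rows]  # altered copy, feature = 0
    p_one = predict_proba(rows_one, coefs, intercept)
    p_zero = predict_proba(rows_zero, coefs, intercept)
    return sum(p_one) / len(p_one) - sum(p_zero) / len(p_zero)


# Hypothetical "reduced model" and toy validation rows (assumptions, not real data).
toy_coefs = {"feat.d": 1.0, "x": 0.5}
toy_intercept = -0.25
toy_rows = [
    {"feat.d": 1, "x": 2.0},
    {"feat.d": 0, "x": -1.0},
    {"feat.d": 1, "x": 0.0},
    {"feat.d": 0, "x": 1.5},
]
toy_y = [1, 0, 1, 0]

toy_probs = predict_proba(toy_rows, toy_coefs, toy_intercept)
for threshold in (0.5, 0.65):
    acc, fnr = accuracy_and_fnr(toy_y, toy_probs, threshold)
    print(f"threshold={threshold}: accuracy={acc:.3f}, FNR={fnr:.3f}")

diff = average_probability_difference(toy_rows, toy_coefs, toy_intercept, "feat.d")
print(f"average probability difference for feat.d: {diff:.3f}")
```

Under the fitted model, the difference computed in the last step can be read as the average change in predicted P(response = 1|features) when feat.d moves from 0 to 1 while every other feature stays at its observed value. Raising the threshold from 0.5 to 0.65 can only turn predicted 1s into predicted 0s, so the False Negative Rate cannot decrease.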