Description
Overview
In this homework assignment, you will explore, analyze and model a data set containing approximately 8000
records representing a customer at an auto insurance company. Each record has two response variables. The
first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero
means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero
if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data
to predict the probability that a person will crash their car and also the amount of money it will cost if the person
does crash their car. You can only use the variables given to you (or variables that you derive from the variables
provided). Below is a short description of the variables of interest in the data set:
VARIABLE NAME DEFINITION THEORETICAL EFFECT
INDEX Identification Variable (do not use) None
TARGET_FLAG Was Car in a crash? 1=YES 0=NO None
TARGET_AMT If car was in a crash, what was the cost None
AGE Age of Driver Very young people tend to be risky. Maybe very old people also.
BLUEBOOK Value of Vehicle Unknown effect on probability of collision, but probably effect the payout if there is a crash
CAR_AGE Vehicle Age Unknown effect on probability of collision, but probably effect the payout if there is a crash
CAR_TYPE Type of Car Unknown effect on probability of collision, but probably effect the payout if there is a crash
CAR_USE Vehicle Use Commercial vehicles are driven more, so might increase probability of collision
CLM_FREQ # Claims (Past 5 Years) The more claims you filed in the past, the more you are likely to file in the future
EDUCATION Max Education Level Unknown effect, but in theory more educated people tend to drive more safely
HOMEKIDS # Children at Home Unknown effect
HOME_VAL Home Value In theory, home owners tend to drive more responsibly
INCOME Income In theory, rich people tend to get into fewer crashes
JOB Job Category In theory, white collar jobs tend to be safer
KIDSDRIV # Driving Children When teenagers drive your car, you are more likely to get into crashes
MSTATUS Marital Status In theory, married people drive more safely
MVR_PTS Motor Vehicle Record Points If you get lots of traffic tickets, you tend to get into more crashes
OLDCLAIM Total Claims (Past 5 Years) If your total payout over the past five years was high, this suggests future payouts will be high
PARENT1 Single Parent Unknown effect
RED_CAR A Red Car Urban legend says that red cars (especially red sports cars) are more risky. Is that true?
REVOKED License Revoked (Past 7 Years) If your license was revoked in the past 7 years, you probably are a more risky driver.
SEX Gender Urban legend says that women have less crashes then men. Is that true?
TIF Time in Force People who have been customers for a long time are usually more safe.
TRAVTIME Distance to Work Long drives to work usually suggest greater risk
URBANICITY Home/Work Area Unknown
YOJ Years on Job People who stay at a job for a long time are usually more safe
Deliverables:
A write-up submitted in PDF format. Your write-up should have four sections. Each one is described
below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away
from technical details.
Assigned predictions (probabilities, classifications, cost) for the evaluation data set. Use 0.5 threshold.
Include your R statistical programming code in an Appendix.
Write Up:
1. DATA EXPLORATION (25 Points)
Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a
manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some
suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment.
You should have your own thoughts on what to tell the boss. These are just ideas.
a. Mean / Standard Deviation / Median
b. Bar Chart or Box Plot of the data
c. Is the data correlated to the target variable (or to other variables?)
d. Are any of the variables missing and need to be imputed “fixed”?
2. DATA PREPARATION (25 Points)
Describe how you have transformed the data by changing the original variables or creating new variables. If you
did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
a. Fix missing values (maybe with a Mean or Median value)
b. Create flags to suggest if a variable was missing
c. Transform data by putting it into buckets
d. Mathematical transforms such as log or square root (or use Box-Cox)
e. Combine variables (such as ratios or adding or multiplying) to create new variables
3. BUILD MODELS (25 Points)
Using the training data set, build at least two different multiple linear regression models and three different binary
logistic regression models, using different variables (or the same variables with different transformations). You
may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such
as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a
variable for inclusion into the model or exclusion into the model, indicate why this was done.
Discuss the coefficients in the models, do they make sense? For example, if a person has a lot of traffic tickets,
you would reasonably expect that person to have more car crashes. If the coefficient is negative (suggesting that
the person is a safer driver), then that needs to be discussed. Are you keeping the model even though it is counter
intuitive? Why? The boss needs to know.
4. SELECT MODELS (25 Points)
Decide on the criteria for selecting the best multiple linear regression model and the best binary logistic regression
model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious?
Discuss why you selected your models.
For the multiple linear regression model, will you use a metric such as Adjusted R2
, RMSE, etc.? Be sure to
explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other
relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a)
mean squared error, (b) R2
, (c) F-statistic, and (d) residual plots. For the binary logistic regression model, will you
use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic
regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f)
F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.