Description
Overview
In this homework assignment, you will explore, analyze and model a data set containing information on crime
for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime
rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether the
neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the
evaluation data set using your binary logistic regression model. You can only use the variables given to you (or
variables that you derive from the variables provided). Below is a short description of the variables of interest in
the data set:
zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)
indus: proportion of non-retail business acres per suburb (predictor variable)
chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable)
nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)
rm: average number of rooms per dwelling (predictor variable)
age: proportion of owner-occupied units built prior to 1940 (predictor variable)
dis: weighted mean of distances to five Boston employment centers (predictor variable)
rad: index of accessibility to radial highways (predictor variable)
tax: full-value property-tax rate per $10,000 (predictor variable)
ptratio: pupil-teacher ratio by town (predictor variable)
black: 1000(Bk – 0.63)2 where Bk is the proportion of blacks by town (predictor variable)
lstat: lower status of the population (percent) (predictor variable)
medv: median value of owner-occupied homes in $1000s (predictor variable)
target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)
Deliverables:
A write-up submitted in PDF format. Your write-up should have four sections. Each one is described
below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away
from technical details.
Assigned prediction (probabilities, classifications) for the evaluation data set. Use 0.5 threshold.
Include your R statistical programming code in an Appendix.
Write Up:
1. DATA EXPLORATION (25 Points)
Describe the size and the variables in the crime training data set. Consider that too much detail will cause a
manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some
suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment.
You should have your own thoughts on what to tell the boss. These are just ideas.
a. Mean / Standard Deviation / Median
b. Bar Chart or Box Plot of the data
c. Is the data correlated to the target variable (or to other variables?)
d. Are any of the variables missing and need to be imputed “fixed”?
2. DATA PREPARATION (25 Points)
Describe how you have transformed the data by changing the original variables or creating new variables. If you
did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
a. Fix missing values (maybe with a Mean or Median value)
b. Create flags to suggest if a variable was missing
c. Transform data by putting it into buckets
d. Mathematical transforms such as log or square root (or use Box-Cox)
e. Combine variables (such as ratios or adding or multiplying) to create new variables
3. BUILD MODELS (25 Points)
Using the training data, build at least three different binary logistic regression models, using different variables (or
the same variables with different transformations). You may select the variables manually, use an approach such
as Forward or Stepwise, use a different approach, or use a combination of techniques. Describe the techniques
you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why
this was done.
Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output.
Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter
intuitive? Why? The boss needs to know.
4. SELECT MODELS (25 Points)
Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly
worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.
For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using
the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error
rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions
using the evaluation data set.