Exercise 1 (longley Macroeconomic Data)
The built-in dataset longley contains macroeconomic data for predicting employment. We will attempt to
model the Employed variable.
View(longley)
?longley
(a) What is the largest correlation between any pair of predictors in the dataset?
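A minimal sketch of one way to answer this, using cor() on the predictor columns only:
# correlation matrix of the predictors (drop the response, Employed)
pred_cor = cor(subset(longley, select = -c(Employed)))
# ignore the diagonal of 1s, then find the largest pairwise correlation
diag(pred_cor) = 0
max(pred_cor)
which(pred_cor == max(pred_cor), arr.ind = TRUE)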
(b) Fit a model with Employed as the response and the remaining variables as predictors. Calculate and
report the variance inflation factor (VIF) for each of the predictors. Which variable has the largest VIF? Do
any of the VIFs suggest multicollinearity?
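One way to compute these is with the vif() function from the faraway package (car::vif() behaves similarly); a sketch:
# fit the full additive model
employ_mod = lm(Employed ~ ., data = longley)
# VIFs; values above 5 (or 10, by a stricter rule) suggest multicollinearity
faraway::vif(employ_mod)
which.max(faraway::vif(employ_mod))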
(c) What proportion of the observed variation in Population is explained by a linear relationship with the
other predictors?
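This proportion is the R² from regressing Population on the other predictors (the same quantity that drives its VIF). A minimal sketch:
# regress Population on the remaining predictors, excluding the response
pop_mod = lm(Population ~ . - Employed, data = longley)
summary(pop_mod)$r.squared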
(d) Calculate the partial correlation coefficient for Population and Employed with the effects of the
other predictors removed.
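One standard approach correlates the residuals of two regressions, each of which removes the effects of the other predictors; a sketch:
# residuals of Population and Employed with the other predictors removed
pop_resid = resid(lm(Population ~ . - Employed, data = longley))
emp_resid = resid(lm(Employed ~ . - Population, data = longley))
# partial correlation coefficient
cor(pop_resid, emp_resid)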
(e) Fit a new model with Employed as the response and the predictors from the model in (b) that were
significant. (Use α = 0.05.) Calculate and report the variance inflation factor for each of the predictors.
Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?
(f) Use an F-test to compare the models in parts (b) and (e). Report the following (a code sketch appears after this list):
• The null hypothesis
• The test statistic
• The distribution of the test statistic under the null hypothesis
• The p-value
• A decision
• Which model you prefer, (b) or (e)
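In R, anova() on the two nested fits reports all of these quantities. A sketch, assuming the part (b) and (e) models are stored in the hypothetical variables employ_mod_full and employ_mod_small:
# F-test with the smaller model under the null hypothesis
anova(employ_mod_small, employ_mod_full)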
(g) Check the assumptions of the model chosen in part (f). Do any assumptions appear to be violated?
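A sketch of the usual diagnostics, assuming the chosen model is stored in the hypothetical variable employ_mod_chosen:
library(lmtest)
# fitted-versus-residuals plot to check linearity and constant variance
plot(fitted(employ_mod_chosen), resid(employ_mod_chosen),
     xlab = "Fitted", ylab = "Residuals")
abline(h = 0, col = "dodgerblue")
# Breusch-Pagan test of constant variance
bptest(employ_mod_chosen)
# Q-Q plot and Shapiro-Wilk test of normality
qqnorm(resid(employ_mod_chosen))
qqline(resid(employ_mod_chosen))
shapiro.test(resid(employ_mod_chosen))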
Exercise 2 (Credit Data)
For this exercise, use the Credit data from the ISLR package. Use the following code to remove the ID
variable, which is not useful for modeling.
library(ISLR)
data(Credit)
Credit = subset(Credit, select = -c(ID))
Use ?Credit to learn about this dataset.
(a) Find a “good” model for balance using the available predictors. Use any methods seen in class except
transformations of the response. The model should:
• Reach a LOOCV-RMSE below 140
• Obtain an adjusted R² above 0.90
• Fail to reject the Breusch-Pagan test with an α of 0.01
• Use fewer than 10 β parameters
Store your model in a variable called mod_a. Run the two given chunks to verify your model meets the
requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting
at least some of the criteria.
library(lmtest)
get_bp_decision = function(model, alpha) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}
get_sw_decision = function(model, alpha) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}
get_num_params = function(model) {
  length(coef(model))
}
get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}
get_loocv_rmse(mod_a)
get_adj_r2(mod_a)
get_bp_decision(mod_a, alpha = 0.01)
get_num_params(mod_a)
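If you are unsure where to begin, one option is a backward AIC search starting from a model with all two-way interactions, assuming the response column is named Balance as in ISLR's Credit data. This is only a sketch of a starting point, not a model guaranteed to satisfy every criterion above:
# hypothetical starting point: all two-way interactions, then backward AIC
mod_start = lm(Balance ~ . ^ 2, data = Credit)
mod_a = step(mod_start, direction = "backward", trace = 0)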
(b) Find another “good” model for balance using the available predictors. Use any methods seen in class
except transformations of the response. The model should:
• Reach a LOOCV-RMSE below 130
• Obtain an adjusted R² above 0.85
• Fail to reject the Shapiro-Wilk test with an α of 0.01
• Use fewer than 25 β parameters
Store your model in a variable called mod_b. Run the two given chunks to verify your model meets the
requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting
at least some of the criteria.
library(lmtest)
get_bp_decision = function(model, alpha) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}
get_sw_decision = function(model, alpha) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}
get_num_params = function(model) {
  length(coef(model))
}
get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}
get_loocv_rmse(mod_b)
get_adj_r2(mod_b)
get_sw_decision(mod_b, alpha = 0.01)
get_num_params(mod_b)
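The same kind of search can work here; since the normality requirement is now the binding one, a backward search with the BIC penalty (which favors smaller models) is one idea. Again a sketch, not a guaranteed solution:
# hypothetical: backward BIC search from the two-way interaction model
n_credit = nrow(Credit)
mod_b = step(lm(Balance ~ . ^ 2, data = Credit),
             direction = "backward", k = log(n_credit), trace = 0)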
Exercise 3 (Sacramento Housing Data)
For this exercise, use the Sacramento data from the caret package. Use the following code to perform some
preprocessing of the data.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
data(Sacramento)
sac_data = Sacramento
sac_data$limits = factor(ifelse(sac_data$city == "SACRAMENTO", "in", "out"))
sac_data = subset(sac_data, select = -c(city, zip))
Instead of using the city or zip variables that exist in the dataset, we will simply create a variable (limits)
indicating whether or not a house is technically within the city limits of Sacramento. (We do this because
they would both be factor variables with a large number of levels. This is a choice that is made due to
laziness, not necessarily because it is justified. Think about what issues these variables might cause.)
Use ?Sacramento to learn more about this dataset.
A plot of longitude versus latitude gives us a sense of where the city limits are.
qplot(y = longitude, x = latitude, data = sac_data,
      col = limits, main = "Sacramento City Limits")
[Figure: scatter plot of longitude versus latitude, points colored by limits (in/out), titled "Sacramento City Limits".]
After these modifications, we test-train split the data.
set.seed(420)
sac_trn_idx = sample(nrow(sac_data), size = trunc(0.80 * nrow(sac_data)))
sac_trn_data = sac_data[sac_trn_idx, ]
sac_tst_data = sac_data[-sac_trn_idx, ]
The training data should be used for all model fitting. Our goal is to find a model that is useful for predicting
home prices.
(a) Find a “good” model for price. Use any methods seen in class. The model should reach a LOOCV-RMSE below 77,500 in the training data. Do not use any transformations of the response variable.
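One possible starting point, reusing the get_loocv_rmse() helper from Exercise 2; the terms here are illustrative, not a known solution:
# a hypothetical candidate, fit on the training data only
mod_sac = lm(price ~ beds + baths + sqft + type + latitude + longitude + limits,
             data = sac_trn_data)
get_loocv_rmse(mod_sac)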
(b) Is a model that achieves a LOOCV-RMSE below 77,500 useful in this case? That is, is an average error
of 77,500 low enough when predicting home prices? To further investigate, use the held-out test data and
your model from part (a) to do two things:
• Calculate the average percent error:

$$\frac{1}{n}\sum_{i}\frac{\left|\text{predicted}_i - \text{actual}_i\right|}{\text{predicted}_i} \times 100$$
• Plot the predicted versus the actual values and add the line y = x.
Based on all of this information, argue whether or not this model is useful.
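Both checks translate directly into code; a sketch using the hypothetical mod_sac from the part (a) sketch:
# predictions on the held-out test data
sac_pred = predict(mod_sac, newdata = sac_tst_data)
# average percent error, as defined above
mean(abs(sac_pred - sac_tst_data$price) / sac_pred) * 100
# predicted versus actual values, with the line y = x for reference
plot(sac_pred, sac_tst_data$price,
     xlab = "Predicted", ylab = "Actual", main = "Predicted vs Actual Price")
abline(a = 0, b = 1, col = "dodgerblue")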
Exercise 4 (Does It Work?)
In this exercise, we will investigate how well backwards AIC and BIC actually perform. For either to be
“working” correctly, they should result in a low number of both false positives and false negatives. In
model selection,
• False Positive, FP: Incorrectly including a variable in the model; that is, including a non-significant variable
• False Negative, FN: Incorrectly excluding a variable from the model; that is, excluding a significant variable
Consider the true model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{10} x_{10} + \epsilon$$
where $\epsilon \sim N(0, \sigma^2 = 4)$. The true values of the $\beta$ parameters are given in the R code below.
beta_0 = 1
beta_1 = -1
beta_2 = 2
beta_3 = -2
beta_4 = 1
beta_5 = 1
beta_6 = 0
beta_7 = 0
beta_8 = 0
beta_9 = 0
beta_10 = 0
sigma = 2
Then, as we have specified them, some variables are significant, and some are not. We store their names in
R variables for use later.
not_sig = c("x_6", "x_7", "x_8", "x_9", "x_10")
signif = c("x_1", "x_2", "x_3", "x_4", "x_5")
We now simulate values for these x variables, which we will use throughout part (a).
set.seed(420)
n = 100
x_1 = runif(n, 0, 10)
x_2 = runif(n, 0, 10)
x_3 = runif(n, 0, 10)
x_4 = runif(n, 0, 10)
x_5 = runif(n, 0, 10)
x_6 = runif(n, 0, 10)
x_7 = runif(n, 0, 10)
x_8 = runif(n, 0, 10)
x_9 = runif(n, 0, 10)
x_10 = runif(n, 0, 10)
We then combine these into a data frame and simulate y according to the true model.
sim_data_1 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
  y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 +
      beta_5 * x_5 + rnorm(n, 0, sigma)
)
We do a quick check to make sure everything looks correct.
head(sim_data_1)
## x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 y
## 1 6.055 4.088 8.7894 1.8180 0.8198 8.146 9.7305 9.6673 6.915 4.5523 -11.627
## 2 9.703 3.634 5.0768 5.5784 6.3193 6.033 3.2301 2.6707 2.214 0.4861 -0.147
## 3 1.745 3.899 0.5431 4.5068 1.0834 3.427 3.2223 5.2746 8.242 7.2310 15.145
## 4 4.758 5.315 7.6257 0.1287 9.4057 6.168 0.2472 6.5325 2.102 4.5814 2.404
## 5 7.245 7.225 9.5763 3.0398 0.4194 5.937 9.2169 4.6228 2.527 9.2349 -7.910
## 6 8.761 5.177 1.7983 0.5949 9.2944 9.392 1.0017 0.4476 5.508 5.9687 9.764
Now, we fit an incorrect model.
fit = lm(y ~ x_1 + x_2 + x_6 + x_7, data = sim_data_1)
coef(fit)
## (Intercept) x_1 x_2 x_6 x_7
## -1.3758 -0.3572 2.1040 0.1344 -0.3367
Notice, we have coefficients for x_1, x_2, x_6, and x_7. This means that x_6 and x_7 are false positives,
while x_3, x_4, and x_5 are false negatives.
To detect the false negatives, use:
# which are false negatives?
!(signif %in% names(coef(fit)))
## [1] FALSE FALSE TRUE TRUE TRUE
To detect the false positives, use:
# which are false positives?
names(coef(fit)) %in% not_sig
## [1] FALSE FALSE FALSE TRUE TRUE
Note that in both cases, you could sum() the result to obtain the number of false negatives or positives.
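For example:
sum(!(signif %in% names(coef(fit)))) # number of false negatives
sum(names(coef(fit)) %in% not_sig)   # number of false positives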
(a) Set a seed equal to your birthday; then, using the given data for each x variable above in sim_data_1,
simulate the response variable y 300 times. Each time,
• Fit an additive model using each of the x variables.
• Perform variable selection using backwards AIC.
• Perform variable selection using backwards BIC.
• Calculate and store the number of false negatives for the models chosen by AIC and BIC.
• Calculate and store the number of false positives for the models chosen by AIC and BIC.
Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the
two methods. Arrange your results in a well-formatted table.
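A sketch of the simulation loop, assuming step() performs the backward searches (k = 2 for AIC, k = log(n) for BIC) and a placeholder seed stands in for your birthday:
set.seed(19900101) # placeholder; replace with your birthday
num_sims = 300
fn_aic = fp_aic = fn_bic = fp_bic = rep(0, num_sims)
for (i in 1:num_sims) {
  # re-simulate the response with the x values held fixed
  sim_data_1$y = with(sim_data_1,
    beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 +
    beta_4 * x_4 + beta_5 * x_5 + rnorm(n, 0, sigma))
  fit_full = lm(y ~ ., data = sim_data_1)
  fit_aic = step(fit_full, direction = "backward", trace = 0)
  fit_bic = step(fit_full, direction = "backward", k = log(n), trace = 0)
  # count false negatives and false positives for each selected model
  fn_aic[i] = sum(!(signif %in% names(coef(fit_aic))))
  fn_bic[i] = sum(!(signif %in% names(coef(fit_bic))))
  fp_aic[i] = sum(names(coef(fit_aic)) %in% not_sig)
  fp_bic[i] = sum(names(coef(fit_bic)) %in% not_sig)
}
# dividing the mean counts by 5 (the size of each group) expresses them as rates
data.frame(method = c("AIC", "BIC"),
           fn_rate = c(mean(fn_aic), mean(fn_bic)) / 5,
           fp_rate = c(mean(fp_aic), mean(fp_bic)) / 5)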
(b) Set a seed equal to your birthday; then, using the given data for each x variable below in sim_data_2,
simulate the response variable y 300 times. Each time,
• Fit an additive model using each of the x variables.
• Perform variable selection using backwards AIC.
• Perform variable selection using backwards BIC.
• Calculate and store the number of false negatives for the models chosen by AIC and BIC.
• Calculate and store the number of false positives for the models chosen by AIC and BIC.
Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the
two methods. Arrange your results in a well-formatted table. Also compare to your answers in part (a) and
suggest a reason for any differences.
set.seed(94)
x_1 = runif(n, 0, 10)
x_2 = runif(n, 0, 10)
x_3 = runif(n, 0, 10)
x_4 = runif(n, 0, 10)
x_5 = runif(n, 0, 10)
x_6 = runif(n, 0, 10)
x_7 = runif(n, 0, 10)
x_8 = x_1 + rnorm(n, 0, 0.1)
x_9 = x_1 + rnorm(n, 0, 0.1)
x_10 = x_2 + rnorm(n, 0, 0.1)
sim_data_2 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
  y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 +
      beta_5 * x_5 + rnorm(n, 0, sigma)
)
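The part (a) loop can be reused here essentially unchanged; swap sim_data_1 for sim_data_2 when re-simulating y and fitting the models.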