Description
Exercise 1 (Using lm)
For this exercise we will use the data stored in nutrition-2018.csv. It contains the nutritional values per
serving size for a large variety of foods as calculated by the USDA in 2018. It is a cleaned version totaling
5956 observations and is current as of April 2018.
The variables in the dataset are:
• ID
• Desc – short description of food
• Water – in grams
• Calories
• Protein – in grams
• Fat – in grams
• Carbs – carbohydrates, in grams
• Fiber – in grams
• Sugar – in grams
• Calcium – in milligrams
• Potassium – in milligrams
• Sodium – in milligrams
• VitaminC – vitamin C, in milligrams
• Chol – cholesterol, in milligrams
• Portion – description of standard serving size used in analysis
(a) Fit the following multiple linear regression model in R. Use Calories as the response and Fat, Sugar,
and Sodium as predictors.
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i.$$
Here,
• $Y_i$ is Calories.
• $x_{i1}$ is Fat.
• $x_{i2}$ is Sugar.
• $x_{i3}$ is Sodium.
Use an F-test to test the significance of the regression. Report the following:
• The null and alternative hypotheses
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.01
• A conclusion in the context of the problem
When reporting these, you should explicitly state them in your document, not assume that a reader will find
and interpret them from a large block of R output.
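As a rough sketch of where these quantities live in R, the code below fits a three-predictor model to the built-in mtcars data, which stands in for the nutrition data here, and extracts the F statistic and its p-value from the summary() object. The object names (fit, f_info, f_stat, p_val, reject_h0) are illustrative assumptions, not part of the exercise.

```r
# Stand-in example: mtcars replaces nutrition-2018.csv, which is not loaded here
fit = lm(mpg ~ wt + hp + disp, data = mtcars)

# summary()$fstatistic holds the F value and its two degrees of freedom
f_info = summary(fit)$fstatistic
f_stat = unname(f_info["value"])

# Upper-tail probability of the F distribution gives the p-value
p_val = unname(pf(f_stat, df1 = f_info["numdf"], df2 = f_info["dendf"],
                  lower.tail = FALSE))

# Decision at alpha = 0.01: reject H0 when the p-value is below alpha
reject_h0 = p_val < 0.01
c(F = f_stat, p = p_val)
```

Pulling the values out by name makes it easy to state them explicitly in the write-up instead of pointing the reader at the full summary() printout.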
(b) Output only the estimated regression coefficients. Interpret all $\hat{\beta}_j$ coefficients in the context of the
problem.
(c) Use your model to predict the number of Calories in a Big Mac. According to McDonald’s publicized
nutrition facts, the Big Mac contains 30g of fat, 9g of sugar, and 1010mg of sodium.
(d) Calculate the standard deviation, $s_y$, for the observed values in the Calories variable. Report the value
of $s_e$ from your multiple regression model. Interpret both estimates in the context of this problem.
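The two estimates in part (d) measure different things, which a small sketch may make concrete. It again uses mtcars as a stand-in dataset, and the names s_y, s_e, and s_e_manual are chosen for illustration only.

```r
# Stand-in example on mtcars, not the nutrition data
fit = lm(mpg ~ wt + hp + disp, data = mtcars)

# s_y: standard deviation of the observed response, ignoring all predictors
s_y = sd(mtcars$mpg)

# s_e: residual standard error reported by summary()
s_e = summary(fit)$sigma

# Same quantity by hand: sqrt(RSS / (n - p)), where p counts all coefficients
s_e_manual = sqrt(sum(resid(fit)^2) / df.residual(fit))

c(s_y = s_y, s_e = s_e, s_e_manual = s_e_manual)
```

Roughly speaking, s_y describes how far responses typically fall from their overall mean, while s_e describes how far they typically fall from the fitted regression.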
(e) Report the value of $R^2$ for the model. Interpret its meaning in the context of the problem.
(f) Calculate a 90% confidence interval for $\beta_2$. Give an interpretation of the interval in the context of the
problem.
(g) Calculate a 95% confidence interval for $\beta_0$. Give an interpretation of the interval in the context of the
problem.
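One possible way to obtain coefficient intervals like those in parts (f) and (g) is confint(), sketched here on the stand-in mtcars data. The manual check uses the usual estimate plus or minus a t critical value times the standard error; all object names are illustrative.

```r
# Stand-in example on mtcars, not the nutrition data
fit = lm(mpg ~ wt + hp + disp, data = mtcars)

# 90% interval for one slope, selected by coefficient name
ci_wt = confint(fit, parm = "wt", level = 0.90)

# 95% interval for the intercept
ci_int = confint(fit, parm = "(Intercept)", level = 0.95)

# Manual check for the slope: estimate +/- t critical value * standard error
est = coef(summary(fit))["wt", "Estimate"]
se = coef(summary(fit))["wt", "Std. Error"]
crit = qt(0.95, df = df.residual(fit))   # 0.95 because 5% sits in each tail
ci_wt_manual = c(est - crit * se, est + crit * se)

ci_wt
ci_int
```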
(h) Use a 99% confidence interval to estimate the mean Calorie content of a food with 23g of fat, 0g of
sugar, and 400mg of sodium, which is true of a large order of McDonald’s french fries. Interpret the interval
in context.
(i) Use a 99% prediction interval to predict the Calorie content of a Crunchwrap Supreme, which has 21g of
fat, 6g of sugar, and 1200mg of sodium according to Taco Bell’s publicized nutrition information. Interpret
the interval in context.
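The difference between the intervals in parts (h) and (i) can be sketched with predict(), whose interval argument switches between an interval for the mean response and one for a single new observation. The data (mtcars) and the new observation new_car are stand-ins invented for this illustration.

```r
# Stand-in example on mtcars, not the nutrition data
fit = lm(mpg ~ wt + hp + disp, data = mtcars)

# Hypothetical new observation; column names must match the model's predictors
new_car = data.frame(wt = 3, hp = 150, disp = 250)

# 99% interval for the mean response at these predictor values
conf_int = predict(fit, newdata = new_car, interval = "confidence", level = 0.99)

# 99% interval for a single new response at the same predictor values
pred_int = predict(fit, newdata = new_car, interval = "prediction", level = 0.99)

# Both share the same center; the prediction interval is wider
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

The prediction interval is wider because it accounts for the variability of an individual observation in addition to the uncertainty in the estimated mean.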
Exercise 2 (More lm for Multiple Regression)
For this exercise we will use the data stored in goalies.csv. It contains career data for 462 players in the
National Hockey League who played goaltender at some point up to and including the 2014-2015 season.
The variables in the dataset are:
• W – Wins
• GA – Goals Against
• SA – Shots Against
• SV – Saves
• SV_PCT – Save Percentage
• GAA – Goals Against Average
• SO – Shutouts
• MIN – Minutes
• PIM – Penalties in Minutes
For this exercise we will consider three models, each with Wins as the response. The predictors for these
models are:
• Model 1: Goals Against, Saves
• Model 2: Goals Against, Saves, Shots Against, Minutes, Shutouts
• Model 3: All Available
(a) Use an F-test to compare Models 1 and 2. Report the following:
• The null hypothesis
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.05
• The model you prefer
(b) Use an F-test to compare Model 3 to your preferred model from part (a). Report the following:
• The null hypothesis
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.05
• The model you prefer
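A nested-model F-test of the kind used in parts (a) and (b) can be sketched with anova(), which accepts a smaller and a larger fit. The example below uses mtcars as a stand-in, and the model and object names are illustrative assumptions.

```r
# Stand-in example on mtcars, not the goalies data
small = lm(mpg ~ wt, data = mtcars)
large = lm(mpg ~ wt + hp + disp, data = mtcars)

# With two nested fits, anova() tests H0: the extra coefficients (hp, disp) are all 0
cmp = anova(small, large)

# The second row of the table holds the comparison
f_stat = cmp$F[2]
p_val = cmp$`Pr(>F)`[2]

# Prefer the larger model only if H0 is rejected at the chosen alpha
prefer_large = p_val < 0.05
c(F = f_stat, p = p_val)
```

The null hypothesis always says that the coefficients present in the larger model but absent from the smaller one are all zero, so the smaller model must be a special case of the larger one.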
(c) Use a t-test to test $H_0: \beta_{\text{SV}} = 0$ vs $H_1: \beta_{\text{SV}} \neq 0$ for the model you preferred in part (b). Report the
following:
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.05
Exercise 3 (Regression without lm)
For this exercise we will once again use the Ozone data from the mlbench package. The goal of this exercise
is to fit a model with ozone as the response and the remaining variables as predictors.
data(Ozone, package = "mlbench")
Ozone = Ozone[, c(4, 6, 7, 8)]
colnames(Ozone) = c("ozone", "wind", "humidity", "temp")
Ozone = Ozone[complete.cases(Ozone), ]
(a) Obtain the estimated regression coefficients without the use of lm() or any other built-in functions for
regression. That is, you should use only matrix operations. Store the results in a vector beta_hat_no_lm.
To ensure this is a vector, you may need to use as.vector(). Return this vector as well as the results of
sum(beta_hat_no_lm ^ 2).
(b) Obtain the estimated regression coefficients with the use of lm(). Store the results in a vector
beta_hat_lm. To ensure this is a vector, you may need to use as.vector(). Return this vector as well as
the results of sum(beta_hat_lm ^ 2).
(c) Use the all.equal() function to verify that the results are the same. You may need to remove the
names of one of the vectors. The as.vector() function will do this as a side effect, or you can directly use
unname().
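The matrix approach in parts (a) through (c) rests on the least squares solution $\hat{\beta} = (X^\top X)^{-1} X^\top y$. A minimal sketch follows, using mtcars as a stand-in for the Ozone data; the names X, y, beta_hat, and beta_lm are illustrative.

```r
# Stand-in data: mtcars, not the Ozone data from the exercise
y = mtcars$mpg
X = cbind(1, mtcars$wt, mtcars$hp)   # leading column of 1s for the intercept

# Normal equations: solve (X'X) beta = X'y using matrix operations only
beta_hat = as.vector(solve(t(X) %*% X, t(X) %*% y))

# Reference fit, with names stripped so all.equal() compares values only
beta_lm = unname(coef(lm(mpg ~ wt + hp, data = mtcars)))

all.equal(beta_hat, beta_lm)
```

Calling solve() with two arguments solves the linear system directly; computing solve(t(X) %*% X) first and multiplying afterwards gives the same coefficients.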
(d) Calculate $s_e$ without the use of lm(). That is, continue with your results from (a) and perform additional
matrix operations to obtain the result. Output this result. Also, verify that this result is the same as the
result obtained from lm().
(e) Calculate $R^2$ without the use of lm(). That is, continue with your results from (a) and (d), and perform
additional operations to obtain the result. Output this result. Also, verify that this result is the same as the
result obtained from lm().
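Parts (d) and (e) can be sketched in the same style, using $s_e = \sqrt{SSE / (n - p)}$ and $R^2 = 1 - SSE / SST$, where p counts every coefficient including the intercept. The data (mtcars) and all object names are stand-ins for illustration; lm() appears only for the final comparison.

```r
# Stand-in data: mtcars, not the Ozone data from the exercise
y = mtcars$mpg
X = cbind(1, mtcars$wt, mtcars$hp)
n = nrow(X)
p = ncol(X)                          # number of coefficients, intercept included

beta_hat = solve(t(X) %*% X, t(X) %*% y)
y_hat = as.vector(X %*% beta_hat)

# Residual standard error from the sum of squared errors
sse = sum((y - y_hat)^2)
s_e = sqrt(sse / (n - p))

# R^2 as one minus the share of total variation left unexplained
sst = sum((y - mean(y))^2)
r2 = 1 - sse / sst

# lm() is used here only to check the matrix results
fit = lm(mpg ~ wt + hp, data = mtcars)
all.equal(s_e, summary(fit)$sigma)
all.equal(r2, summary(fit)$r.squared)
```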
Exercise 4 (Regression for Prediction)
For this exercise use the Auto dataset from the ISLR package. Use ?Auto to learn about the dataset. The
goal of this exercise is to find a model that is useful for predicting the response mpg. We remove the name
variable as it is not useful for this analysis. (Also, this is an easier-to-load version of the data from the textbook.)
# load required package, remove "name" variable
library(ISLR)
Auto = subset(Auto, select = -c(name))
When evaluating a model for prediction, we often look at RMSE. However, if we both fit the model with all
the data as well as evaluate RMSE using all the data, we’re essentially cheating. We’d like to use RMSE as
a measure of how well the model will predict on unseen data. If you haven’t already noticed, the way we
had been using RMSE resulted in RMSE decreasing as models became larger.
To correct for this, we will only use a portion of the data to fit the model, and then we will use leftover
data to evaluate the model. We will call these datasets train (for fitting) and test (for evaluating). The
definition of RMSE will stay the same:
$$\text{RMSE}(\text{model}, \text{data}) = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_i - \hat{y}_i)^2}$$
where
• $y_i$ are the actual values of the response for the given data.
• $\hat{y}_i$ are the predicted values using the fitted model and the predictors from the data.
However, we will now evaluate it on both the train set and the test set separately. So each model you fit
will have a train RMSE and a test RMSE. When calculating test RMSE, the predicted values will be found
by predicting the response using the test data with the model fit using the train data. Test data should
never be used to fit a model.
• Train RMSE: Model fit with train data. Evaluate on train data.
• Test RMSE: Model fit with train data. Evaluate on test data.
Set a seed of 11, and then split the Auto data into two datasets, one called auto_trn and one called
auto_tst. The auto_trn data frame should contain 292 randomly chosen observations. The auto_tst data
will contain the remaining observations. Hint: consider the following code:
set.seed(11)
auto_trn_idx = sample(1:nrow(Auto), 292)
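The train and test workflow described above might be sketched as follows. To stay self-contained, the sketch uses mtcars rather than Auto; the helper rmse(), the seed, the split sizes, and the model formula are all assumptions made for this illustration.

```r
# Hypothetical helper: RMSE of a fitted model on a data frame;
# `response` is the name of the outcome column
rmse = function(model, data, response) {
  sqrt(mean((data[[response]] - predict(model, newdata = data))^2))
}

# Stand-in split on mtcars (32 rows): 22 rows for training, the rest for testing
set.seed(1)
trn_idx = sample(1:nrow(mtcars), 22)
trn = mtcars[trn_idx, ]
tst = mtcars[-trn_idx, ]             # negative indexing keeps the leftover rows

# Fit on the training rows only, then score both sets with the same model
fit = lm(mpg ~ wt + hp, data = trn)
c(train = rmse(fit, trn, "mpg"), test = rmse(fit, tst, "mpg"))
```

Because the model never sees the test rows, the test RMSE gives a fairer picture of how the model might perform on new data.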
Fit a total of five models using the training data.
• One must use all possible predictors.
• One must use only displacement as a predictor.
• The remaining three you can pick to be anything you like. One of these should be the best of the five
for predicting the response.
For each model report the train and test RMSE. Arrange your results in a well-formatted markdown table.
Argue that one of your models is the best for predicting the response.
Exercise 5 (Simulating Multiple Regression)
For this exercise we will simulate data from the following model:
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \epsilon_i$$
where $\epsilon_i \sim N(0, \sigma^2)$. Also, the parameters are known to be:
• $\beta_0 = 2$
• $\beta_1 = -0.75$
• $\beta_2 = 1.5$
• $\beta_3 = 0$
• $\beta_4 = 0$
• $\beta_5 = 2$
• $\sigma^2 = 25$
We will use samples of size n = 42.
We will verify the distribution of $\hat{\beta}_1$ as well as investigate some hypothesis tests.
(a) We will first generate the X matrix and data frame that will be used throughout the exercise. Create
the following ten variables:
• x0: a vector of length n that contains all 1
• x1: a vector of length n that is randomly drawn from a normal distribution with a mean of 0 and a
standard deviation of 2
• x2: a vector of length n that is randomly drawn from a uniform distribution between 0 and 4
• x3: a vector of length n that is randomly drawn from a normal distribution with a mean of 0 and a
standard deviation of 1
• x4: a vector of length n that is randomly drawn from a uniform distribution between -2 and 2
• x5: a vector of length n that is randomly drawn from a normal distribution with a mean of 0 and a
standard deviation of 2
• X: a matrix that contains x0, x1, x2, x3, x4, and x5 as its columns
• C: the C matrix that is defined as $(X^\top X)^{-1}$
• y: a vector of length n that contains all 0
• sim_data: a data frame that stores y and the five predictor variables. y is currently a placeholder
that we will update during the simulation.
Report the sum of the diagonal of C as well as the 5th row of sim_data. For this exercise we will use the
seed 420. Generate the above variables in the order listed after running the code below to set a seed.
set.seed(420)
sample_size = 42
(b) Create three vectors of length 2500 that will store results from the simulation in part (c). Call them
beta_hat_1, beta_3_pval, and beta_5_pval.
(c) Simulate 2500 samples of size n = 42 from the model above. Each time update the y value of sim_data.
Then use lm() to fit a multiple regression model. Each time store:
• The value of $\hat{\beta}_1$ in beta_hat_1
• The p-value for the two-sided test of $\beta_3 = 0$ in beta_3_pval
• The p-value for the two-sided test of $\beta_5 = 0$ in beta_5_pval
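The general shape of such a simulation loop can be sketched on a smaller, made-up model. Everything below (the two predictors, the parameter values in beta, sigma, the seed, and the 200 repetitions) is hypothetical and deliberately differs from the exercise's setup.

```r
set.seed(1)
n = 30
num_sims = 200                       # fewer than the exercise's 2500, to keep this quick
x1 = rnorm(n)
x2 = runif(n)
sim = data.frame(y = rep(0, n), x1 = x1, x2 = x2)

# Hypothetical true parameters for this sketch only
beta = c(1, 2, 0)                    # intercept, x1 slope, x2 slope (x2 has no effect)
sigma = 3

# Preallocate storage rather than growing vectors inside the loop
b1_hat = rep(0, num_sims)
b2_pval = rep(0, num_sims)

for (i in 1:num_sims) {
  eps = rnorm(n, mean = 0, sd = sigma)
  # Predictors stay fixed across simulations; only y is regenerated
  sim$y = beta[1] + beta[2] * x1 + beta[3] * x2 + eps
  fit = lm(y ~ x1 + x2, data = sim)
  b1_hat[i] = coef(fit)["x1"]
  b2_pval[i] = coef(summary(fit))["x2", "Pr(>|t|)"]
}

mean(b1_hat)
```

The coefficient table returned by coef(summary(fit)) is a matrix, so estimates and p-values can be picked out by row and column name.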
(d) Based on the known values of X, what is the true distribution of $\hat{\beta}_1$?
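For reference, the standard sampling result for least squares with fixed X and independent normal errors can be written as below. Here $C_{jj}$ denotes the diagonal entry of C that belongs to $\beta_j$, counting from zero so that $C_{00}$ corresponds to the intercept; in R's one-based indexing that entry is C[j + 1, j + 1].

```latex
\hat{\beta} = (X^\top X)^{-1} X^\top Y
\sim N\left(\beta,\ \sigma^2 (X^\top X)^{-1}\right)
\quad \Longrightarrow \quad
\hat{\beta}_j \sim N\left(\beta_j,\ \sigma^2 C_{jj}\right),
\qquad C = (X^\top X)^{-1}
```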
(e) Calculate the mean and variance of beta_hat_1. Are they close to what we would expect? Plot a
histogram of beta_hat_1. Add a curve for the true distribution of $\hat{\beta}_1$. Does the curve seem to match the
histogram?
(f) What proportion of the p-values stored in beta_3_pval is less than 0.10? Is this what you would expect?
(g) What proportion of the p-values stored in beta_5_pval is less than 0.01? Is this what you would expect?
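A proportion like those in parts (f) and (g) is the mean of a logical vector. The sketch below uses uniform random draws as a stand-in for stored p-values, relying on the fact that p-values from a continuous test statistic are uniformly distributed when the null hypothesis is true and the model assumptions hold. The names pvals and prop_below are illustrative.

```r
set.seed(1)

# Stand-in for p-values collected under a true null hypothesis
pvals = runif(10000)

# mean() of a logical vector is the proportion of TRUE values
prop_below = mean(pvals < 0.10)
prop_below
```

When the null hypothesis is false, the proportion below the cutoff estimates the power of the test rather than its significance level.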