Description
Exercise 1 (Using lm for Inference)
For this exercise we will use the cats dataset from the MASS package. You should use ?cats to learn about
the background of this dataset.
(a) Fit the following simple linear regression model in R. Use heart weight as the response and body weight
as the predictor.
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
Store the results in a variable called cat_model. Use a t test to test the significance of the regression. Report
the following:
• The null and alternative hypotheses
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.05
• A conclusion in the context of the problem
When reporting these, you should explicitly state them in your document, not assume that a reader will find
and interpret them from a large block of R output.
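One way to report these quantities explicitly is to pull them out of the fitted model rather than pointing at the full summary() printout. The sketch below assumes the MASS package is installed and that the columns are named Hwt and Bwt, as documented in ?cats. The names coef_table, t_stat, p_val, and reject_h0 are illustrative and not required by the exercise.

```r
library(MASS)  # provides the cats data (columns Sex, Bwt, Hwt)

# Fit heart weight (Hwt) on body weight (Bwt).
cat_model = lm(Hwt ~ Bwt, data = cats)

# The coefficient table has one row per coefficient and the columns
# "Estimate", "Std. Error", "t value", and "Pr(>|t|)".
coef_table = summary(cat_model)$coefficients

t_stat = coef_table["Bwt", "t value"]   # test statistic for H0: beta_1 = 0
p_val  = coef_table["Bwt", "Pr(>|t|)"]  # two-sided p-value

# Statistical decision at alpha = 0.05 (illustrative name).
reject_h0 = p_val < 0.05
```

The extracted values can then be quoted in the text of the document next to the hypotheses and the conclusion.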
(b) Calculate a 95% confidence interval for β1. Give an interpretation of the interval in the context of the
problem.
(c) Calculate a 90% confidence interval for β0. Give an interpretation of the interval in the context of the
problem.
(d) Use a 90% confidence interval to estimate the mean heart weight for body weights of 2.1 and 2.8
kilograms. Which of the two intervals is wider? Why?
(e) Use a 90% prediction interval to predict the heart weight for body weights of 2.8 and 4.2 kilograms.
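Parts (d) and (e) differ only in the type of interval requested from predict(). A minimal sketch, assuming the model from part (a) and illustrative object names (new_mean, new_obs, ci_mean, pi_new, ci_width):

```r
library(MASS)
cat_model = lm(Hwt ~ Bwt, data = cats)

# 90% confidence intervals for the MEAN heart weight at two body weights.
new_mean = data.frame(Bwt = c(2.1, 2.8))
ci_mean = predict(cat_model, newdata = new_mean,
                  interval = "confidence", level = 0.90)

# 90% prediction intervals for a NEW observation at two body weights.
new_obs = data.frame(Bwt = c(2.8, 4.2))
pi_new = predict(cat_model, newdata = new_obs,
                 interval = "prediction", level = 0.90)

# Width of each confidence interval (columns are "fit", "lwr", "upr").
ci_width = ci_mean[, "upr"] - ci_mean[, "lwr"]

# Useful context when interpreting the intervals: the sample mean and
# the observed range of the predictor.
mean(cats$Bwt)
range(cats$Bwt)
```

Comparing each requested body weight with mean(cats$Bwt) and range(cats$Bwt) is one way to reason about interval width and about whether a prediction extrapolates beyond the observed data.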
(f) Create a scatterplot of the data. Add the regression line, 95% confidence bands, and 95% prediction
bands.
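One possible approach to the bands, sketched with base graphics: evaluate both interval types on a grid of body weights and draw them with lines(). The grid size, colors, line types, and axis labels are arbitrary choices, and the names bwt_grid, conf_band, and pred_band are illustrative.

```r
library(MASS)
cat_model = lm(Hwt ~ Bwt, data = cats)

# Grid of body weights spanning the observed data.
bwt_grid = data.frame(Bwt = seq(min(cats$Bwt), max(cats$Bwt), length.out = 100))

# Pointwise 95% confidence and prediction limits along the grid.
conf_band = predict(cat_model, newdata = bwt_grid,
                    interval = "confidence", level = 0.95)
pred_band = predict(cat_model, newdata = bwt_grid,
                    interval = "prediction", level = 0.95)

# Scatterplot with enough vertical room for the prediction bands.
plot(Hwt ~ Bwt, data = cats,
     xlab = "Body weight", ylab = "Heart weight",
     ylim = range(pred_band, cats$Hwt))
abline(cat_model, lwd = 2)  # fitted regression line

# Confidence bands (dashed) and prediction bands (dotted).
lines(bwt_grid$Bwt, conf_band[, "lwr"], lty = 2, col = "dodgerblue")
lines(bwt_grid$Bwt, conf_band[, "upr"], lty = 2, col = "dodgerblue")
lines(bwt_grid$Bwt, pred_band[, "lwr"], lty = 3, col = "darkorange")
lines(bwt_grid$Bwt, pred_band[, "upr"], lty = 3, col = "darkorange")
```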
(g) Use a t test to test:
• $H_0: \beta_1 = 4$
• $H_1: \beta_1 \neq 4$
Report the following:
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.05
When reporting these, you should explicitly state them in your document, not assume that a reader will find
and interpret them from a large block of R output.
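The t value printed by summary() tests a slope of 0, so a test against a different hypothesized value has to be computed by hand from the estimate and its standard error. A minimal sketch under the model from part (a); the names beta_1_null, se_beta_1, df_resid, t_stat, and p_val are illustrative.

```r
library(MASS)
cat_model = lm(Hwt ~ Bwt, data = cats)

coef_table = summary(cat_model)$coefficients
beta_1_hat = coef_table["Bwt", "Estimate"]
se_beta_1  = coef_table["Bwt", "Std. Error"]

beta_1_null = 4  # hypothesized slope under H0

# t statistic: (estimate - hypothesized value) / standard error.
t_stat = (beta_1_hat - beta_1_null) / se_beta_1

# Two-sided p-value from a t distribution with n - 2 degrees of freedom.
df_resid = df.residual(cat_model)
p_val = 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)
```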
Exercise 2 (More lm for Inference)
For this exercise we will use the Ozone dataset from the mlbench package. You should use ?Ozone to learn
about the background of this dataset. You may need to install the mlbench package. If you do so, do not
include code to install the package in your R Markdown document.
For simplicity, we will re-perform the data cleaning done in the previous homework.
data(Ozone, package = "mlbench")
Ozone = Ozone[, c(4, 6, 7, 8)]
colnames(Ozone) = c("ozone", "wind", "humidity", "temp")
Ozone = Ozone[complete.cases(Ozone), ]
(a) Fit the following simple linear regression model in R. Use the ozone measurement as the response and
wind speed as the predictor.
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
Store the results in a variable called ozone_wind_model. Use a t test to test the significance of the regression.
Report the following:
• The null and alternative hypotheses
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.01
• A conclusion in the context of the problem
When reporting these, you should explicitly state them in your document, not assume that a reader will find
and interpret them from a large block of R output.
(b) Fit the following simple linear regression model in R. Use the ozone measurement as the response and
temperature as the predictor.
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
Store the results in a variable called ozone_temp_model. Use a t test to test the significance of the regression.
Report the following:
• The null and alternative hypotheses
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.01
• A conclusion in the context of the problem
When reporting these, you should explicitly state them in your document, not assume that a reader will find
and interpret them from a large block of R output.
Exercise 3 (Simulating Sampling Distributions)
For this exercise we will simulate data from the following model:
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
where $\epsilon_i \sim N(0, \sigma^2)$. Also, the parameters are known to be:
• $\beta_0 = -5$
• $\beta_1 = 3.25$
• $\sigma^2 = 16$
We will use samples of size n = 50.
(a) Simulate this model 2000 times. Each time use lm() to fit a simple linear regression model, then store
the values of $\hat{\beta}_0$ and $\hat{\beta}_1$. Set a seed using your birthday before performing the simulation. Note, we are
simulating the x values once, and then they remain fixed for the remainder of the exercise.
birthday = 18760613
set.seed(birthday)
n = 50
x = seq(0, 10, length = n)
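The simulation loop could be organized as sketched below. The setup lines are repeated so the block runs on its own. The names num_sims, beta_0_hats, beta_1_hats, and sim_model are illustrative, and sigma = 4 follows from $\sigma^2 = 16$.

```r
birthday = 18760613
set.seed(birthday)
n = 50
x = seq(0, 10, length = n)  # x values are generated once and held fixed

beta_0 = -5
beta_1 = 3.25
sigma  = 4  # standard deviation, since sigma^2 = 16

num_sims = 2000
beta_0_hats = rep(0, num_sims)
beta_1_hats = rep(0, num_sims)

for (i in 1:num_sims) {
  # New noise each time; the same fixed x values are reused.
  eps = rnorm(n, mean = 0, sd = sigma)
  y = beta_0 + beta_1 * x + eps

  sim_model = lm(y ~ x)
  beta_0_hats[i] = coef(sim_model)[1]  # estimated intercept
  beta_1_hats[i] = coef(sim_model)[2]  # estimated slope
}
```

Pre-allocating the two result vectors avoids growing them inside the loop.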
(b) Create a table that summarizes the results of the simulations. The table should have two columns, one
for $\hat{\beta}_0$ and one for $\hat{\beta}_1$. The table should have four rows:
• A row for the true expected value given the known values of x
• A row for the mean of the simulated values
• A row for the true standard deviation given the known values of x
• A row for the standard deviation of the simulated values
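For the two "true" rows, the standard SLR results with fixed x are $E[\hat{\beta}_0] = \beta_0$, $E[\hat{\beta}_1] = \beta_1$, $SD[\hat{\beta}_0] = \sigma \sqrt{1/n + \bar{x}^2 / S_{xx}}$, and $SD[\hat{\beta}_1] = \sigma / \sqrt{S_{xx}}$. A sketch of computing them follows; the names Sxx, sd_beta_0_hat, sd_beta_1_hat, and true_rows are illustrative, and the simulated rows would be added from the vectors stored in part (a).

```r
n = 50
x = seq(0, 10, length = n)

beta_0 = -5
beta_1 = 3.25
sigma  = 4  # since sigma^2 = 16

# Sum of squared deviations of the fixed x values.
Sxx = sum((x - mean(x)) ^ 2)

# True standard deviations of the estimators given these x values.
sd_beta_0_hat = sigma * sqrt(1 / n + mean(x) ^ 2 / Sxx)
sd_beta_1_hat = sigma / sqrt(Sxx)

# The two "true" rows of the summary table.
true_rows = data.frame(
  beta_0_hat = c(beta_0, sd_beta_0_hat),
  beta_1_hat = c(beta_1, sd_beta_1_hat),
  row.names  = c("True expected value", "True standard deviation")
)
```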
(c) Plot two histograms side-by-side:
• A histogram of your simulated values for $\hat{\beta}_0$. Add the normal curve for the true sampling distribution
of $\hat{\beta}_0$.
• A histogram of your simulated values for $\hat{\beta}_1$. Add the normal curve for the true sampling distribution
of $\hat{\beta}_1$.
Exercise 4 (Simulating Confidence Intervals)
For this exercise we will simulate data from the following model:
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
where $\epsilon_i \sim N(0, \sigma^2)$. Also, the parameters are known to be:
• $\beta_0 = 5$
• $\beta_1 = 2$
• $\sigma^2 = 9$
We will use samples of size n = 25.
Our goal here is to use simulation to verify that the confidence intervals really do have their stated confidence
level. Do not use the confint() function for this entire exercise.
(a) Simulate this model 2500 times. Each time use lm() to fit a simple linear regression model, then store
the value of $\hat{\beta}_1$ and $s_e$. Set a seed using your birthday before performing the simulation. Note, we are
simulating the x values once, and then they remain fixed for the remainder of the exercise.
birthday = 18760613
set.seed(birthday)
n = 25
x = seq(0, 2.5, length = n)
(b) For each of the $\hat{\beta}_1$ that you simulated, calculate a 95% confidence interval. Store the lower limits in a
vector lower_95 and the upper limits in a vector upper_95. Some hints:
• You will need to use qt() to calculate the critical value, which will be the same for each interval.
• Remember that x is fixed, so Sxx will be the same for each interval.
• You could, but do not need to write a for loop. Remember vectorized operations.
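The hints above can be combined into a vectorized calculation, sketched here together with the simulation from part (a) so that the block runs on its own. It assumes that $s_e$ denotes the residual standard error of each fit, taken here from summary(...)$sigma, so that the standard error of $\hat{\beta}_1$ is $s_e / \sqrt{S_{xx}}$. The names num_sims, beta_1_hats, s_e, crit_95, and margin_95 are illustrative.

```r
birthday = 18760613
set.seed(birthday)
n = 25
x = seq(0, 2.5, length = n)

beta_0 = 5
beta_1 = 2
sigma  = 3  # since sigma^2 = 9

num_sims = 2500
beta_1_hats = rep(0, num_sims)
s_e = rep(0, num_sims)

for (i in 1:num_sims) {
  y = beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = sigma)
  sim_model = lm(y ~ x)
  beta_1_hats[i] = coef(sim_model)[2]       # estimated slope
  s_e[i] = summary(sim_model)$sigma         # residual standard error
}

# x is fixed, so Sxx and the critical value are shared by every interval.
Sxx = sum((x - mean(x)) ^ 2)
crit_95 = qt(0.975, df = n - 2)

# Vectorized margins: one margin per simulated fit, no second loop needed.
margin_95 = crit_95 * s_e / sqrt(Sxx)
lower_95 = beta_1_hats - margin_95
upper_95 = beta_1_hats + margin_95
```

A 99% interval follows the same pattern with a different quantile passed to qt().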
(c) What proportion of these intervals contains the true value of β1?
(d) Based on these intervals, what proportion of the simulations would reject the test $H_0: \beta_1 = 0$ vs
$H_1: \beta_1 \neq 0$ at $\alpha = 0.05$?
(e) For each of the $\hat{\beta}_1$ that you simulated, calculate a 99% confidence interval. Store the lower limits in a
vector lower_99 and the upper limits in a vector upper_99.
(f) What proportion of these intervals contains the true value of β1?
(g) Based on these intervals, what proportion of the simulations would reject the test $H_0: \beta_1 = 0$ vs
$H_1: \beta_1 \neq 0$ at $\alpha = 0.01$?
Exercise 5 (Prediction Intervals “without” predict)
Write a function named calc_pred_int that calculates prediction intervals:
\[ \hat{y}(x) \pm t_{\alpha/2, n-2} \cdot s_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} \]
for the linear model
\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i. \]
(a) Write this function. You may use the predict() function, but you may not supply a value for the
level argument of predict(). (You can certainly use predict() any way you would like in order to check
your work.)
The function should take three inputs:
• model, a model object that is the result of fitting the SLR model with lm()
• newdata, a data frame with a single observation (row)
– This data frame will need to have a variable (column) with the same name as the predictor
variable in the data used to fit model.
• level, the level (0.90, 0.95, etc) for the interval with a default value of 0.95
The function should return a named vector with three elements:
• estimate, the midpoint of the interval
• lower, the lower bound of the interval
• upper, the upper bound of the interval
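One possible shape for such a function is sketched below; it is not the only valid approach. It relies on predict() with se.fit = TRUE, which returns the point estimate (fit), the standard error of the estimated mean (se.fit), the residual degrees of freedom (df), and the residual standard error (residual.scale), and it never supplies the level argument. The internal names pred, est, se_pred, and crit are illustrative. The check at the end uses cats from MASS and compares against predict() with an explicit level, which the exercise permits for checking work.

```r
calc_pred_int = function(model, newdata, level = 0.95) {
  # Point estimate and standard error of the estimated mean response.
  pred = predict(model, newdata = newdata, se.fit = TRUE)
  est = unname(pred$fit)

  # Prediction standard error:
  # s_e^2 * (1 + 1/n + (x - xbar)^2 / Sxx) = residual.scale^2 + se.fit^2
  se_pred = sqrt(pred$residual.scale ^ 2 + unname(pred$se.fit) ^ 2)

  # Critical value with n - 2 degrees of freedom for the requested level.
  crit = qt(1 - (1 - level) / 2, df = pred$df)

  c(estimate = est,
    lower    = est - crit * se_pred,
    upper    = est + crit * se_pred)
}

# Checking the result against predict() with an explicit level.
library(MASS)
check_model = lm(Hwt ~ Bwt, data = cats)
check_data  = data.frame(Bwt = 3.0)
calc_pred_int(check_model, check_data, level = 0.90)
predict(check_model, newdata = check_data,
        interval = "prediction", level = 0.90)
```

Using se.fit avoids recomputing $\bar{x}$ and $S_{xx}$ by hand, at the cost of leaning on predict() for part of the calculation.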
(b) After writing the function, run this code:
newcat_1 = data.frame(Bwt = 4.0)
calc_pred_int(cat_model, newcat_1)
(c) After writing the function, run this code:
newcat_2 = data.frame(Bwt = 3.3)
calc_pred_int(cat_model, newcat_2, level = 0.90)