Description
Problem 1 (R exercise for linear regression, 50 points)
Consider the data set “fat” in the “faraway”
library of R. The data is also available on T-square or at
fat <- read.table(file = "http://www.isye.gatech.edu/~ymei/7406/Handouts/fat.csv",
sep = ",", header = TRUE);
The dataset fat has 252 observations and 18 variables; for a more detailed description, see
(http://artax.karlin.mff.cuni.cz/r-help/library/faraway/html/fat.html) or
(http://cran.r-project.org/web/packages/faraway/faraway.pdf) (page 37).
For more background information, you can also see
http://en.wikipedia.org/wiki/Body_fat_percentage
The purpose of this homework is to help you better understand linear regression and R. Here we take
the percentage of body fat computed using Brozek’s equation (brozek, the first column) as the response
variable, and the other 17 variables as potential predictors.
We will use several different statistical methods to fit this
dataset in the problem of predicting brozek from the other 17 potential predictors. For that purpose, it is
useful to split the analysis into the following sub-tasks.
(a) First, we should split the original data set into disjoint training and testing data sets, so that we
can better evaluate and compare different models. One simple way is to randomly select a
proportion, say 10%, of the observations for use as a test sample, and use the remaining
data as a training sample for building the different models.
Note that in practice it is more reasonable
to select a much larger proportion, say 20% or 30%, as the testing sample. Here we choose only 10%
as the testing sample, so that we can list those testing observations explicitly below. You can do so with
the following R code.
n = dim(fat)[1]; ### total number of observations
n1 = round(n/10); ### number of observations randomly selected for testing data
set.seed(7406); ### set the seed for randomization
flag = sort(sample(1:n, n1));
## If you are using other software, the 25 rows of testing observations are:
flag = c(1, 21, 22, 57, 70, 88, 91, 94, 121, 127, 149, 151, 159, 162,
164, 177, 179, 194, 206, 214, 215, 221, 240, 241, 243);
fat1train = fat[-flag,]; fat1test = fat[flag,];
(b) Second, for the training data “fat1train,” do some exploratory (or preliminary) data analysis,
such as scatter plots or summary statistics of variables that you feel are important (e.g., explain
any unusual patterns).
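For instance, a minimal EDA sketch might look as follows; the variables highlighted here are illustrative choices, not required ones.
summary(fat1train);                    ### summary statistics of all 18 variables
round(cor(fat1train)[, "brozek"], 2);  ### correlation of each variable with the response
pairs(fat1train[, c("brozek", "siri", "density", "weight", "height")]);
hist(fat1train$brozek, xlab = "Body fat (%)", main = "Distribution of brozek");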
(c) Based on the training data “fat1train,” build the following models (a code sketch follows this list):
(i) Linear regression with all predictors.
(ii) Linear regression with the best subset of k = 5 predictor variables.
(iii) Linear regression with variables selected stepwise using AIC.
(iv) Ridge regression.
(v) LASSO.
(vi) Principal component regression.
(vii) Partial least squares.
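Below is a hedged sketch of one way to fit these seven models; the package choices (leaps, MASS, lars, pls) and tuning details (the lambda grid, CV for the number of components) are common options, not the required ones.
library(leaps); library(MASS); library(lars); library(pls);
### (i) Linear regression with all predictors
model1 = lm(brozek ~ ., data = fat1train);
### (ii) Best subset of k = 5 predictors via exhaustive search
subsets = regsubsets(brozek ~ ., data = fat1train, nvmax = 5);
vars5 = names(coef(subsets, 5))[-1];     ### names of the 5 selected predictors
model2 = lm(as.formula(paste("brozek ~", paste(vars5, collapse = "+"))), data = fat1train);
### (iii) Stepwise selection using AIC
model3 = step(lm(brozek ~ ., data = fat1train), trace = 0);
### (iv) Ridge regression; the lambda grid is illustrative
model4 = lm.ridge(brozek ~ ., data = fat1train, lambda = seq(0, 100, 0.1));
indexopt = which.min(model4$GCV);        ### index of the GCV-minimizing lambda
### (v) LASSO via lars (glmnet is an equally valid choice)
model5 = lars(as.matrix(fat1train[, -1]), fat1train$brozek, type = "lasso");
### (vi) Principal component regression, number of components chosen by CV
model6 = pcr(brozek ~ ., data = fat1train, validation = "CV");
### (vii) Partial least squares, also tuned by CV
model7 = plsr(brozek ~ ., data = fat1train, validation = "CV");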
(d) Use the models you find in part (c) to predict the response in the testing data “fat1test” in part (a).
Report the performance of each model $\hat{f}$ on the testing data, say, $\{(Y_i^{\text{test}}, x_i^{\text{test}})\}_{i=1}^{n_1}$. Here $n_1 = 25$, and
we assume that the performance of each model is evaluated by the following testing error:
$$TE = \frac{1}{n_1} \sum_{i=1}^{n_1} \left[ Y_i^{\text{test}} - \hat{f}(x_i^{\text{test}}) \right]^2.$$
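As an illustration, the testing error of model (i) from part (c) can be computed as below; the other models follow the same pattern, though some need extra care (e.g., fixing ncomp for PCR/PLS, or extracting coefficients for ridge/LASSO), which this sketch only hints at.
ytrue = fat1test$brozek;
pred1 = predict(model1, newdata = fat1test);
TE1 = mean((ytrue - pred1)^2);           ### testing error of the full linear model
### e.g., PCR/PLS predictions require a chosen number of components:
pred6 = predict(model6, newdata = fat1test, ncomp = 5);  ### ncomp = 5 is illustrative
TE6 = mean((ytrue - pred6)^2);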
(e) The above steps are sufficient when one has a large data set. However, for a relatively small data set, one
may want to go further to assess the robustness of each method. One general approach is the Monte
Carlo cross-validation algorithm, which repeats the above computation B times (say, B = 100).
That is, for each loop $b = 1, \ldots, B$, we randomly select, say, $n_1 = 25$ observations from the original data
as the testing data, and use the remaining data as a training sample. Within each loop, we first build the
different models from “the training data of that specific loop,” and then evaluate their performances
on “the corresponding testing data.”
Therefore, for each model or method in part (c), we will obtain
B values of the testing error on B different subsets of testing data, denoted by $TE_b$ for $b = 1, 2, \ldots, B$.
The “average” performance of each model can then be summarized by the sample mean and sample
variance of these B TE values:
$$TE^{*} = \frac{1}{B} \sum_{b=1}^{B} TE_b \qquad \text{and} \qquad \widehat{\mathrm{Var}}(TE) = \frac{1}{B-1} \sum_{b=1}^{B} \left( TE_b - TE^{*} \right)^2.$$
Compute and compare the “average” performances of each model mentioned in part (c).
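For illustration, a minimal sketch of this Monte Carlo cross-validation loop is below, shown for model (i) only and reusing n and n1 from part (a); in your own solution, each loop would refit all seven models and store one column of testing errors per method.
B = 100;                 ### number of Monte Carlo loops
TEALL = NULL;            ### will hold one row of testing errors per loop
set.seed(7406);          ### any fixed seed works; 7406 is illustrative
for (b in 1:B) {
  flagb = sort(sample(1:n, n1));            ### a fresh random split in each loop
  traindata = fat[-flagb, ]; testdata = fat[flagb, ];
  modb = lm(brozek ~ ., data = traindata);  ### refit model (i) on this split
  TEb = mean((testdata$brozek - predict(modb, newdata = testdata))^2);
  TEALL = rbind(TEALL, TEb);                ### in practice: one entry per method
}
apply(TEALL, 2, mean);   ### TE* for each method
apply(TEALL, 2, var);    ### sample variance of the B TE values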
Write a report to summarize your findings. The report should include (i) Introduction, (ii) Exploratory (or Preliminary) Data Analysis of the training data from part (a), (iii) Methods, (iv) Results,
and (v) Findings. Also see the guidelines on the final report of our course project. Please attach your R
code (without, or with limited, output) in an appendix of your report, and please do not just dump raw R
output in the body of the report.
Remark: In parts (c) and (e), please see the updated R code for linear regression on Canvas. Note that in
part (e), the same original data set is used repeatedly, B times as a whole, but it is used differently in different
loops due to the different splits into training and testing data.
The idea of repeating a similar data-analysis
process B times is essential in many well-known statistical tools such as bootstrapping and random
forests, and has been widely used in other fields such as bioinformatics and computational biology.
For your convenience, I have also posted some R code in the PDF file of this homework on Canvas that might
be useful. Please feel free to modify that code if you want.
To encourage everyone to learn the material,
each student must write their own R (or other software) code, and no collaboration is allowed!
Copying and pasting a classmate’s code is cheating.