Description
The last lab introduced you to simple linear regression (SLR) with a data set that had a
non-linear trend, which meant that a straight line was an inappropriate choice of model.
The model was applied anyway, and some skills were developed: plotting points and segments,
adding the fitted line, reading parameter estimates from the summary output, and
interpreting multiple $R^2$.
Today we will begin where the last lab left off and examine the assumptions of the linear
model. If the assumptions hold we say that the analysis performed is valid.
Objectives
In this lab you will learn how to:
1. Create a linear model with $x^2$ and $x$ variables.
2. Create residual plots for two models and be able to compare and interpret them.
3. Create QQ plots and interpret them.
4. Carry out and interpret the Shapiro-Wilk test.
5. Interpret regression summary output (similar to last lab).
6. Make predictions for the new model.
7. Learn about piecewise regression.
8. Learn how to make an R package using the package roxygen2
Tasks
Make an RMD document and then knit it to HTML.
Upload both to CANVAS.
Note: All plots you are asked to make should be recorded in this
document.
• Task 1
o Download the zipped data files, "Dataxls", from CANVAS.
o Unzip the contents into a directory on your desktop (call it LAB4).
o Download the file "lab4.r".
o Place this file with the others in LAB4.
o Start RStudio.
o Open "lab4.r" from within RStudio.
o Go to the "Session" menu within RStudio and "Set Working Directory" to where the source
files are located.
o Call the function getwd() to confirm the working directory.
• Task 2
o Find the file "SPRUCE.xls" inside LAB4.
o Open it in Excel.
o Save it as type CSV (comma delimited), "*.csv".
o Use read.table(file.choose(), header=TRUE, sep=",") to read the data
into R (or any other method available). This call is already available within the
script lab4.r, which you have opened in RStudio.
o Display the last six lines of the data using tail(), then copy and paste them here:
o Make a new file for your code in the RStudio editor, call it "mylab4.R", and place in it all the
code you need to answer the tasks of this lab (copy and paste from lab4.r).
o Use the hash symbol # to write your own comments in the code file explaining what the
code does.
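The reading step in Task 2 can be sketched as follows. Because file.choose() needs an interactive session, this sketch writes a small made-up CSV to a temporary file and reads it back with the same read.table() arguments. The file demo.csv, the object demo.df and the numbers in it are hypothetical stand-ins, not the SPRUCE data.

```r
# Hypothetical stand-in for SPRUCE.csv: made-up values written to a temp file
demo.csv <- tempfile(fileext = ".csv")
writeLines(c("BHDiameter,Height",
             "10.2,12.1",
             "14.8,15.9",
             "18.3,17.6",
             "21.7,18.8",
             "25.1,19.4",
             "29.6,20.3",
             "33.0,20.9"), demo.csv)

# Same arguments as in the lab, with an explicit path instead of file.choose()
demo.df <- read.table(demo.csv, header = TRUE, sep = ",")

# tail() shows the last six rows by default
tail(demo.df)
```

With your own data you would replace demo.csv by file.choose() or by the path to your saved CSV file.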
• Task 3
o The SPRUCE data set is described in MS 10.52, pages 478 and 479. This data set has two
variables, Height = Height of Spruce trees in m (this is what we want to predict) and
BHDiameter = Breast height Diameter in cm. The idea is that breast height diameter is an
easy measurement to make whereas the height of the trees is much more difficult. We
want to see if there is a relationship between the two variables that enables us to predict
Height from Diameter.
o Load the library s20x and make a lowess smoother scatter plot (Height vs BHDiameter)
using trendscatter() with f=0.5. Record the plot.
o Make a linear model object:
spruce.lm = with(spruce.df, lm(Height ~ BHDiameter))
o Find the residuals using residuals(), put them into an object called height.res
o Find the fitted values using fitted() and place them in an object called
height.fit.
o Plot the residuals vs fitted values.
o Plot the residuals vs fitted values using trendscatter()
o What shape is seen in the plot? Compare it with the curve made by the first trendscatter()
call (the second item of Task 3).
o Using the plot() function and spruce.lm, make the residual plot.
o Check normality using the s20x function normcheck(). Please note that you may need
to add an additional option to show the Shapiro-Wilk test (use ?normcheck )
o What is the p-value for the Shapiro-Wilk test? What is the null hypothesis in this case?
o $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$ describes the model used above. Notice that the
residuals $r_i$ estimate the model errors $\epsilon_i$. If the model works well with the data we should
expect that the residuals are approximately Normal in distribution with mean 0 and
constant variance.
o Write a sentence outlining your conclusions concerning the validity of applying the
straight line to this data set.
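The residual checks in Task 3 can be sketched in base R as below. This is a minimal sketch under stated assumptions: the data are simulated with a deliberately curved trend (they are NOT the SPRUCE data), demo.df and demo.lm are hypothetical names, and the base functions lowess(), qqnorm() and shapiro.test() are swapped in for the s20x functions trendscatter() and normcheck() so that the code runs without extra packages.

```r
# Simulated stand-in data (NOT the SPRUCE data): curved trend plus Normal noise
set.seed(1)
demo.df <- data.frame(BHDiameter = runif(40, 5, 30))
demo.df$Height <- 2 + 1.4 * demo.df$BHDiameter -
  0.025 * demo.df$BHDiameter^2 + rnorm(40, sd = 1)

# Straight-line fit, as in the task
demo.lm <- lm(Height ~ BHDiameter, data = demo.df)

height.res <- residuals(demo.lm)  # observed minus fitted values
height.fit <- fitted(demo.lm)     # values on the fitted line

# Residuals vs fitted values; a curved pattern suggests a missing term
plot(height.fit, height.res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(height.fit, height.res, f = 0.5), col = "red")  # base-R smoother

# Normal QQ plot and Shapiro-Wilk test of the residuals
qqnorm(height.res)
qqline(height.res)
shapiro.test(height.res)
```

The Shapiro-Wilk test only addresses Normality; the curvature in the residual plot is a separate check on whether the straight-line mean function is adequate.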
• Task 4
o Fit a quadratic to the points using the appropriate formula inside the lm() function and
placing the output in the object quad.lm.
o Make a fresh scatter plot of Height Vs BHDiameter and add the quadratic curve to it.
o Make quad.fit, a vector of fitted values.
o Make a plot of the residuals vs fitted values, use plot() and quad.lm
o Construct a QQ plot using normcheck()
o What is the p-value of the Shapiro-Wilk test? What do you conclude?
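One way to fit and draw a quadratic for Task 4 is sketched below. The data are again simulated (NOT the SPRUCE data) and demo.df is a hypothetical name; base shapiro.test() stands in for normcheck(). The formula uses I() so that ^ is treated as arithmetic squaring rather than as a formula operator.

```r
# Simulated stand-in data (NOT the SPRUCE data)
set.seed(1)
demo.df <- data.frame(BHDiameter = runif(40, 5, 30))
demo.df$Height <- 2 + 1.4 * demo.df$BHDiameter -
  0.025 * demo.df$BHDiameter^2 + rnorm(40, sd = 1)

# Quadratic model: intercept, linear term and squared term
quad.lm <- lm(Height ~ BHDiameter + I(BHDiameter^2), data = demo.df)
quad.fit <- fitted(quad.lm)  # vector of fitted values

# Fresh scatter plot with the fitted quadratic drawn over it
plot(Height ~ BHDiameter, data = demo.df)
b <- unname(coef(quad.lm))
curve(b[1] + b[2] * x + b[3] * x^2, add = TRUE, col = "blue", lwd = 2)

# Residuals vs fitted values from the model object, then a Normality test
plot(quad.lm, which = 1)
shapiro.test(residuals(quad.lm))
```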
• Task 5
o Summarize quad.lm and paste the output here.
o What is the value of $\hat{\beta}_0$?
o What is the value of $\hat{\beta}_1$?
o What is the value of $\hat{\beta}_2$?
o Make interval estimates for $\beta_0$, $\beta_1$, $\beta_2$.
o Write down the equation of the fitted curve.
o Predict the Height of spruce when the Diameter is 15, 18 and 20cm (use
predict())
o Compare with the previous predictions.
o What is the value of multiple $R^2$? Compare it with the previous model.
o Use adjusted $R^2$ to compare the models and determine which is "better".
Use the web to learn about adjusted $R^2$.
o What does multiple $R^2$ mean in this case?
o Which model explains the most variability in the Height?
o Use anova() and compare the two models. Paste anova output here and give your
conclusion underneath.
o Find TSS, record it here
o Find MSS, record it here
o Find RSS, record it here
o What is the value of MSS/TSS?
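The quantities asked for in Task 5 can be obtained along the following lines. This is a sketch on simulated data (NOT the SPRUCE data); demo.df, lin.lm and new.df are hypothetical names. It assumes that TSS, MSS and RSS denote the total, model and residual sums of squares, so that TSS = MSS + RSS for a model with an intercept.

```r
# Simulated stand-in data (NOT the SPRUCE data)
set.seed(1)
demo.df <- data.frame(BHDiameter = runif(40, 5, 30))
demo.df$Height <- 2 + 1.4 * demo.df$BHDiameter -
  0.025 * demo.df$BHDiameter^2 + rnorm(40, sd = 1)

lin.lm  <- lm(Height ~ BHDiameter, data = demo.df)
quad.lm <- lm(Height ~ BHDiameter + I(BHDiameter^2), data = demo.df)

coef(quad.lm)     # point estimates of beta0, beta1, beta2
confint(quad.lm)  # interval estimates (95% by default)

# Predictions at chosen diameters
new.df <- data.frame(BHDiameter = c(15, 18, 20))
predict(quad.lm, newdata = new.df)

# Multiple and adjusted R-squared for both models
c(linear = summary(lin.lm)$r.squared, quadratic = summary(quad.lm)$r.squared)
c(linear = summary(lin.lm)$adj.r.squared, quadratic = summary(quad.lm)$adj.r.squared)

# F test comparing the straight line with the quadratic (nested models)
anova(lin.lm, quad.lm)

# Sums of squares for the quadratic model
TSS <- sum((demo.df$Height - mean(demo.df$Height))^2)    # total
RSS <- sum(residuals(quad.lm)^2)                         # residual
MSS <- sum((fitted(quad.lm) - mean(demo.df$Height))^2)   # model
MSS / TSS  # agrees with multiple R-squared from summary()
```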
• Task 6
o Investigate unusual points by making a Cook's distance plot using cooks20x(). Place the plot
here.
o Use the web to find out what Cook's distance is and how it is used, then write a couple of
sentences here.
o What does Cook's distance for the quadratic model and data tell you?
o Make a new object called quad2.lm, fitted with the same quadratic model
to the data with the observation that has the highest Cook's distance removed.
o Summarize the new object here.
o Compare with the summary information from quad.lm
o What do you conclude?
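The refitting step in Task 6 might look like the sketch below. It uses the base function cooks.distance() in place of cooks20x() so that it runs without s20x, and it uses simulated data (NOT the SPRUCE data); demo.df, cd and worst are hypothetical names.

```r
# Simulated stand-in data (NOT the SPRUCE data)
set.seed(1)
demo.df <- data.frame(BHDiameter = runif(40, 5, 30))
demo.df$Height <- 2 + 1.4 * demo.df$BHDiameter -
  0.025 * demo.df$BHDiameter^2 + rnorm(40, sd = 1)

quad.lm <- lm(Height ~ BHDiameter + I(BHDiameter^2), data = demo.df)

# Cook's distance for every observation, drawn as vertical spikes
cd <- cooks.distance(quad.lm)
plot(cd, type = "h", xlab = "Observation number", ylab = "Cook's distance")

# Row index of the most influential observation
worst <- which.max(cd)

# Refit the same model with that single row removed
quad2.lm <- lm(Height ~ BHDiameter + I(BHDiameter^2), data = demo.df[-worst, ])

# Compare coefficients of the two fits side by side
cbind(all.data = coef(quad.lm), one.removed = coef(quad2.lm))
```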
• Task 7
o Prove using LaTeX that $y = \beta_0 + \beta_1 x + \beta_2 (x - x_k) I(x > x_k)$, where $I()$ is 1 when $x > x_k$
and 0 otherwise.
o Reproduce the above plot using the code included in the R script ($x_k = 18$); you may
need to change some of the parameter values.
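A piecewise linear fit with a single knot, in the form given in Task 7, can be sketched as follows. This is not the code from lab4.r: it is a minimal illustration on simulated data (NOT the SPRUCE data), and demo.df, xk, pw.lm and pw() are hypothetical names.

```r
# Simulated stand-in data (NOT the SPRUCE data)
set.seed(1)
demo.df <- data.frame(BHDiameter = runif(40, 5, 30))
demo.df$Height <- 2 + 1.4 * demo.df$BHDiameter -
  0.025 * demo.df$BHDiameter^2 + rnorm(40, sd = 1)

xk <- 18  # knot: the x value where the slope is allowed to change

# (x - xk) * I(x > xk) is 0 to the left of the knot and x - xk to the right
pw.lm <- lm(Height ~ BHDiameter + I((BHDiameter - xk) * (BHDiameter > xk)),
            data = demo.df)
b <- unname(coef(pw.lm))

# Fitted piecewise function: slope b[2] before the knot, b[2] + b[3] after it
pw <- function(x) b[1] + b[2] * x + b[3] * (x - xk) * (x > xk)

plot(Height ~ BHDiameter, data = demo.df)
curve(pw(x), add = TRUE, col = "blue", lwd = 2)
abline(v = xk, lty = 2)  # mark the knot
```

Because the extra term is zero at the knot itself, the two line segments join there without a jump.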
• Task 8
o Please install the following packages. You may use the single function call:
o install.packages(c("devtools", "roxygen2", "testthat", "knitr"))
o Follow the demonstration here: https://youtu.be/DWkIbk_HE9o or that given by me.
o Add one function to your package from the labs we have done so far (Labs 1-4). Make
sure it is documented. This is VERY important: learn how to do this with roxygen2;
see http://r-pkgs.had.co.nz/man.html#man
o Every lab you will add a function to this package.
o In an RMD chunk, load your package using library()
o Use your function so that I can see the output.
o Explain in a couple of sentences what the function does.
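As an illustration of what a documented function for Task 8 could look like, the sketch below uses roxygen2 comment lines (starting with #') above a small function. The function myrss() is a hypothetical example, not one from the labs; it is tried out on R's built-in cars data. In a package, the file would sit in the R/ directory and the help page would be generated from the comments, for example with devtools::document().

```r
#' Residual summary for a fitted linear model
#'
#' Hypothetical example function: returns the residual sum of squares and
#' the multiple R-squared of a model fitted with lm().
#'
#' @param model A fitted model object from \code{lm()} that includes an intercept.
#' @return A named numeric vector with elements \code{RSS} and \code{R2}.
#' @examples
#' fit <- lm(dist ~ speed, data = cars)
#' myrss(fit)
#' @export
myrss <- function(model) {
  res <- stats::residuals(model)
  y <- stats::fitted(model) + res   # recover the observed response
  rss <- sum(res^2)                 # residual sum of squares
  tss <- sum((y - mean(y))^2)       # total sum of squares
  c(RSS = rss, R2 = 1 - rss / tss)
}

# Try the function on a built-in data set
fit <- lm(dist ~ speed, data = cars)
myrss(fit)
```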
###################### LAB 4 comes to here; the rest is extra if you finish early ###############
Extra for experts: Produce the plot below (you will need segments(), text(), arrows())