Description
1. Short Essay. The purpose of k-fold cross validation is often misunderstood.
a. (10 points) How do you use cross validation to select a final (or production) model? Note:
it is not the “best” of the k models you have built using cross validation.
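One common reading of the note above is that cross validation scores a model specification (a choice of variables), not an individual fitted model. The k fold-level fits are discarded, and the chosen specification is refit on all of the data. The sketch below illustrates that workflow with NumPy on simulated data. The helper names (kfold_indices, cv_rmse), the candidate specifications, and the data are all assumptions made for illustration and are not part of the assignment.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cv_rmse(X, y, k=5, seed=0):
    """Average held-out RMSE of an ordinary least squares fit over k folds."""
    folds = kfold_indices(len(y), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        resid = y[test_idx] - A_test @ beta
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(scores))

# Simulated data (NOT the PGA data): y depends on columns 0 and 1; column 2 is noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Each candidate is a specification (a set of columns), not a fitted model.
candidates = {"x0": [0], "x0+x1": [0, 1], "x0+x1+x2": [0, 1, 2]}
cv_scores = {name: cv_rmse(X[:, cols], y) for name, cols in candidates.items()}
best_spec = min(cv_scores, key=cv_scores.get)

# Final (production) model: refit the chosen specification on ALL rows.
cols = candidates[best_spec]
A_all = np.column_stack([np.ones(len(y)), X[:, cols]])
final_beta, *_ = np.linalg.lstsq(A_all, y, rcond=None)
```

Every candidate is scored on the same folds (same seed), so the comparison between specifications is not confounded by a different random split.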
2. PGA. The pgatour2006.csv dataset contains data for 196 players. The variables in the dataset are:
Player’s name
PrizeMoney = average prize money per tournament
DrivingAccuracy = percent of times a player is able to hit the fairway with his tee
shot
GIR = percent of time a player was able to hit the green in regulation, i.e., in at
least two strokes fewer than par (Greens in Regulation)
BirdieConversion = percentage of times a player makes a birdie or better after
hitting the green in regulation
PuttingAverage = putting performance on those holes where the green was hit in
regulation.
PuttsPerRound = average number of putts per round (shots played on the green)
Etc.
a. (10 points) Build a complete first-order model. Evaluate the model using 5-fold cross
validation. If necessary, remove a non-significant variable and repeat until you have your
final first-order model. Present the model.
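The "remove a non-significant variable and repeat" loop in part (a) can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed solution: it uses simulated data with generic column names, computes OLS t-statistics directly with NumPy, and treats |t| < 2 as a rough stand-in for a two-sided 5% test (reasonable only when the residual degrees of freedom are large, as with roughly 190 here). The function names ols_t_stats and backward_eliminate are hypothetical.

```python
import numpy as np

def ols_t_stats(X, y):
    """Return OLS coefficients and their t-statistics (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]          # residual degrees of freedom
    sigma2 = resid @ resid / dof           # unbiased residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)  # coefficient covariance matrix
    return beta, beta / np.sqrt(np.diag(cov))

def backward_eliminate(X, y, names, t_crit=2.0):
    """Drop the weakest predictor one at a time until all have |t| >= t_crit."""
    cols = list(range(X.shape[1]))
    while cols:
        _, t = ols_t_stats(X[:, cols], y)
        abs_t = np.abs(t[1:])              # skip the intercept
        weakest = int(np.argmin(abs_t))
        if abs_t[weakest] >= t_crit:
            break                          # every remaining term is significant
        cols.pop(weakest)                  # remove exactly one variable, then refit
    return [names[c] for c in cols]

# Simulated data (NOT the PGA data): x3 has no effect on y.
rng = np.random.default_rng(0)
X = rng.normal(size=(196, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=196)

kept = backward_eliminate(X, y, ["x1", "x2", "x3"])
```

Only one variable is removed per pass because t-statistics change once a correlated predictor leaves the model. Each intermediate model can then be scored with 5-fold cross validation before deciding whether to continue.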
b. (10 points) Evaluate scatterplots to determine which second-order terms should be
tested. Test them using 5-fold cross validation and add them one-by-one until you arrive
at a model you feel is appropriate. Present the model.
c. (10 points) Beginning from scratch, engineer all possible second-order terms and add
them to your dataset. From this dataset, produce a model using backward selection.
Evaluate this model using 5-fold cross validation. Do you arrive at the same model as
in part (b)? Explain.
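For part (c), "all possible second-order terms" can be taken to mean every squared term and every pairwise interaction, which gives p(p+1)/2 new columns for p predictors. The sketch below generates them with itertools. The function name second_order_terms, the label format, and the placeholder values are assumptions for illustration; only the three variable names come from the list above.

```python
from itertools import combinations_with_replacement

import numpy as np

def second_order_terms(X, names):
    """Build all squares and pairwise interactions of the columns of X."""
    cols, labels = [], []
    for i, j in combinations_with_replacement(range(X.shape[1]), 2):
        cols.append(X[:, i] * X[:, j])
        # i == j gives a squared term; i != j gives an interaction.
        labels.append(f"{names[i]}^2" if i == j else f"{names[i]}*{names[j]}")
    return np.column_stack(cols), labels

names = ["GIR", "BirdieConversion", "PuttsPerRound"]
# Placeholder values only, NOT real player data.
X = np.arange(12, dtype=float).reshape(4, 3)

X2, labels = second_order_terms(X, names)
# Append the engineered terms to the first-order columns before selection.
X_full = np.column_stack([X, X2])
```

With many predictors the number of engineered terms grows quickly relative to the 196 rows, which is worth keeping in mind when interpreting the result of backward selection.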
d. (10 points) You have used two procedures to build a second-order model. Compare these
two procedures. Which do you think is “best”? Explain.