Description
1. In this exercise you’ll create some simulated data and fit a simple linear regression model to it.
(a) [1 point] Perform the following commands in R:
> set.seed(1)
> x1 <- runif(100)
> x2 <- 0.5 * x1 + rnorm(100) / 10
> Y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
Write out the form of the linear model. What are the regression coefficients?
(b) [1 point] What is the correlation between x1 and x2? Create a scatterplot displaying the
relationship between the variables.
(c) [2 points] Using this data, fit a least squares regression to predict Y using x1 and x2. Describe
the results obtained. What are β̂0, β̂1 and β̂2? How do these relate to the true β0, β1 and β2?
Can you reject the null hypothesis H0 : β1 = 0? How about H0 : β2 = 0?
(d) [1 point] Now fit a least squares regression to predict Y using only x1. Comment on your
results. Can you reject the null hypothesis H0 : β1 = 0?
(e) [1 point] Now fit a least squares regression to predict Y using only x2. Comment on your
results. Can you reject the null hypothesis H0 : β1 = 0?
(f) [2 points] Do the results obtained in (c)-(e) contradict each other? Explain your answer.
(g) [3 points] Now suppose we obtain one additional observation, which was unfortunately
mismeasured.
> x1 <- c(x1, 0.1)
> x2 <- c(x2, 0.8)
> Y <- c(Y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new
observation have on each of the models? In each model, is this observation an outlier? A
high-leverage point? Both? Explain your answers and make suitable plots.
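For orientation, the workflow in Problem 1 can be sketched in R roughly as follows. This is a minimal sketch and not a model solution: the object names (r12, fit_both, fit_x1, fit_x2, refit_both, new_idx) are hypothetical, the response is called Y throughout, and the leverage benchmark in the comments is a common rule of thumb rather than something this assignment prescribes.

```r
# Simulate the data as in part (a).
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
Y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

# (b) Correlation and scatterplot of the two predictors.
r12 <- cor(x1, x2)
plot(x1, x2, main = "x2 against x1")

# (c)-(e) The three least squares fits (hypothetical object names).
fit_both <- lm(Y ~ x1 + x2)
fit_x1   <- lm(Y ~ x1)
fit_x2   <- lm(Y ~ x2)
summary(fit_both)$coefficients  # estimates, standard errors, t and p values

# (g) Append the mismeasured observation and refit the joint model.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
Y  <- c(Y, 6)
new_idx    <- length(Y)         # index of the added point
refit_both <- lm(Y ~ x1 + x2)

# Leverage (hat value) and studentized residual of the added point.
# Rule of thumb (an assumption): compare leverage with the average (p + 1) / n.
lev  <- hatvalues(refit_both)
rstu <- rstudent(refit_both)
c(leverage = unname(lev[new_idx]), studentized = unname(rstu[new_idx]))

# Standard diagnostic plots, including residuals against leverage.
par(mfrow = c(2, 2))
plot(refit_both)
```

The same hatvalues() and rstudent() checks can be repeated for the single-predictor refits, since a point can be high-leverage in one model and an outlier in another.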
2. [6 points] This problem relates to the QDA model, in which the observations within each class are
drawn from a normal distribution with a class-specific mean vector and a class-specific covariance
matrix. We consider the simple case where p = 1; i.e. there is only one feature. Suppose that
we have K classes, and that if an observation belongs to the kth class then X comes from a
one-dimensional normal distribution, X ∼ N(µ_k, σ_k^2). Recall that the density function for the
one-dimensional normal distribution is given in Eq. 4.11 in the textbook. Prove that in this case, the
Bayes classifier is not linear. Argue that it is in fact quadratic.
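As a starting point, the shape of the argument can be sketched as below. This is an outline under stated assumptions, not a complete proof: π_k denotes the prior probability of class k (notation assumed here, following common textbook usage) and δ_k(x) is a hypothetical name for the log of the quantity the Bayes classifier maximizes.

```latex
% Class-k density for p = 1 (one-dimensional normal):
f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k}
         \exp\!\left( -\frac{(x - \mu_k)^2}{2\sigma_k^2} \right)

% The Bayes classifier picks the k maximizing \pi_k f_k(x).
% Taking logs and dropping the constant -\tfrac{1}{2}\log(2\pi):
\delta_k(x) = \log \pi_k - \log \sigma_k - \frac{(x - \mu_k)^2}{2\sigma_k^2}
            = -\frac{x^2}{2\sigma_k^2} + \frac{\mu_k}{\sigma_k^2}\,x
              - \frac{\mu_k^2}{2\sigma_k^2} - \log \sigma_k + \log \pi_k
```

The coefficient of x^2 is -1/(2σ_k^2), which depends on k. It therefore cancels between two classes only when their variances are equal; the remaining work is to make that comparison explicit.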
3. [6 points] Suppose that we wish to predict whether a given stock will issue a dividend this year
(“Yes” or “No”) based on X, last year’s percent profit. We examine a large number of companies
and discover that the mean value of X for companies that issued a dividend was X̄ = 10, while
the mean for those that didn’t was X̄ = 0. In addition, the variance of X for these two sets of
companies was σ^2 = 36. Finally, 80% of companies issued dividends. Assuming that X follows a
normal distribution, predict the probability that a company will issue a dividend this year given
that its percentage profit was X = 4 last year.
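The calculation in Problem 3 is an application of Bayes' theorem with normal class densities, and it can be checked numerically with a short R sketch. The variable names (prior_yes, mu_yes, sigma, x_new, p_yes) are hypothetical; the numbers are those stated in the problem.

```r
# Quantities stated in the problem.
prior_yes <- 0.8              # share of companies that issued a dividend
prior_no  <- 1 - prior_yes
mu_yes    <- 10               # mean of X among dividend issuers
mu_no     <- 0                # mean of X among non-issuers
sigma     <- sqrt(36)         # shared standard deviation
x_new     <- 4                # last year's percent profit

# Bayes' theorem: posterior = prior * density / sum over both classes.
num   <- prior_yes * dnorm(x_new, mean = mu_yes, sd = sigma)
den   <- num + prior_no * dnorm(x_new, mean = mu_no, sd = sigma)
p_yes <- num / den
round(p_yes, 3)
```

A hand derivation should still show the normal density written out, since the normalizing constant cancels only because both classes share the same variance.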
4. This question should be answered using the Weekly data set, which is part of the ISLR package.
This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains
1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) [2 points] Produce some numerical and graphical summaries of the Weekly data. Do there
appear to be any patterns?
(b) [2 points] Use the full data set to perform a logistic regression with Direction as the response
and the five lag variables plus Volume as predictors. Use the summary function to print the
results. Do any of the predictors appear to be statistically significant? If so, which ones?
(c) [2 points] Compute the confusion matrix and overall fraction of correct predictions. Explain
what the confusion matrix is telling you about the types of mistakes made by logistic regression.
(d) [2 points] Now fit the logistic regression model using a training data period from 1990 to 2008,
with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of
correct predictions for the held out data (that is, the data from 2009 and 2010).
(e) [2 points] Repeat (d) using LDA.
(f) [2 points] Repeat (d) using QDA.
(g) [1 point] Is it justified to use QDA? Use appropriate hypothesis test(s) we’ve seen in class.
(h) [2 points] Repeat (d) using KNN with K = 1.
(i) [1 point] Which of these methods appears to provide the best results on this data?
(j) [1 point] Could you create a better classifier? How would you do this?
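Parts (d) to (h) share the same train/test split and the same evaluation step, so one possible way to organize the code is sketched below. It assumes the ISLR, MASS and class packages are installed; evaluate() and the res_* objects are hypothetical names introduced for illustration, and the 0.5 probability cutoff for logistic regression is an assumption rather than a requirement of the assignment.

```r
library(ISLR)   # provides the Weekly data set
library(MASS)   # lda() and qda()
library(class)  # knn()

# Split by year as described in part (d).
train <- Weekly[Weekly$Year <= 2008, ]
test  <- Weekly[Weekly$Year >= 2009, ]

# Hypothetical helper: confusion matrix plus overall fraction correct.
evaluate <- function(pred, truth) {
  list(confusion = table(Predicted = pred, Actual = truth),
       accuracy  = mean(pred == truth))
}

# (d) Logistic regression; glm() models the probability of the second
# factor level of Direction, which is "Up".
fit_glm  <- glm(Direction ~ Lag2, data = train, family = binomial)
prob_up  <- predict(fit_glm, newdata = test, type = "response")
pred_glm <- factor(ifelse(prob_up > 0.5, "Up", "Down"),
                   levels = levels(test$Direction))
res_glm  <- evaluate(pred_glm, test$Direction)

# (e) LDA and (f) QDA with the same single predictor.
fit_lda <- lda(Direction ~ Lag2, data = train)
res_lda <- evaluate(predict(fit_lda, test)$class, test$Direction)
fit_qda <- qda(Direction ~ Lag2, data = train)
res_qda <- evaluate(predict(fit_qda, test)$class, test$Direction)

# (h) KNN with K = 1; knn() takes predictor matrices or data frames,
# and ties are broken at random, hence the seed.
set.seed(1)
pred_knn <- knn(train = train[, "Lag2", drop = FALSE],
                test  = test[, "Lag2", drop = FALSE],
                cl    = train$Direction, k = 1)
res_knn  <- evaluate(pred_knn, test$Direction)

# Held-out accuracy of each method, side by side.
sapply(list(logistic = res_glm, lda = res_lda, qda = res_qda, knn1 = res_knn),
       function(r) r$accuracy)
```

Printing res_glm$confusion (and its counterparts) gives the per-class breakdown that parts (c) and (d) ask you to interpret.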