## Description

Write your name on each page. The maximum score is 35 points; the due date is Friday, February 7, 2014. Please upload a CLEAN version of the solutions (one PDF file) on BlackBoard (before 10am), or hand in a CLEAN hard copy of the solutions in class on the due date. If you turn in the homework in class, make sure it is stapled!

1. The table below provides a training data set containing 6 observations, 3 variables (or predictors), and 1 qualitative response variable. Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using k-nearest neighbors.

| Observation | X1 | X2 | X3 | Y     |
|-------------|----|----|----|-------|
| 1           | 0  | 3  | 0  | Red   |
| 2           | 2  | 0  | 0  | Red   |
| 3           | 0  | 1  | 3  | Red   |
| 4           | 0  | 1  | 2  | Green |
| 5           | -1 | 0  | 1  | Green |
| 6           | 1  | 1  | 1  | Red   |

(a) [3 points] Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

(b) [2 points] What is your prediction with k = 1? Explain.

(c) [2 points] What is your prediction with k = 3? Explain.

(d) [2 points] If the Bayes decision boundary in this problem is highly nonlinear, would we expect the best value of k to be large or small? Explain.
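As a sanity check on the hand calculations in parts (a)-(c), the distances and nearest-neighbor votes can be computed directly. Here is a short sketch in Python (the course uses R, but the arithmetic is language-agnostic); the data values are taken from the table above.

```python
from collections import Counter
import math

# Training data from the table: (X1, X2, X3) and the response Y.
X = [(0, 3, 0), (2, 0, 0), (0, 1, 3), (0, 1, 2), (-1, 0, 1), (1, 1, 1)]
Y = ["Red", "Red", "Red", "Green", "Green", "Red"]
test = (0, 0, 0)

# Euclidean distance from each observation to the test point.
dist = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, test))) for x in X]

def knn_predict(k):
    # Sort observations by distance and take a majority vote among the k nearest.
    nearest = sorted(range(len(X)), key=lambda i: dist[i])[:k]
    return Counter(Y[i] for i in nearest).most_common(1)[0][0]

for i, d in enumerate(dist, start=1):
    print(f"Observation {i}: distance = {d:.3f}")
print("k = 1 prediction:", knn_predict(1))
print("k = 3 prediction:", knn_predict(3))
```

Remember that for the written solution you must still explain *why* each vote comes out the way it does, not just report the output.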

2. (a) [5 points] Suppose we would like to fit a straight line through the origin, i.e., $Y_i = \beta_1 x_i + e_i$ with $i = 1, \dots, n$, $E[e_i] = 0$, $\mathrm{Var}[e_i] = \sigma_e^2$, and $\mathrm{Cov}[e_i, e_j] = 0$ for all $i \neq j$. Find the least squares estimator $\hat\beta_1$ for the slope $\beta_1$.

(b) [5 points] Calculate the bias and the variance of the estimated slope $\hat\beta_1$.
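As a starting point (a sketch of the standard least squares argument, not a full solution): the estimator minimizes the residual sum of squares, and setting its derivative with respect to $\beta_1$ to zero gives the normal equation:

```latex
Q(\beta_1) = \sum_{i=1}^{n} \left( Y_i - \beta_1 x_i \right)^2,
\qquad
\frac{dQ}{d\beta_1} = -2 \sum_{i=1}^{n} x_i \left( Y_i - \beta_1 x_i \right) = 0
\quad\Longrightarrow\quad
\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i Y_i}{\sum_{i=1}^{n} x_i^2}.
```

For part (b), substitute $Y_i = \beta_1 x_i + e_i$ into $\hat\beta_1$ and apply the moment assumptions $E[e_i] = 0$, $\mathrm{Var}[e_i] = \sigma_e^2$, and $\mathrm{Cov}[e_i, e_j] = 0$.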

3. [10 points] Solve Exercise 8 in Chapter 2 on page 54 of the textbook (An Introduction to Statistical Learning with Applications in R).

Guidelines regarding Question 3

1. This exercise has multiple parts. Please answer each part separately and briefly. Be direct and to the point!


2. Type each question before you answer it, and provide a clear separation between parts.

3. All relevant computer output should be provided unless noted otherwise.

4. Attach your R code as an Appendix. Make sure to provide comments on what your code is doing. Keep it clean and clear!

Hints

1. In part (a), when you read the data into R, make sure to check whether the data has a header or not.

2. In part (b), you don't need to use the fix() function to view the loaded dataset. Instead, since we are using RStudio, you can view the data by clicking on the data name (college) in the Workspace window. I mention this because the fix() function can sometimes crash RStudio, especially on Macs. Another option is the command View(college).

3. Parts (a) and (b) are for data manipulation (i.e., there is no need to include any output in the report for submission). You will mainly be answering questions from part (c).

4. For part (c-iii), make sure to annotate the plot (title, x-axis, and y-axis).

5. If you want to learn more about a certain R function, you can use the ? command. For example, to learn more about the plot() function, type ?plot, and a help document will pop up.

6. In part (c-v), when you use the command par(mfrow = c(2, 2)), the plotting screen should split into 2 × 2 = 4 panels. To go back to the original setting, run par(mfrow = c(1, 1)).

4. Consider the following equation of a straight line: $Y_i = \beta_0 + \beta_1 x_i + e_i$ with $i = 1, \dots, n$, $E[e_i] = 0$, $\mathrm{Var}[e_i] = \sigma_e^2$, and $\mathrm{Cov}[e_i, e_j] = 0$ for all $i \neq j$.

(a) [2 points] Calculate the bias of the estimator of the intercept, $\hat\beta_0$.

(b) [4 points] Calculate the variance of the estimator of the intercept, $\hat\beta_0$.
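As a reminder of where to start (the standard simple linear regression formulas, not a full solution): the least squares intercept and slope are

```latex
\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x},
\qquad
\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
```

and the bias and variance follow from substituting $Y_i = \beta_0 + \beta_1 x_i + e_i$ into $\hat\beta_0$ and applying the moment assumptions stated above.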
