## Description

## 1 Questions (50 pts)

1. (9 pts) Explain whether each scenario is a classification or regression problem. And, provide the number of

data points (n) and the number of features (p).

(a) (3 pts) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number

of employees, industry and the CEO salary. We are interested in predicting CEO salary with given

factors.

Solution goes here.

(b) (3 pts) We are considering launching a new product and wish to know whether it will be a success or

a failure. We collect data on 20 similar products that were previously launched. For each product we

have recorded whether it was a success or failure, price charged for the product, marketing budget,

competition price, and ten other variables.

Solution goes here.

(c) (3 pts) We are interesting in predicting the % change in the US dollar in relation to the weekly changes

in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record

the % change in the dollar, the % change in the US market, the % change in the British market, and

the % change in the German market.

Solution goes here.

2. (6 pts) The table below provides a training data set containing six observations, three predictors, and one

qualitative response variable.

X1 X2 X3 Y

0 3 0 Red

2 0 0 Red

0 1 3 Red

0 1 2 Green

-1 0 1 Green

1 1 1 Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest

neighbors.

(a) (2 pts) Compute the Euclidean distance between each observation and the test point, X1 = X2 =

X3 = 0.

Solution goes here.

(b) (2 pts) What is our prediction with K = 1? Why?

Solution goes here.

(c) (2 pts) What is our prediction with K = 3? Why?

Solution goes here.

3. (12 pts) When the number of features p is large, there tends to be a deterioration in the performance of

KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made.

This phenomenon is known as the curse of dimensionality,

and it ties into the fact that non-parametric approaches often perform poorly when p is large.

(a) (2pts) Suppose that we have a set of observations, each with measurements on p = 1 feature, X.

We assume that X is uniformly (evenly) distributed on [0, 1]. Associated with each observation is a

response value.

Suppose that we wish to predict a test observation’s response using only observations

that are within 10% of the range of X closest to that test observation. For instance, in order to predict

the response for a test observation with X = 0.6, we will use observations in the range [0.55, 0.65].

On average, what fraction of the available observations will we use to make the prediction?

Solution goes here.

(b) (2pts) Now suppose that we have a set of observations, each with measurements on p = 2 features, X1

and X2. We assume that predict a test observation’s response using only observations that (X1, X2)

are uniformly distributed on [0, 1] × [0, 1].

We wish to are within 10% of the range of X1 and within

10% of the range of X2 closest to that test observation. For instance, in order to predict the response

for a test observation with X1 = 0.6 and X2 = 0.35, we will use observations in the range [0.55,

0.65] for X1 and in the range [0.3, 0.4] for X2. On average, what fraction of the available observations

will we use to make the prediction?

Solution goes here.

(c) (2pts) Now suppose that we have a set of observations on p = 100 features. Again the observations

are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish

to predict a test observation’s response using observations within the 10% of each feature’s range that

is closest to that test observation. What fraction of the available observations will we use to make the

prediction?

Solution goes here.

(d) (3pts) Using your answers to parts (a)–(c), argue that a drawback of KNN when p is large is that there

are very few training observations “near” any given test observation.

Solution goes here.

(e) (3pts) Now suppose that we wish to make a prediction for a test observation by creating a p-dimensional

hypercube centered around the test observation that contains, on average, 10% of the training observations. For p =1, 2, and 100, what is the length of each side of the hypercube? Comment on your

answer.

Solution goes here.

4. (6 pts) Supoose you trained a classifier for a spam detection system. The prediction result on the test set is

summarized in the following table.

Predicted class

Spam not Spam

Actual class Spam 8 2

not Spam 16 974

Calculate

(a) (2 pts) Accuracy Solution goes here.

(b) (2 pts) Precision Solution goes here.

(c) (2 pts) Recall Solution goes here.

5. (9pts) Again, suppose you trained a classifier for a spam filter. The prediction result on the test set is

summarized in the following table. Here, ”+” represents spam, and ”-” means not spam.

Confidence positive Correct class

0.95 +

0.85 +

0.8 –

0.7 +

0.55 +

0.45 –

0.4 +

0.3 +

0.2 –

0.1 –

(a) (6pts) Draw a ROC curve based on the above table.

Solution goes here.

(b) (3pts) (Real-world open question) Suppose you want to choose a threshold parameter so that mails

with confidence positives above the threshold can be classified as spam. Which value will you choose?

Justify your answer based on the ROC curve.

Solution goes here.

6. (8 pts) In this problem, we will walk through a single step of the gradient descent algorithm for logistic

regression. As a reminder,

f(x; θ) = σ(θ

⊤x)

Cross entropy loss L(ˆy, y) = −[y log ˆy + (1 − y) log(1 − yˆ)]

The single update step θ

t+1 = θ

t − η∇θL(f(x; θ), y)

(a) (4 pts) Compute the first gradient ∇θL(f(x; θ), y).

Solution goes here.

(b) (4 pts) Now assume a two dimensional input. After including a bias parameter for the first dimension,

we will have θ ∈ R

3

.

Initial parameters : θ

0 = [0, 0, 0]

Learning rate η = 0.1

data example : x = [1, 3, 2], y = 1

Compute the updated parameter vector θ

1

from the single update step.

Solution goes here.

## 2 Programming (50 pts)

1. (10 pts) Use the whole D2z.txt as training set. Use Euclidean distance (i.e. A = I). Visualize the predictions

of 1NN on a 2D grid [−2 : 0.1 : 2]2

.

That is, you should produce test points whose first feature goes over

−2, −1.9, −1.8, . . . , 1.9, 2, so does the second feature independent of the first feature. You should overlay

the training set in the plot, just make sure we can tell which points are training, which are grid.

The expected figure looks like this.

Spam filter Now, we will use ’emails.csv’ as our dataset. The description is as follows.

• Task: spam detection

• The number of rows: 5000

• The number of features: 3000 (Word frequency in each email)

• The label (y) column name: ‘Predictor’

• For a single training/test set split, use Email 1-4000 as the training set, Email 4001-5000 as the test

set.

• For 5-fold cross validation, split dataset in the following way.

– Fold 1, test set: Email 1-1000, training set: the rest (Email 1001-5000)

– Fold 2, test set: Email 1000-2000, training set: the rest

– Fold 3, test set: Email 2000-3000, training set: the rest

– Fold 4, test set: Email 3000-4000, training set: the rest

– Fold 5, test set: Email 4000-5000, training set: the rest

2. (8 pts) Implement 1NN, Run 5-fold cross validation. Report accuracy, precision, and recall in each fold.

Solution goes here.

3. (12 pts) Implement logistic regression (from scratch). Use gradient descent (refer to question 6 from part 1)

to find the optimal parameters. You may need to tune your learning rate to find a good optimum. Run 5-fold

cross validation. Report accuracy, precision, and recall in each fold.

Solution goes here.

4. (10 pts) Run 5-fold cross validation with kNN varying k (k=1, 3, 5, 7, 10). Plot the average accuracy versus

k, and list the average accuracy of each case.

Expected figure looks like this.

Solution goes here.

5. (10 pts) Use a single training/test setting. Train kNN (k=5) and logistic regression on the training set, and

draw ROC curves based on the test set.

Expected figure looks like this. Note that the logistic regression results may differ.

Solution goes here.