## Description

## Learning Objectives:

By successfully completing this assignment you will be able to…

- Explain the bias-variance tradeoff of supervised machine learning and the impact of model flexibility on algorithm performance
- Perform supervised machine learning training and performance evaluation
- Implement a k-nearest neighbors machine learning algorithm from scratch in a style similar to that of popular machine learning tools like `scikit-learn`

- Describe how KNN classification works, the method’s reliance on distance measurements, and the impact of higher dimensionality on computational speed
- Apply regression (linear regression) and classification (KNN) supervised learning techniques to data and evaluate the performance of those methods
- Construct simple feature transformations for improving model fit in linear models
- Fit a `scikit-learn` supervised learning technique to training data and make predictions using it

```
# MAC USERS TAKE NOTE:
# For clearer plots in Jupyter notebooks on macs, run the following line of code:
# %config InlineBackend.figure_format = 'retina'
```

# Conceptual Questions on Supervised Learning

## 1

**[4 points]** For each part below, indicate whether we would generally expect the performance of a flexible statistical learning method to be *better* or *worse* than an inflexible method. Justify your answer.

- The sample size $n$ is extremely large, and the number of predictors $p$ is small.
- The number of predictors $p$ is extremely large, and the number of observations $n$ is small.
- The relationship between the predictors and response is highly non-linear.
- The variance of the error terms, i.e. $\sigma^2 = \text{Var}(\epsilon)$, is extremely high.

**ANSWER**

## 2

**[6 points]** For each of the following, (i) explain if each scenario is a classification or regression problem AND why, (ii) indicate whether we are most interested in inference or prediction for that problem AND why, and (iii) provide the sample size $n$ and number of predictors $p$ indicated for each scenario.

**(a)** We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

**(b)** We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

**(c)** We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.

**ANSWER**

# Practical Questions

## 3

**[6 points] Classification using KNN**. The table below provides a training dataset containing six observations (a.k.a. samples) ($n=6$) each with three predictors (a.k.a. features) ($p=3$), and one qualitative response variable (a.k.a. target).

*Table 1. Training dataset with $n=6$ observations in $p=3$ dimensions with a categorical response, $y$*

| Obs. | $x_1$ | $x_2$ | $x_3$ | $y$ |
|---|---|---|---|---|
| 1 | 0 | 3 | 0 | Red |
| 2 | 2 | 0 | 0 | Red |
| 3 | 0 | 1 | 3 | Red |
| 4 | 0 | 1 | 2 | Blue |
| 5 | -1 | 0 | 1 | Blue |
| 6 | 1 | 1 | 1 | Red |

We want to use the above training dataset to make a prediction, $\hat{y}$, for an unlabeled test data observation where $x_1 = x_2 = x_3 = 0$ using $K$-nearest neighbors. You are given some code below to get you started. *Note: coding is only required for part (a), for (b)-(d) please provide your reasoning based on your answer to part (a)*.

**(a)** Compute the Euclidean distance between each observation and the test point, $x_1 = x_2 = x_3 = 0$. Present your answer in a table similar in style to Table 1 with observations 1-6 as the row headers.

**(b)** What is our prediction, $\hat{y}$, when $K=1$ for the test point? Why?

**(c)** What is our prediction, $\hat{y}$, when $K=3$ for the test point? Why?

**(d)** If the Bayes decision boundary (the optimal decision boundary) in this problem is highly nonlinear, then would we expect the *best* value of $K$ to be large or small? Why?

```
import numpy as np

X = np.array([[ 0, 3, 0],
              [ 2, 0, 0],
              [ 0, 1, 3],
              [ 0, 1, 2],
              [-1, 0, 1],
              [ 1, 1, 1]])
y = np.array(['r', 'r', 'r', 'b', 'b', 'r'])
```
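As a hint for part (a), one possible way to compute Euclidean distances from several points to a single query point is NumPy broadcasting. The sketch below uses a small made-up array; `pts` and `query` are hypothetical names and values, not the assignment data, so you would adapt it to `X` above.

```python
import numpy as np

# Hypothetical toy points (not the assignment data) and a query point
pts = np.array([[3.0, 4.0, 0.0],
                [1.0, 2.0, 2.0],
                [0.0, 0.0, 6.0]])
query = np.array([0.0, 0.0, 0.0])

# Broadcasting subtracts the query from every row; squares are summed per row
dists = np.sqrt(((pts - query) ** 2).sum(axis=1))

# Indices of the points sorted from nearest to farthest
order = np.argsort(dists)

print(dists)
print(order)
```

Sorting the indices with `np.argsort` is convenient for KNN because the first $K$ entries of `order` identify the $K$ nearest neighbors.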

**ANSWER**:

## 4

**[18 points] Build your own classification algorithm**.

**(a)** Build a working version of a binary KNN classifier using the skeleton code below. We’ll use the `sklearn` convention that a supervised learning algorithm has the methods `fit`, which trains your algorithm (for KNN that means storing the data), and `predict`, which identifies the K nearest neighbors and determines the most common class among those K neighbors. *Note: Most classification algorithms typically also have a method `predict_proba` which outputs the confidence score of each prediction, but we will explore that in a later assignment.*

**(b)** Load the datasets to be evaluated here. Each includes features ($X$) and targets ($y$), split into training and test sets, for both a low dimensional dataset ($p=2$ features/predictors) and a higher dimensional dataset ($p=100$ features/predictors). For each of these datasets there are $n=1000$ observations of each. They can be found in the `data` subfolder in the `assignments` folder on GitHub. Each file is labeled similar to `A2_X_train_low.csv`, which lets you know whether the dataset is of features, $X$, or targets, $y$; training or testing; and low or high dimensions.

**(c)** Train your classifier on first the low dimensional dataset and then the high dimensional dataset with $k=5$. Evaluate the classification performance on the corresponding test data for each of those trained models. Calculate the time it takes each model to make the predictions and the overall accuracy of those predictions for each corresponding set of test data – state each.

**(d)** Compare your implementation’s accuracy and computation time to the scikit-learn `KNeighborsClassifier` class. How do the results and speed compare to your implementation?

**(e)** Some supervised learning algorithms are more computationally intensive during training than testing. What are the drawbacks of the prediction process being slow? In what cases in practice might slow testing (inference) be more problematic than slow training?

```
# Skeleton code for part (a) to write your own kNN classifier

class Knn:
    # k-Nearest Neighbor class object for classification training and testing
    def __init__(self):
        pass  # Initialize any properties you need here

    def fit(self, x, y):
        # Save the training data to properties of this class
        pass

    def predict(self, x, k):
        y_hat = []  # Variable to store the estimated class labels
        # Calculate the distance from each vector in x to the training data
        # Return the estimated targets
        return y_hat

# Metric of overall classification accuracy
# (a more general function, sklearn.metrics.accuracy_score, is also available)
def accuracy(y, y_hat):
    nvalues = len(y)
    accuracy = sum(y == y_hat) / nvalues
    return accuracy
```
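For the timing requested in part (c), one option is to wrap the prediction call in a small timer. This is a minimal sketch using `time.perf_counter`; the helper `timed_call` and the stand-in function `slow_square` are hypothetical names introduced here for illustration only.

```python
import time

def timed_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in for a prediction call (hypothetical example function)
def slow_square(values):
    return [v * v for v in values]

result, seconds = timed_call(slow_square, range(1000))
print(f"Elapsed: {seconds:.6f} s")
```

In your notebook you could pass your classifier's `predict` method and its arguments in place of `slow_square`, so that only the prediction step is timed and data loading is excluded.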

**ANSWER**:

## 5

**[20 points] Bias-variance tradeoff: exploring the tradeoff with a KNN classifier**. This exercise will illustrate the impact of the bias-variance tradeoff on classifier performance by investigating how model flexibility impacts classifier decision boundaries. For this problem, please use scikit-learn’s KNN implementation rather than your own implementation, as you did at the end of the last question.

**(a)** Create a synthetic dataset (with both features and targets). Use the `make_moons` function with the parameter `noise=0.35` to generate 1000 random samples.

**(b)** Visualize your data: scatterplot your random samples with each class in a different color.

**(c)** Create 3 different data subsets by selecting 100 of the 1000 data points at random three times (with replacement). For each of these 100-sample datasets, fit three separate k-Nearest Neighbor classifiers with: $k = \{1, 25, 50\}$. This will result in 9 combinations (3 datasets, each with 3 trained classifiers).

**(d)** For each combination of dataset and trained classifier plot the decision boundary (similar in style to Figure 2.15 from *Introduction to Statistical Learning*). This should form a 3-by-3 grid. Each column should represent a different value of $k$ and each row should represent a different dataset.

**(e)** What do you notice about the difference between the rows and the columns? Which decision boundaries appear to best separate the two classes of data? Which decision boundaries vary the most as the data change?

**(f)** Explain the bias-variance tradeoff using the example of the plots you made in this exercise and its implications for training supervised machine learning algorithms.
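For the subsets in part (c), drawing rows with replacement could be sketched as follows. The arrays `X_all` and `y_all` are hypothetical stand-ins for your generated samples, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility

# Hypothetical stand-ins for the 1000 generated samples
X_all = rng.normal(size=(1000, 2))
y_all = rng.integers(0, 2, size=1000)

subsets = []
for _ in range(3):
    # Draw 100 row indices with replacement, then index features and targets together
    idx = rng.choice(len(X_all), size=100, replace=True)
    subsets.append((X_all[idx], y_all[idx]))

print([Xs.shape for Xs, ys in subsets])
```

Indexing the features and targets with the same `idx` keeps each sampled point paired with its own label.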

Notes and tips for plotting decision boundaries (as in part d):

- Resource for plotting decision boundaries with meshgrid and contour: https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
- If you would like to change the colors of the background, and do not like any of the existing cmap available in matplotlib, you can make your own cmap using the 2 sets of rgb values. Sample code (replace r, g, b with respective rgb values):

```
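from matplotlib.colors import LinearSegmentedColormap
(r1, g1, b1), (r2, g2, b2) = (255, 200, 200), (200, 200, 255)  # example values; replace with your own
newcmp = LinearSegmentedColormap.from_list("new", [(r1/255, g1/255, b1/255), (r2/255, g2/255, b2/255)], N=2)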
```
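The meshgrid approach mentioned in the tips can be sketched roughly as below: build a regular grid over the feature space, predict a class at every grid point, and reshape the predictions to the grid's shape for contour plotting. The helper `grid_predictions` and the stand-in `toy_predict` are hypothetical names for illustration; with a fitted scikit-learn classifier you would pass its `predict` method instead.

```python
import numpy as np

def grid_predictions(predict, x_range, y_range, step=0.05):
    """Evaluate predict on a regular 2-D grid; returns (xx, yy, zz)."""
    xx, yy = np.meshgrid(np.arange(x_range[0], x_range[1], step),
                         np.arange(y_range[0], y_range[1], step))
    # Flatten the grid into an (n_points, 2) array of feature pairs
    grid = np.c_[xx.ravel(), yy.ravel()]
    # Predict a label for every grid point, then restore the grid shape
    zz = np.asarray(predict(grid)).reshape(xx.shape)
    return xx, yy, zz

# Hypothetical stand-in classifier: class 1 wherever x + y > 0
def toy_predict(points):
    return (points.sum(axis=1) > 0).astype(int)

xx, yy, zz = grid_predictions(toy_predict, (-1, 1), (-1, 1), step=0.5)
# With matplotlib, plt.contourf(xx, yy, zz) would shade the predicted regions
print(zz)
```

A smaller `step` gives a smoother-looking boundary at the cost of more predictions, which matters for KNN because every grid point requires a neighbor search.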

**ANSWER**

## 6

**[18 points] Bias-variance trade-off II: Quantifying the tradeoff**. This exercise explores the impact of the bias-variance tradeoff on classifier performance by looking at the performance on both training and test data.

Here, the value of $k$ determines how flexible our model is.

**(a)** Using the function created earlier to generate random samples (using the `make_moons` function setting the `noise` parameter to 0.35), create a new set of 1000 random samples, and call this dataset your test set and the previously created dataset your training set.

**(b)** Train a kNN classifier on your training set for $k = 1, 2, \ldots, 500$. Apply each of these trained classifiers to both your training dataset and your test dataset and plot the classification error (fraction of incorrect predictions).

**(c)** What trend do you see in the results?

**(d)** What values of $k$ represent high bias and which represent high variance?

**(e)** What is the optimal value of $k$ and why?

**(f)** In KNN classifiers, the value of $k$ controls the flexibility of the model – what controls the flexibility of other models?

**ANSWER**

## 7

**[18 points] Linear regression and nonlinear transformations**. Linear regression can be used to model nonlinear relationships when feature variables are properly transformed to represent the nonlinearities in the data. In this exercise, you’re given training and test data contained in files “A2_Q7_train.csv” and “A2_Q7_test.csv” in the “data” folder for this assignment. Your goal is to develop a regression algorithm from the training data that performs well on the test data.

*Hint: Use the scikit-learn `LinearRegression` class.*

**(a)** Create a scatter plot of your training data.

**(b)** Estimate a linear regression model ($y = a_0 + a_1 x$) for the training data and calculate both the $R^2$ value and mean square error for the fit of that model for the training data. Also provide the equation representing the estimated model (e.g. $y = a_0 + a_1 x$, but with the estimated coefficients inserted). Consider this your baseline model against which you will compare other model options. *Evaluating performance on the training data is not a measure of how well this model would generalize to unseen data. We will evaluate performance on the test data once we see our models fit the training data decently well.*

**(c)** If features can be nonlinearly transformed, a linear model may incorporate those non-linear feature transformation relationships in the training process. From looking at the scatter plot of the training data, choose a transformation of the predictor variable, $x$, that may make sense for these data. This will be a multiple regression model of the form $y = a_0 + a_1 z_1 + a_2 z_2 + \ldots + a_n z_n$. Here $z_i$ could be any transformations of $x$ – perhaps it’s $\frac{1}{x}$, $\log(x)$, $\sin(x)$, $x^k$ (where $k$ is any power of your choosing). Provide the estimated equation for this multiple regression model (e.g. if you chose your predictors to be $z_1 = x$ and $z_2 = \log(x)$, your model would be of the form $y = a_0 + a_1 x + a_2 \log(x)$). Also provide the $R^2$ and mean square error of the fit for the training data.

**(d)** Visualize the model fit to the training data. Using both of the models you created in parts (b) and (c), plot the original data (as a scatter plot) AND the curves representing your models (each as a separate curve) from (b) and (c).

**(e)** Now it’s time to compare your models and evaluate the generalization performance on held out test data. Using the models above from (b) and (c), apply them to the test data and estimate the $R^2$ and mean square error of the test dataset.

**(f)** Which models perform better on the training data, and which on the test data? Why?

**(g)** Imagine that the test data were significantly different from the training dataset. How might this affect the predictive capability of your model? How would the accuracy of generalization performance be impacted? Why?

*To help get you started – here’s some code to help you load in the data for this exercise (you’ll just need to update the path)*:

```
import numpy as np
import pandas as pd
path = './data/'
train = pd.read_csv(path + 'A2_Q7_train.csv')
test = pd.read_csv(path + 'A2_Q7_test.csv')
x_train = train.x.values
y_train = train.y.values
x_test = test.x.values
y_test = test.y.values
```
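As a rough illustration of the feature-transformation idea in part (c), the sketch below fits a linear model on transformed features using synthetic data. The data and the choice of $z_1 = x$, $z_2 = x^2$ are assumptions made purely for illustration; they are not the assignment data and not a suggested answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical synthetic data (not the assignment files): y = 1 + 2x + 3x^2, no noise
x = np.linspace(0.0, 2.0, 50)
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Stack the transformed features z1 = x and z2 = x^2 as columns
Z = np.column_stack([x, x ** 2])

# The model is still linear in its coefficients, so ordinary least squares applies
model = LinearRegression().fit(Z, y)
y_hat = model.predict(Z)

print("intercept a0:", model.intercept_)
print("coefficients a1, a2:", model.coef_)
print("R^2:", r2_score(y, y_hat))
print("MSE:", mean_squared_error(y, y_hat))
```

The same transformation must be applied to the test features before calling `predict`, since the fitted model expects the same columns it was trained on.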