Description
Pipeline A standard machine learning pipeline usually consists of three parts: 1) load and pre-process
the data; 2) train a model on the training set and use the validation set to tune hyperparameters; 3) test
the final model and report the result. In this assignment, we provide you with a template for each problem
(linear_regression.py, knn.py) that follows this 3-step pipeline. We provide the data loading
and preprocessing code in step 1 and define the output format in step 3. You will be asked to implement
the algorithms in step 2 to complete the pipeline. Do not make any modification to our implementations for steps
1 and 3.
Please do not import packages that are not listed in the provided scripts. Follow the instructions in each
section strictly to code up your solutions. DO NOT CHANGE THE OUTPUT FORMAT. DO NOT MODIFY
THE CODE UNLESS WE INSTRUCT YOU TO DO SO. A homework solution that mismatches the provided
setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure
that your code runs with python3 on the VM.
Datasets
Regression Dataset The UCI Wine Quality dataset lists 11 chemical measurements of 4898 white wine
samples, as well as an overall quality score per sample as determined by wine connoisseurs; see winequality-white.csv. We split the data into training, validation, and test sets in the preprocessing code. You will use
linear regression to predict wine quality from the chemical measurement features.
Figure 1: Example output
Classification Dataset MNIST is one of the most well-known datasets in computer vision, consisting of
images of handwritten digits from 0 to 9. We will be working with a subset of the official version of
MNIST, denoted as mnist_subset. In particular, we randomly sampled 700 images from each category
and split them into training, validation, and test sets. This subset corresponds to a JSON file named
mnist_subset.json. JSON is a lightweight data-interchange format, similar to a dictionary. After loading the file, you can access its training, validation, and test splits using the keys 'train', 'valid', and 'test',
respectively. For example, if we load mnist_subset.json into the variable x, then x['train'] refers to the training set
of mnist_subset. This set is a list with two elements: x['train'][0] contains the features, of size N (samples)
× D (dimension of features), and x['train'][1] contains the corresponding labels, of size N.
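As an illustration of this layout, the snippet below builds a hypothetical miniature stand-in for mnist_subset.json in memory (2 training samples, D = 3) and accesses the splits the same way the assignment describes; in the actual pipeline the file is loaded by the provided script, and the real feature dimension differs.

```python
import json

# Hypothetical miniature stand-in for mnist_subset.json (NOT the real data):
# each split maps to [features, labels].
blob = json.dumps({
    "train": [[[0.0, 0.1, 0.2], [0.3, 0.4, 0.5]], [7, 1]],
    "valid": [[[0.6, 0.7, 0.8]], [3]],
    "test":  [[[0.9, 1.0, 1.1]], [5]],
})

x = json.loads(blob)  # with the real file: x = json.load(open("mnist_subset.json"))
train_features, train_labels = x["train"]  # features: N x D, labels: length N
```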
Example output For linear_regression.py in Problem 1, you should be able to run it on the VM and see
output similar to Fig. 1.
Collaboration: Please consult the syllabus for what is and is not acceptable collaboration. Review the rules
on academic conduct in the syllabus: a single instance of plagiarism can adversely affect you significantly
more than you could stand to gain.
Problem 1 Linear Regression (30 points)
You are asked to implement four Python functions for linear regression. The inputs and outputs of the functions
are specified in linear_regression.py. You should be able to run linear_regression.py after you
finish the implementation. Note that we have already appended the column of 1's to the feature matrix, so
you do not need to modify the data yourself.
Problem 1.1 Linear regression (6 points)
Implement linear regression and return the model parameters. What to submit: fill in the function
linear_regression_noreg(X, y).
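One possible closed-form approach is the normal equation, w = (XᵀX)⁻¹Xᵀy. The sketch below is illustrative only; the provided template fixes the exact signature and any numerical requirements.

```python
import numpy as np

def linear_regression_noreg(X, y):
    # Sketch: solve the normal equation (X^T X) w = X^T y.
    # np.linalg.solve is preferred over explicitly inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)
```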
Problem 1.2 Regularized linear regression (10 points)
To prevent overfitting, we usually add regularization. For now, we will focus on L2 regularization. In
this case, the optimization problem is:
w_λ = arg min_w ||Xw − y||₂² + λ||w||₂²    (1)

where λ ≥ 0 is a hyper-parameter used to control the complexity of the resulting model. When λ = 0,
the model reduces to the usual (unregularized) linear regression. For λ > 0, the objective function balances
two terms: (1) the data-dependent quadratic loss ||Xw − y||₂², and (2) a function of the model parameters, ||w||₂².
Implement your regularized linear regression algorithm.
What to submit: fill in the function regularized_linear_regression(X, y, λ).
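The minimizer of Eqn. (1) has the closed form w = (XᵀX + λI)⁻¹Xᵀy. A minimal sketch follows, with the λ argument spelled `lambd` since `lambda` is a Python keyword; the template's actual parameter name may differ.

```python
import numpy as np

def regularized_linear_regression(X, y, lambd):
    # Sketch: solve (X^T X + lambda * I) w = X^T y, the minimizer of Eqn. (1).
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lambd * np.eye(D), X.T @ y)
```

Note that with λ = 0 this reduces to the unregularized solution, and larger λ shrinks the norm of w.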
Problem 1.3 Tuning the regularization hyper-parameter (9 points)
Use the validation set to tune the regularization parameter λ ∈ {0, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1, 10, 10²}. We
select the best one, i.e., the value that results in the lowest mean square error on the validation set.
What to submit: fill in the function tune_lambda(Xtrain, ytrain, Xval, yval, lambds).
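The selection loop can be sketched as follows: fit once per candidate λ on the training set, score each fit by validation MSE, and return the argmin. This is an assumption-laden sketch (it inlines the regularized closed form); the template's function contracts govern the real implementation.

```python
import numpy as np

def tune_lambda(Xtrain, ytrain, Xval, yval, lambds):
    # Sketch: pick the candidate lambda with the lowest validation MSE.
    def fit(lam):
        D = Xtrain.shape[1]
        return np.linalg.solve(Xtrain.T @ Xtrain + lam * np.eye(D),
                               Xtrain.T @ ytrain)
    return min(lambds, key=lambda lam: np.mean((Xval @ fit(lam) - yval) ** 2))
```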
Problem 1.4 Mean square error (5 points)
Report the mean square error of the model on the given test set.
What to submit: fill in the function test_error(w, X, y).
Problem 2 k-Nearest Neighbors (20 points)
Review In the lecture, we defined the classification rule of the k-nearest neighbors (kNN) algorithm for an
input example x as

v_c = Σ_{x_i ∈ knn(x)} 1(y_i == c),  ∀c ∈ [C]    (2)

y = arg max_{c ∈ [C]} v_c    (3)

where [C] is the set of classes.
A common distance measure between two samples x_i and x_j is the Euclidean distance:

d(x_i, x_j) = ||x_i − x_j||₂ = √( Σ_d (x_id − x_jd)² ).    (4)
You are asked to implement four Python functions for kNN. The inputs and outputs are specified in knn.py.
You should be able to run knn.py after you finish the implementation.
Problem 2.1 Distance calculation (3 points)
Compute the distance between test data points in X and training data points in Xtrain based on Eqn. 4.
What to submit: fill in the function compute_distances(Xtrain, X).
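A vectorized way to compute Eqn. 4 for all pairs uses the expansion ||a − b||² = ||a||² − 2a·b + ||b||². The sketch below assumes the output is shaped (num test, num train); the convention actually used is fixed by knn.py.

```python
import numpy as np

def compute_distances(Xtrain, X):
    # Pairwise Euclidean distances (Eqn. 4) without explicit Python loops,
    # via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    sq_train = np.sum(Xtrain ** 2, axis=1)        # (Ntrain,)
    sq_test = np.sum(X ** 2, axis=1)[:, None]     # (Ntest, 1)
    sq = sq_test - 2.0 * (X @ Xtrain.T) + sq_train
    return np.sqrt(np.maximum(sq, 0.0))           # clip tiny negative round-off
```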
Problem 2.2 kNN classifier (5 points)
Implement the kNN classifier based on Eqn. 3. Your algorithm should output the predictions for the test
set. Important: You do not need to worry about ties in distance when finding the k nearest neighbor set.
However, when there are ties in the majority label of the k nearest neighbor set, you should return the label
with the smallest index. For example, when k = 5, if the labels of the 5 nearest neighbors happen to be
1, 1, 2, 2, 7, your prediction should be the digit 1.
What to submit: fill in the function predict_labels(k, ytrain, dists).
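The majority vote of Eqns. 2-3 can be sketched with np.bincount, whose argmax naturally returns the smallest label on ties, matching the tie-breaking rule above. This assumes dists[i, j] is the distance from test point i to training point j; the template fixes the real convention.

```python
import numpy as np

def predict_labels(k, ytrain, dists):
    # For each test point: find the k nearest training points, count label
    # votes, and take the argmax (ties go to the smallest label index).
    ytrain = np.asarray(ytrain)
    preds = []
    for row in dists:
        nearest = np.argsort(row)[:k]        # indices of the k nearest neighbors
        votes = np.bincount(ytrain[nearest]) # votes[c] = v_c from Eqn. (2)
        preds.append(int(np.argmax(votes)))  # Eqn. (3); argmax breaks ties low
    return preds
```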
Problem 2.3 Report the accuracy (6 points)
The classification accuracy is defined as:
accuracy = (# of correctly classified test examples) / (# of test examples)    (5)

The accuracy value should be in the range [0, 1].
What to submit: fill in the code for the function compute_accuracy(y, ypred).
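Eqn. 5 amounts to averaging an elementwise equality test, as in this minimal sketch:

```python
import numpy as np

def compute_accuracy(y, ypred):
    # Eqn. (5): fraction of test examples classified correctly, in [0, 1].
    y, ypred = np.asarray(y), np.asarray(ypred)
    return float(np.mean(y == ypred))
```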
Problem 2.4 Tuning k (6 points)
Find the k among {1, 3, 5, 7, 9} that gives the best classification accuracy on the validation set.
What to submit: fill in the code for the function find_best_k(K, ytrain, dists, yval).
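The search can be sketched as one accuracy evaluation per candidate k, keeping the best; the sketch assumes dists holds validation-to-training distances and that the first (smallest) k wins accuracy ties, both of which the template may specify differently.

```python
import numpy as np

def find_best_k(K, ytrain, dists, yval):
    # Sketch: predict on the validation set for each candidate k and keep
    # the k with the highest accuracy (first candidate wins ties).
    ytrain, yval = np.asarray(ytrain), np.asarray(yval)
    best_k, best_acc = None, -1.0
    for k in K:
        preds = [int(np.argmax(np.bincount(ytrain[np.argsort(row)[:k]])))
                 for row in dists]
        acc = float(np.mean(np.asarray(preds) == yval))
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```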