Description
Instructions
In this assignment you are required to write three scripts in Python. They will be submitted as three separate files,
although you are free to copy chunks of code from one script to the next as desired. The names for the script
files are specified in problems 1, 5 and 8, below.
Included in the homework 2 release are two sample scripts (fitpoly_incomplete.py and cv_demo_incomplete.py)
and two data files (womens100.csv and synthdata2014.csv). The sample scripts are provided for your
convenience; you may use any part of the code in those scripts for your submissions. Note that neither
will run just as provided: you must fill in the calculation of w. The data files are provided in .csv (comma
separated values) format. The script fitpoly_incomplete.py shows how to load the files.
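As a rough illustration of loading such a file (not necessarily how the provided script does it), the sketch below reads two comma-separated columns with numpy.loadtxt. The function name load_xy, and the assumption that each file holds exactly two numeric columns (x, then t) with no header row, are hypothetical; check the actual data files before relying on it.

```python
import io
import numpy as np

def load_xy(source):
    """Read two comma-separated numeric columns and return them as (x, t).

    Assumes (hypothetically) no header row and exactly two columns.
    `source` may be a file path or an open file-like object.
    """
    data = np.loadtxt(source, delimiter=',', ndmin=2)
    return data[:, 0], data[:, 1]

# Usage with in-memory text standing in for a real .csv file
x, t = load_xy(io.StringIO("1.0,2.0\n3.0,4.0\n"))
```

Passing a path such as 'womens100.csv' instead of the StringIO object should work the same way if the file matches the assumed layout.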
All problems except problem 1 require that you provide some “written” answer (in some cases also figures),
so you will also submit a .pdf of your written answers. You can use LaTeX or any other system, including
handwriting, as long as the final version is a PDF; plots, of course, must be program-generated.
The final submission will include (minimally) the three scripts and a PDF version of your
written part of the assignment. You are required to create either a .zip or tarball (.tar.gz /
.tgz) archive of all of the files for your submission and submit the archive to the d2l dropbox
by the date/time deadline above.
NOTE: Problems 6 and 7 are required for Graduate students only; Undergraduates may complete them for
extra credit equal to the point value.
(FCMA refers to the course text: Rogers and Girolami (2012), A First Course in Machine Learning. For
general notes on using LaTeX to typeset math, see: http://en.wikibooks.org/wiki/LaTeX/Mathematics)
1. [2 points] Adapted from Exercise 1.2 of FCMA p.35:
Write a Python script that can find the parameters w for an arbitrary dataset of xn, tn pairs. You
will use this script to answer problem 2, which only requires fitting a simple line to the data (i.e., you
only need to fit parameters w0 and w1); however, in problems 5 and 8 you will need to fit higher-order
polynomial models (e.g., $t = w_0 + w_1 x + w_2 x^2 + \dots$), so it is recommended that your script is generalized
to handle higher-order polynomials. The script fitpoly_incomplete.py is provided to help get you
started. It provides helper functions to read data, plot data, and plot the model
(once you’ve determined the weight vector w), but fitpoly (starting on line 36), the function that computes
the linear least-squares fit, is incomplete and you need to fill it in. It takes as input the data vector x,
the target values vector t, and the scalar model_order, which is an integer representing the polynomial
order of the model. fitpoly is intended to return w as a column vector, e.g., as a numpy matrix. The
following interpreter session shows how to create and transpose matrices:
>>> import numpy
>>> m = numpy.matrix([1,2,3]) # this creates a matrix with one row
>>> m
matrix([[1, 2, 3]])
>>> m.transpose() # transposition returns a column vector
matrix([[1],
        [2],
        [3]])
>>> numpy.matrix([[1],[2],[3]]) # or we could specify the column directly
matrix([[1],
        [2],
        [3]])
The above isn’t literally the code you’d use, but it shows how you can create and transpose matrices, with
row and column vectors simply being limiting cases of matrices: one row, or 3 rows with 1 value
each, respectively. You will want to look at the numpy package, including the linear algebra package
numpy.linalg, for linear algebra operators. Also, you don’t have to use fitpoly_incomplete.py; you
can write your own script from scratch. Whichever you choose, your script must have the ability to
plot the data with the fitted model.
Just to state the obvious: the objective of this exercise is for you to implement the linear least squares fit
solution (i.e., the normal equations) – in their general matrix form. DO NOT use existing least squares
solvers, such as numpy.linalg.lstsq or scikit-learn’s sklearn.linear_model.LinearRegression;
however, it is certainly fine to use these to help verify your implementation’s output.
You will submit your script as a stand-alone file called fitpoly.py.
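To make the general matrix form concrete, here is a minimal sketch of the normal equations, $(\mathbf{X}^\top \mathbf{X})\mathbf{w} = \mathbf{X}^\top \mathbf{t}$, for a polynomial of arbitrary order. The names design_matrix and fitpoly_sketch are hypothetical and are not taken from the provided script. The sketch solves the linear system with numpy.linalg.solve instead of forming an explicit inverse; whether that counts as acceptable for the assignment is an assumption to confirm.

```python
import numpy as np

def design_matrix(x, model_order):
    """Build X with columns x^0, x^1, ..., x^model_order (one row per data point)."""
    x = np.asarray(x, dtype=float).ravel()
    return np.vander(x, model_order + 1, increasing=True)

def fitpoly_sketch(x, t, model_order):
    """Solve the normal equations (X^T X) w = X^T t; w is returned as a column."""
    X = design_matrix(x, model_order)
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    return np.linalg.solve(X.T @ X, X.T @ t)

# Usage: noise-free data generated from t = 1 + 2x
w = fitpoly_sketch([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], 1)
```

On this noise-free toy data the fitted w should match the generating weights up to rounding, which makes it a convenient sanity check before moving to the real data.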
Solution:
2. [1 point] Adapted from Exercise 1.6 of FCMA p.35:
Table 1.3 (p.13) of FCMA lists the women’s 100m gold medal Olympic results – this data is provided
in the file womens100.csv in the data folder. Using your script from problem 1, find the 1st-order
polynomial model (i.e., a line with parameters w0 and w1) that minimizes the squared loss of this data.
Report the model here as a linear equation and also include a figure that plots your model with the
data.
Solution:
3. [1 point] Adapted from Exercise 1.7 of FCMA p.35:
Using the model obtained in the previous exercise, predict the women’s winning time at the 2012 and
2016 Olympic games; report the values here. What is the squared error of your model’s prediction for
Shelly-Ann Fraser-Pryce’s gold medal time of 10.75 seconds in the 2012 Olympics women’s 100m race?
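The arithmetic involved can be sketched as follows. The helper names predict_line and squared_error are hypothetical, and the weights below are placeholders chosen only to make the example run; they are not the fitted values for womens100.csv.

```python
def predict_line(w0, w1, x):
    """Evaluate the first-order model t = w0 + w1 * x."""
    return w0 + w1 * x

def squared_error(t_true, t_pred):
    """Squared loss for a single prediction."""
    return (t_true - t_pred) ** 2

# Placeholder weights, NOT the fitted values for womens100.csv
w0, w1 = 20.0, -0.005
t_pred = predict_line(w0, w1, 2012)   # model prediction for the year 2012
err = squared_error(10.75, t_pred)    # compared against the 10.75 s result
```

With your own fitted w0 and w1 substituted in, the same two calls give the prediction and its squared error.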
Solution:
4. [1 point] Adapted from Exercise 1.9 of FCMA p.36:
Use your python script from problem 1 to load the data stored in the file synthdata2014.csv (in the
data folder). Fit a 3rd-order polynomial function, $f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$, to this data
(if you extended the fitpoly_incomplete.py script, then model_order = 3). Present and describe the
parameters you obtain fitting to this data. Also, plot the data and your linear-fit model and include
the plot in your answer.
Solution:
5. [3 points] Write a script that implements K-fold cross-validation to choose the polynomial order (between
orders 0 and 7) with the best predictive error on synthdata2014.csv. The provided script
cv_demo_incomplete.py implements the synthetic data experiment described in Ch 1 (pp.31-32) of
the book; you are welcome to use and adapt any part of this code you like, but keep in mind that this
script won’t successfully execute until you add the general (matrix form) normal equation calculation
on line 81. Note also that in the synthetic data experiment in cv_demo_incomplete.py, 1000 test data
points are generated in addition to the 100 data points used for 10-fold cross-validation; for problem
5, you won’t have this independent test set, only the data from synthdata2014.csv on which to perform
K-fold cross-validation.
Run your script with 10-fold cross-validation and Leave-One-Out-CV (LOOCV). Which model order
do the two cross-validation methods predict as the best order for predictive accuracy? Do the two
different cross-validation runs agree?
Report the best-fit model parameters for the best model order according to 10-fold CV, and plot this
model with the data. Include a plot of the CV-loss and training loss for the 8 different (0..7) polynomial
model orders for both the 10-fold cross-validation and for LOOCV.
You will submit your script as a stand-alone file called cv.py.
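One possible way to organize the fold bookkeeping is sketched below. The function kfold_indices is hypothetical (it does not come from the provided demo script) and splits the indices into contiguous folds; if the data file is ordered in some meaningful way, shuffling the indices first is probably advisable.

```python
def kfold_indices(n_points, k):
    """Split range(n_points) into k contiguous folds of near-equal size.

    Returns a list of (train_idx, val_idx) pairs; k == n_points gives LOOCV.
    """
    if not 1 < k <= n_points:
        raise ValueError("k must satisfy 1 < k <= n_points")
    base, extra = divmod(n_points, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more point
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_points))
        folds.append((train, val))
        start += size
    return folds

splits = kfold_indices(10, 5)   # 5-fold split of 10 points
loocv = kfold_indices(4, 4)     # leave-one-out on 4 points
```

For each model order, the CV loss is then the average validation loss over the folds, with the model refit on each fold's training indices.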
Solution:
6. [2 points – Required only for Graduates] Exercise 1.10 from FCMA p.36
Derive the optimal least squares parameter value, $\hat{\mathbf{w}}$, for the total training loss:
$$L = \sum_{n=1}^{N} \left(t_n - \mathbf{w}^\top \mathbf{x}_n\right)^2$$
How does the expression compare with that derived from the average (mean) loss? (Hint: Express this
loss in the full matrix form and derive the normal equation.)
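Following the hint, one way to set up the derivation is sketched below. It assumes the usual convention that $\mathbf{X}$ stacks the $\mathbf{x}_n^\top$ as rows, $\mathbf{t}$ stacks the targets, and $\mathbf{X}^\top \mathbf{X}$ is invertible.

```latex
\begin{align}
L &= \sum_{n=1}^{N} \left(t_n - \mathbf{w}^\top \mathbf{x}_n\right)^2
   = (\mathbf{t} - \mathbf{X}\mathbf{w})^\top (\mathbf{t} - \mathbf{X}\mathbf{w}) \\
\frac{\partial L}{\partial \mathbf{w}}
  &= 2\,\mathbf{X}^\top \mathbf{X}\mathbf{w} - 2\,\mathbf{X}^\top \mathbf{t} = \mathbf{0}
  \quad\Longrightarrow\quad
  \hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{t}
\end{align}
```

The mean loss differs from this total loss only by the constant factor $1/N$, which cancels once the gradient is set to zero.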
Solution:
7. [2 points – Required only for Graduates] Exercise 1.11 from FCMA p.36
The following expression is known as the weighted average loss:
$$L = \frac{1}{N} \sum_{n=1}^{N} \alpha_n \left(t_n - \mathbf{w}^\top \mathbf{x}_n\right)^2$$
where the influence of each data point is controlled by its associated parameter $\alpha_n$. Assuming that each
$\alpha_n$ is fixed, derive the optimal least squares parameter value $\hat{\mathbf{w}}$. (Hint: When expressing in the full
matrix form, the alphas become a matrix…)
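As a sketch of where the hint leads, assume the alphas are collected into a diagonal matrix, here called $\mathbf{A}$ (a name introduced for illustration), and that $\mathbf{X}^\top \mathbf{A} \mathbf{X}$ is invertible:

```latex
\begin{align}
L &= \frac{1}{N} \sum_{n=1}^{N} \alpha_n \left(t_n - \mathbf{w}^\top \mathbf{x}_n\right)^2
   = \frac{1}{N} (\mathbf{t} - \mathbf{X}\mathbf{w})^\top \mathbf{A}\, (\mathbf{t} - \mathbf{X}\mathbf{w}),
   \qquad \mathbf{A} = \mathrm{diag}(\alpha_1, \dots, \alpha_N) \\
\frac{\partial L}{\partial \mathbf{w}}
  &= \frac{2}{N} \left( \mathbf{X}^\top \mathbf{A} \mathbf{X}\mathbf{w} - \mathbf{X}^\top \mathbf{A}\, \mathbf{t} \right) = \mathbf{0}
  \quad\Longrightarrow\quad
  \hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{A} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{A}\, \mathbf{t}
\end{align}
```

Setting every $\alpha_n = 1$ makes $\mathbf{A}$ the identity and recovers the ordinary least squares solution, which is a quick consistency check.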
Solution:
8. [4 points] Variant of Exercise 1.12 from FCMA p.36
Write a new Python script that uses K-fold cross-validation to find the value of λ that gives the
(approximate) best predictive performance on the synthetic data (synthdata2014.csv) using regularized
least squares with a 7th-order polynomial model. Report here the λ you identified and the best-fit
linear model using that λ (as a linear equation), and include a plot of the MSE loss as a function
of λ that shows the loss curve with a minimum, and a plot of the best-fit model with the data.
You will submit the script as a stand-alone file called regularize.py.
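A minimal sketch of the regularized fit is given below. It assumes the loss is the mean squared error plus $\lambda \mathbf{w}^\top \mathbf{w}$, which leads to $\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X} + N\lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{t}$; if your loss is defined without the $1/N$ factor, the $N$ in front of $\lambda$ drops out. The function name fit_regularized is hypothetical, and the toy data is synthetic, not taken from the assignment files.

```python
import numpy as np

def fit_regularized(x, t, model_order, lam):
    """Solve (X^T X + N*lam*I) w = X^T t for the column vector w.

    Assumes the loss is mean squared error plus lam * w^T w.
    """
    x = np.asarray(x, dtype=float).ravel()
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    X = np.vander(x, model_order + 1, increasing=True)
    n_points, n_params = X.shape
    A = X.T @ X + n_points * lam * np.eye(n_params)
    return np.linalg.solve(A, X.T @ t)

# Toy data from t = 1 + 2x (illustrative only)
x = np.linspace(-1.0, 1.0, 20)
t = 1.0 + 2.0 * x
w_plain = fit_regularized(x, t, 1, 0.0)    # lam = 0 reduces to ordinary least squares
w_shrunk = fit_regularized(x, t, 1, 10.0)  # a large lam shrinks the weights toward zero
```

Sweeping lam over a grid (for example, logarithmically spaced values) and recording the cross-validation loss for each is one way to produce the loss-versus-λ curve the problem asks for.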
Solution: