## Description

0.1 Instructions

1. Follow the honor code of the institute while doing any assignment. Any violation will be taken seriously.

2. You may discuss the solution strategy with your friends, and you may take their help in setting up your machine. However, the final solution and code must be written by you from scratch; do not copy even a single bit of it from others. Acknowledge any help taken from your friend(s) in a comment at the top of your code.

3. You will be required to submit one single .py file for the entire Assignment 1. The submission needs to be done via Canvas only.

4. You should name the file as follows: RollNumber assignment1.py. Files not following this naming convention will not be evaluated.

5. The submission should be done by 11:59 PM on the due date. Late submissions will be penalized.

6. All plots should be properly titled, and the axes should have proper titles and markers. In every plot, choose the width of the curves and the size of the markers (if any) so that the plot is clearly visible. Further, highlight gridlines or additional lines wherever it makes sense and adds value to the plot.

7. For all plots in Assignment 1, the test error should be plotted in red, the training error in blue, and the R² score in black.

8. For any clarification on the problem definition or on what you need to do in this assignment, contact the TAs (Rachit and Harsha) by email through Canvas. You can also post your queries in the announcement section of Canvas for your friends or the TAs to answer. Feel free to answer the queries of others on Canvas, but do not provide the solution.
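The plotting conventions in items 6 and 7 can be sketched as follows. This is only a minimal illustration with matplotlib: the numbers are made-up placeholders (not results from any dataset), and the titles, line widths, and marker sizes are assumptions you should adapt to your own plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt

# Placeholder values for illustration only; these are NOT results from any dataset.
n_train = [250, 300, 350, 400, 450]
train_error = [20.0, 20.5, 21.0, 21.2, 21.5]
test_error = [27.0, 25.5, 24.5, 24.0, 23.5]
r2_values = [0.62, 0.65, 0.67, 0.68, 0.69]

fig, (ax_err, ax_r2) = plt.subplots(1, 2, figsize=(11, 4))

# Training error in blue, test error in red, both on the same axes.
ax_err.plot(n_train, train_error, color="blue", linewidth=2,
            marker="o", markersize=6, label="Training error")
ax_err.plot(n_train, test_error, color="red", linewidth=2,
            marker="s", markersize=6, label="Test error")
ax_err.set_title("Error vs. number of training examples")
ax_err.set_xlabel("Number of training examples")
ax_err.set_ylabel("Error")
ax_err.grid(True, linestyle="--", alpha=0.6)
ax_err.legend()

# R2 score in black on its own axes.
ax_r2.plot(n_train, r2_values, color="black", linewidth=2,
           marker="o", markersize=6, label="R2 score")
ax_r2.set_title("R2 score vs. number of training examples")
ax_r2.set_xlabel("Number of training examples")
ax_r2.set_ylabel("R2 score")
ax_r2.grid(True, linestyle="--", alpha=0.6)
ax_r2.legend()

fig.tight_layout()
```

To view the figure, call `plt.show()` with an interactive backend or save it with `fig.savefig(...)`.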

0.2 Setting up Your System

Set up your system for this as well as all future assignments:

• Install Anaconda on your machine. You can get more details from the Assignment0 document already uploaded on Canvas.

• Install any IDE of your choice (recommended: Sublime Text)

0.3 Familiarity with Python

All the assignments should be done in Python only. You may want to watch the videos from the following playlists (or any other playlist of your choice) to acquaint yourself with the Python prerequisites.

1. Corey Schafer

2. Sentdex – Python 3

3. Sentdex – Machine Learning with Python

Specifically, you will most often need the following Python topics to complete the assignments: lists, NumPy, and SciPy. It is advisable to familiarize yourself properly with these topics. Finally, you may read the Sklearn documentation for help on its inbuilt APIs for machine learning tasks.

0.4 Problem Set for Assignment 1

Load the Boston housing dataset from the Sklearn datasets and write Python code to accomplish the following tasks:

1. Plot a curve with the number of training examples on the X-axis and the training and test error (both on the same plot, in the specified colors) on the Y-axis for least squares regression (without regularization, i.e. λ = 0). You can use an inbuilt function to fit the least squares regression model. Use data normalization. Use 50%, 60%, 70%, 80%, 90%, 95%, and 99% of the dataset as the different training set sizes for this curve; the test set is the remaining part of the dataset. Also make another plot with the R² score on the Y-axis and the number of training examples on the X-axis for the same set of experiments. Then repeat the same experiment with l2-regularization for λ = 0.01, 0.1, and 1, using the inbuilt function for l2-regularization. Show all these plots in a matplotlib grid with 2 rows and 4 columns: row 1 displays the training and test error plots and row 2 displays the R² score plots, with one column for each value of λ = 0, 0.01, 0.1, 1. Keep the scale of the X and Y axes the same on all plots. Also, shuffle the data every time you take a certain number of training examples.
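The computation behind task 1 can be sketched as below. This is one possible reading, not a reference solution. `load_boston` was removed from scikit-learn in version 1.2, so the sketch runs on the bundled diabetes dataset (the one used in task 5) instead. It also assumes that "error" means mean squared error, that "normalization" means standardization fitted on the training split only, and that the R² score is computed on the test split. The helper name `error_vs_train_size` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

TRAIN_FRACTIONS = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
LAMBDAS = [0, 0.01, 0.1, 1]


def error_vs_train_size(X, y, lam, seed=0):
    """Hypothetical helper: one row (n_train, train_mse, test_mse, test_r2) per fraction."""
    rows = []
    for i, frac in enumerate(TRAIN_FRACTIONS):
        # A different random_state per size reshuffles the data before every split.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=frac, shuffle=True, random_state=seed + i
        )
        # Assumption: "normalization" means standardizing with training-split statistics.
        scaler = StandardScaler().fit(X_tr)
        X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
        # lam = 0 is plain least squares; otherwise use the inbuilt l2-regularized model.
        model = LinearRegression() if lam == 0 else Ridge(alpha=lam)
        model.fit(X_tr, y_tr)
        rows.append((
            len(y_tr),
            mean_squared_error(y_tr, model.predict(X_tr)),
            mean_squared_error(y_te, model.predict(X_te)),
            r2_score(y_te, model.predict(X_te)),
        ))
    return np.array(rows)


X, y = load_diabetes(return_X_y=True)
curves = {lam: error_vs_train_size(X, y, lam) for lam in LAMBDAS}
for lam, rows in curves.items():
    print(f"lambda={lam}: train sizes {rows[:, 0].astype(int).tolist()}")
```

Each array in `curves` supplies one column of the 2 × 4 grid. A grid created with `plt.subplots(2, 4, sharex=True, sharey="row")` is one way to keep the axis scales aligned; sharing the Y-axis per row rather than across the whole grid is an interpretation, since errors and R² scores live on different scales.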

2. Plot a curve with the value of λ in ridge regression on the X-axis and the training and test error (on the same plot) on the Y-axis. Fit the ridge regression model using the Sklearn inbuilt function Ridge for l2-regularization. Use normalization. Vary λ over [0, 0.0001, 0.001, 0.01, 0.1, 1, 1.5, 2, 3, 4, 5]. Make such a plot separately for each of the following training set sizes: 99%, 90%, 80%, and 70%. Keep the scale of the X and Y axes the same on all plots. Furthermore, on each of these plots, also plot the validation error using the inbuilt cross-validation function in Sklearn (namely RidgeCV; you are encouraged to read its documentation). Use 5 folds for the cross-validation and the same set of λ values for the validation error plots. Draw the validation error in the same color as the test error but with a dotted line, so that there are 3 curves on each plot. Remember that the test error is calculated on the test set, whereas the validation error has nothing to do with the test set. Also make separate plots with the value of λ on the X-axis and the R² score on the Y-axis for each of the training set sizes mentioned above, with λ varying between 0 and 5 as listed above. Again use a matplotlib grid to show all these plots: the grid will have 2 rows and 4 columns, with the training, test, and validation error plots in row 1 (one per training set size) and the R² score plots in row 2.
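The three error curves of task 2 can be sketched for a single training set size as below. Several choices here are assumptions rather than requirements of the task: the bundled diabetes dataset stands in for Boston (`load_boston` was removed from scikit-learn in version 1.2), the error is taken to be mean squared error, and the per-λ validation error is computed with `cross_val_score`, because RidgeCV with an explicit `cv` mainly exposes the selected `alpha_` rather than a full error curve. RidgeCV is still used to pick the best λ, but only from the positive values, since recent versions may reject λ = 0.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

LAMBDAS = [0, 0.0001, 0.001, 0.01, 0.1, 1, 1.5, 2, 3, 4, 5]

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.90, shuffle=True, random_state=0
)

train_err, test_err, val_err = [], [], []
for lam in LAMBDAS:
    # The pipeline refits the scaler inside every CV fold.
    model = make_pipeline(StandardScaler(), Ridge(alpha=lam))
    # Validation error: 5-fold CV on the training split only; the test split is untouched.
    scores = cross_val_score(
        model, X_tr, y_tr, cv=5, scoring="neg_mean_squared_error"
    )
    val_err.append(-scores.mean())
    model.fit(X_tr, y_tr)
    train_err.append(mean_squared_error(y_tr, model.predict(X_tr)))
    test_err.append(mean_squared_error(y_te, model.predict(X_te)))

train_err, test_err, val_err = map(np.array, (train_err, test_err, val_err))

# RidgeCV selects one lambda by 5-fold CV. lambda = 0 is left out because
# RidgeCV may reject non-positive values (an assumption about recent versions).
positive_lambdas = [lam for lam in LAMBDAS if lam > 0]
selector = make_pipeline(StandardScaler(), RidgeCV(alphas=positive_lambdas, cv=5))
selector.fit(X_tr, y_tr)
best_lambda = selector.named_steps["ridgecv"].alpha_
print("lambda selected by RidgeCV:", best_lambda)
```

When plotting, the validation curve can reuse the test-error color with a dotted line, for example `ax.plot(LAMBDAS, val_err, color="red", linestyle=":")`.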

3. Repeat the process in question 2 with l1-regularization, using the inbuilt functions for LASSO regression and LASSO regression with cross-validation.
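For the l1 case in task 3, a minimal sketch with `Lasso` and `LassoCV` follows. It makes the same assumptions as before (diabetes dataset as a stand-in, mean squared error, standardization with training-split statistics) and restricts λ to the positive values, because scikit-learn advises against `Lasso` with `alpha=0`. Note also that scikit-learn's Lasso objective divides the squared error by 2n, so its `alpha` may not match the λ of your course notes exactly.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: lambda = 0 is handled separately (e.g. with plain least squares).
POSITIVE_LAMBDAS = [0.0001, 0.001, 0.01, 0.1, 1, 1.5, 2, 3, 4, 5]

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.90, shuffle=True, random_state=0
)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

train_err, test_err = [], []
for lam in POSITIVE_LAMBDAS:
    model = Lasso(alpha=lam, max_iter=10000).fit(X_tr_s, y_tr)
    train_err.append(mean_squared_error(y_tr, model.predict(X_tr_s)))
    test_err.append(mean_squared_error(y_te, model.predict(X_te_s)))

# LassoCV keeps the per-fold validation MSE for every lambda in mse_path_,
# an array of shape (n_alphas, n_folds) aligned with alphas_.
lasso_cv = LassoCV(alphas=POSITIVE_LAMBDAS, cv=5, max_iter=10000).fit(X_tr_s, y_tr)
order = np.argsort(lasso_cv.alphas_)  # sort ascending, whatever order alphas_ is stored in
val_lambdas = lasso_cv.alphas_[order]
val_err = lasso_cv.mse_path_.mean(axis=1)[order]
print("lambda selected by LassoCV:", lasso_cv.alpha_)
```

Scaling once before `LassoCV` is a simplification: the folds then share the scaler's statistics, which a pipeline-based cross-validation would avoid.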

4. Repeat the process in questions 1 and 2, but now use your own code for data normalization, for fitting the least squares regression model and the ridge regression model, and for calculating the training error, test error, and R² score. You do not need to do cross-validation for this question.

5. Repeat the tasks in questions 1 through 4 for the diabetes dataset.
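The building blocks that task 4 asks you to write yourself can be sketched with NumPy alone. This is a sketch under stated assumptions, not a reference solution: all function names are hypothetical, "normalization" is taken to mean standardization with training-set statistics, the error is mean squared error, the intercept is not penalized, and the ridge objective is the sum of squared errors plus λ times the squared norm of the weights. The data are synthetic and serve only to exercise the code.

```python
import numpy as np


def standardize(X_train, X_test):
    """Scale both splits with the mean and standard deviation of the training split."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (X_train - mean) / std, (X_test - mean) / std


def fit_ridge(X, y, lam=0.0):
    """Closed-form ridge fit with an unpenalized intercept; lam = 0 gives least squares."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    A = Xc.T @ Xc + lam * np.eye(X.shape[1])
    # lstsq also copes with a singular A, which can happen when lam = 0.
    w = np.linalg.lstsq(A, Xc.T @ yc, rcond=None)[0]
    b = y_mean - x_mean @ w
    return w, b


def predict(X, w, b):
    return X @ w + b


def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))


def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic data for illustration only (not one of the assignment datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=200)

# Shuffle, then use 80% of the examples for training.
order = rng.permutation(len(y))
n_train = int(0.8 * len(y))
train_idx, test_idx = order[:n_train], order[n_train:]
X_tr, X_te = standardize(X[train_idx], X[test_idx])
y_tr, y_te = y[train_idx], y[test_idx]

w_ols, b_ols = fit_ridge(X_tr, y_tr, lam=0.0)
w_ridge, b_ridge = fit_ridge(X_tr, y_tr, lam=10.0)
test_r2 = r2(y_te, predict(X_te, w_ols, b_ols))
print("train MSE:", mse(y_tr, predict(X_tr, w_ols, b_ols)))
print("test MSE:", mse(y_te, predict(X_te, w_ols, b_ols)))
```

Comparing these outputs with the inbuilt `LinearRegression` and `Ridge` on the same split is a quick sanity check, keeping in mind that the scaling of λ may differ from your course's definition.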
