Description
0.1 Instructions
1. Follow the honor code of the institute while doing any assignment. Any violation of it
will be taken very seriously.
2. You may consult or discuss with any of your friends to develop the solution strategy. You
may also take help from a friend in setting up your machine. However, the final solution
and code must be written by you from scratch, and you must not copy even a single
bit of it from others. Acknowledge any help taken from your friend(s) in a
comment section at the top of your code.
3. You will be required to submit one single .py file for the entire Assignment 1. The
submission needs to be done via Canvas only.
4. You should name the file as follows: RollNumber assignment1.py . Files not following
this naming convention will not be evaluated.
5. The submission should be done by 11:59 PM on the due date. Late submissions will be
penalized.
6. All plots should be properly titled. The axes should have proper titles and markers.
In every plot, choose the width of curves and markers (if any) so that
the plot is clearly legible. Further, highlight gridlines or add additional lines wherever it
makes sense and adds value to the plot.
7. For all the plots in Assignment 1, test error should be plotted in red, training
error in blue, and the R² score in black.
8. For any clarification on the problem definition or on what you need to do in this
assignment, you can contact our TAs (Rachit and Harsha) by email through
Canvas. You can also post your queries in the announcement section of Canvas so that
your friends or the TAs can answer them. Feel free to answer the queries of
others on Canvas as well (but don't provide the solution).
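The plotting conventions in items 6 and 7 can be sketched as follows. This is only an illustration of the styling, not a required template: the numbers are hypothetical placeholder values rather than results, and the titles, marker shapes, and line widths are assumptions you are free to change.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt

# Hypothetical placeholder values, used only to demonstrate the styling.
train_sizes = [50, 60, 70, 80, 90]
train_err = [20.0, 21.0, 21.5, 22.0, 22.2]
test_err = [27.0, 26.0, 25.0, 24.5, 24.0]
r2 = [0.66, 0.68, 0.70, 0.71, 0.72]

fig, (ax_err, ax_r2) = plt.subplots(1, 2, figsize=(10, 4))

# Item 7: training error in blue, test error in red.
ax_err.plot(train_sizes, train_err, color="blue", marker="o",
            linewidth=2, label="Training error")
ax_err.plot(train_sizes, test_err, color="red", marker="s",
            linewidth=2, label="Test error")
ax_err.set_title("Error vs. training set size")
ax_err.set_xlabel("Training set size (%)")
ax_err.set_ylabel("Error")
ax_err.grid(True)
ax_err.legend()

# Item 7: R² score in black.
ax_r2.plot(train_sizes, r2, color="black", marker="o", linewidth=2, label="R² score")
ax_r2.set_title("R² score vs. training set size")
ax_r2.set_xlabel("Training set size (%)")
ax_r2.set_ylabel("R² score")
ax_r2.grid(True)
ax_r2.legend()

fig.tight_layout()
```

To view the figure, call fig.savefig with a file name of your choice, or drop the backend line and call plt.show().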
0.2 Setting up Your System
Set up your system for this as well as all future assignments:
• Install Anaconda on your machine. You can get more details from the Assignment0
document already uploaded on canvas.
• Install any IDE of your choice (recommended: Sublime Text)
0.3 Familiarity with Python
All the assignments should be done in Python only. You may want to watch the videos from
the following playlists (or any other playlists of your choice) to acquaint yourself with the Python
prerequisites.
1. Corey Schafer
2. Sentdex – Python 3
3. Sentdex – Machine Learning with Python
Specifically, you will most often need the following Python topics
to complete the assignments: lists, NumPy, and SciPy. It is advisable to familiarize
yourself with these topics thoroughly. Finally, you may read the Sklearn documentation
for help on its inbuilt APIs for machine learning tasks.
0.4 Problem Set for Assignment 1
Load the Boston housing dataset from the Sklearn datasets and write Python code to accomplish the following tasks:
1. Plot a curve with the number of training examples on the X-axis and the training and
test error (both on the same plot, in the colors specified in the instructions) on the
Y-axis for Least Squares Regression (without regularization, i.e. λ = 0). You can use an
inbuilt function to fit the Least Squares Regression model. Use data normalization. Use
50%, 60%, 70%, 80%, 90%, 95%, and 99% as the training set sizes for this curve; the test
set is the remaining part of the dataset. Further, make another plot with the R² score on
the Y-axis and the number of training examples on the X-axis for the same set of
experiments. Also repeat the same experiment with l2-regularization for λ = 0.01, 0.1, 1,
using the inbuilt function for l2-regularization. Use a grid of subplots in matplotlib to
show all these plots. This grid will have 2 rows and 4 columns, displaying the training
and test error plots in row 1 (one plot for each value of λ = 0, 0.01, 0.1, 1) and the
R² score plots in row 2 (one plot for each value of λ = 0, 0.01, 0.1, 1). Keep the scale
of the X and Y axes the same on all plots. Also, shuffle the data every time you take a
certain number of training examples.
2. Plot a curve with the value of λ in Ridge regression on the X-axis and the training and
test error (on the same plot) on the Y-axis. Fit the ridge regression model using the Sklearn
inbuilt function Ridge for l2-regularization. Use normalization. Vary the
values of λ as [0, 0.0001, 0.001, 0.01, 0.1, 1, 1.5, 2, 3, 4, 5]. Make such a plot
separately for each of the following training set sizes: 99%, 90%, 80%, and 70%. Keep
the scale of the X and Y axes the same on all plots. Furthermore, on each of these plots,
also plot the validation error by using the inbuilt cross validation function in
Sklearn (namely, RidgeCV; you are encouraged to read the documentation for the RidgeCV
function). Use 5 folds for the cross validation, and use the same set of λ values
for the validation error plots. The validation error curve should be in the same
color as the test error but drawn with a dotted line, so there will be 3 curves on each plot. Remember
that the test error is calculated on the test set, whereas the validation error has nothing to
do with the test set. Also make separate plots with the value of λ on the X-axis and the R² score
on the Y-axis for each of the training set percentages mentioned above, with
λ between 0 and 5 varied as shown above. Again use a matplotlib
grid to show all these plots. Your grid will have 2 rows and 4 columns, displaying
the training, test, and validation error plots in row 1 for the different numbers of training
examples and the R² score plots in row 2.
3. Repeat the process in question 2 with l1 -regularization by using inbuilt functions for
LASSO regression and LASSO regression with cross validation.
4. Repeat the process in questions 1 and 2 but now use your own code for data normalization,
fitting the least squares regression model and the ridge regression model, and calculating
the training error, test error, and R² score. You do not need to do cross validation for this question.
5. Repeat the tasks in questions 1 through 4 for the diabetes dataset.
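As an orientation for the tasks above, here is a minimal sketch of a single experiment (one training set size, one value of λ), not a full solution. It makes several assumptions that you should check against the problem statement: "error" is taken to be mean squared error, λ is mapped directly to the alpha parameter of Sklearn's Ridge, normalization means standardizing features with training-set statistics, and the helper names split_and_normalize and ridge_closed_form are hypothetical. It uses the diabetes dataset from task 5, because load_boston was removed from Sklearn in version 1.2 and may not be available in your installation.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_normalize(X, y, train_fraction, seed=0):
    """Shuffle, split, and standardize using training-set statistics only."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_fraction, shuffle=True, random_state=seed
    )
    scaler = StandardScaler().fit(X_tr)
    return scaler.transform(X_tr), scaler.transform(X_te), y_tr, y_te


def ridge_closed_form(X_tr, y_tr, lam):
    """Solve (X^T X + lam * I) w = X^T (y - mean(y)); the intercept is not penalized.

    Assumes the columns of X_tr already have zero mean (true after standardization).
    """
    y_mean = y_tr.mean()
    A = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
    w = np.linalg.solve(A, X_tr.T @ (y_tr - y_mean))
    return w, y_mean


X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = split_and_normalize(X, y, train_fraction=0.8)

# Inbuilt route (tasks 1 and 2): fit Ridge and compute the three quantities to plot.
lam = 1.0
model = Ridge(alpha=lam).fit(X_tr, y_tr)
train_mse = mean_squared_error(y_tr, model.predict(X_tr))
test_mse = mean_squared_error(y_te, model.predict(X_te))
test_r2 = r2_score(y_te, model.predict(X_te))

# Own-code route (task 4): closed-form ridge on the same normalized data.
w, b = ridge_closed_form(X_tr, y_tr, lam)
own_test_mse = np.mean((y_te - (X_te @ w + b)) ** 2)

print(f"train MSE={train_mse:.1f}  test MSE={test_mse:.1f}  test R2={test_r2:.3f}")
print("closed form matches Ridge:", np.allclose(w, model.coef_, atol=1e-4))
```

Looping this over the required training set sizes and λ values, with a fresh shuffle for each run, yields the points for each curve. Comparing your own implementation against the inbuilt one, as in the last line, is a convenient sanity check.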