Description
HW05: Practice with algorithm selection, grid search, cross validation, multiclass classification, one-class classification, imbalanced data, and model selection.
[Please put your name and NetID here.]
Hello Students:
Start by downloading HW05.ipynb from this folder. Then develop it into your solution.
- Write code where you see “… your code here …” below. (You are welcome to use more than one cell.)
- If you have questions, please ask them in class, office hours, or piazza. Our TA and I are very happy to help with the programming (provided you start early enough, and provided we are not helping so much that we undermine your learning).
- When you are done, run these Notebook commands:
- Shift-L (once, so that line numbers are visible)
- Kernel > Restart and Run All (run all cells from scratch)
- Esc S (save)
- File > Download as > HTML
- Turn in:
- HW05.ipynb to Canvas's HW05.ipynb assignment
- HW05.html to Canvas's HW05.html assignment
- As a check, download your files from Canvas to a new ‘junk’ folder. Try ‘Kernel > Restart and Run All’ on the ‘.ipynb’ file to make sure it works. Glance through the ‘.html’ file.
- Turn in partial solutions to Canvas before the deadline. e.g. Turn in part 1, then parts 1 and 2, then your whole solution. That way we can award partial credit even if you miss the deadline. We will grade your last submission before the deadline.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import mixture
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import svm, linear_model, datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
accuracy_score, roc_auc_score, RocCurveDisplay)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import RandomOverSampler
1. Algorithm selection for multiclass classification by optical recognition of handwritten digits
The digits dataset has 1797 labeled images of hand-written digits.
- 𝑋 = digits.data has shape (1797, 64). Each image 𝑥𝑖 is represented as the 𝑖th row of 64 pixel values in the 2D digits.data array, corresponding to an 8×8 photo of a handwritten digit.
- 𝑦 = digits.target has shape (1797,). Each 𝑦𝑖 is a number from 0 to 9 indicating the handwritten digit that was photographed and stored in 𝑥𝑖.
1(a) Load the digits dataset and split it into training, validation, and test sets as I did in the lecture example code 07ensemble.html.
This step does not need to display any output.
# ... your code here ...
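A minimal sketch of one way to make this split. The 60/20/20 proportions and random_state below are assumptions; adjust them to match the lecture example in 07ensemble.html.

digits = datasets.load_digits()
X, y = digits.data, digits.target

# hold out 20% as a test set, then carve a validation set out of the remainder
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval)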
1(b) Use algorithm selection on training and validation data to choose a best classifier.
Loop through these four classifiers and corresponding parameters, doing a grid search to find the best hyperparameter setting. Use only the training data for the grid search.
- SVM:
  - Try all values of kernel in 'linear', 'rbf'.
  - Try all values of C in 0.01, 1, 100.
- logistic regression:
  - Use max_iter=5000 to avoid a nonconvergence warning.
  - Try all values of C in 0.01, 1, 100.
- ID3 decision tree:
  - Use criterion='entropy' to get our ID3 tree.
  - Try all values of max_depth in 1, 3, 5, 7.
- kNN:
  - Use the default Euclidean distance.
  - Try all values of n_neighbors in 1, 2, 3, 4.
Hint:
- Make a list of the four classifiers without setting any hyperparameters.
- Make a list of four corresponding parameter dictionaries.
- Loop through 0, 1, 2, 3:
  - Run grid search on the 𝑖th classifier with the 𝑖th parameter dictionary on the training data. (The grid search does its own cross-validation using the training data.)
  - Use the 𝑖th classifier with its best hyperparameter settings (just clf from clf = GridSearchCV(...)) to find the accuracy of the model on the validation data, i.e. find clf.score(X_valid, y_valid).
  - Keep track, as your loop progresses, of:
    - the index 𝑖 of the best classifier (initialize it to -1 or some other value)
    - the best accuracy score on validation data (initialize it to -np.inf)
    - the best classifier with its hyperparameter settings, that is, the best clf from clf = GridSearchCV(...) (initialize it to None or some other value)
I needed about 30 lines of code to do this. It took a minute to run.
# ... your code here ...
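A rough sketch of the loop described in the hint. Variable names are illustrative; it assumes X_train, y_train, X_valid, y_valid from 1(a) and the imports at the top of the notebook.

classifiers = [svm.SVC(),
               linear_model.LogisticRegression(max_iter=5000),
               DecisionTreeClassifier(criterion='entropy'),
               KNeighborsClassifier()]
param_grids = [{'kernel': ['linear', 'rbf'], 'C': [0.01, 1, 100]},
               {'C': [0.01, 1, 100]},
               {'max_depth': [1, 3, 5, 7]},
               {'n_neighbors': [1, 2, 3, 4]}]

best_i, best_score, best_clf = -1, -np.inf, None
for i in range(4):
    clf = GridSearchCV(classifiers[i], param_grids[i])  # cross-validates on the training data
    clf.fit(X_train, y_train)
    score = clf.score(X_valid, y_valid)                  # compare tuned models on validation data
    if score > best_score:
        best_i, best_score, best_clf = i, score, clf
print(f'best classifier index: {best_i}, validation accuracy: {best_score:.3f}')
print(best_clf.best_params_)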
1(c) Use the test data to evaluate your best classifier and its hyperparameter settings from 1(b).
- Report the result of calling .score(X_test, y_test) on your best classifier/hyperparameters.
- Make a confusion matrix from the true y_test values and the corresponding 𝑦^ values predicted by your best classifier/hyperparameters on X_test.
- For each of the wrong predictions (where y_test and your 𝑦^ values disagree), show:
  - The index 𝑖 in the test data of that example 𝑥𝑖
  - The correct label 𝑦𝑖
  - Your incorrect prediction 𝑦^𝑖
  - A plot of that image
# ... your code here ...
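A minimal sketch, assuming best_clf from the 1(b) sketch above.

print(f'test accuracy: {best_clf.score(X_test, y_test):.3f}')
y_hat = best_clf.predict(X_test)
print(confusion_matrix(y_test, y_hat))

# show each wrong prediction along with its 8x8 image
for i in np.where(y_hat != y_test)[0]:
    print(f'i={i}, correct y={y_test[i]}, predicted y_hat={y_hat[i]}')
    plt.figure(figsize=(2, 2))
    plt.imshow(X_test[i].reshape(8, 8), cmap='gray_r')
    plt.show()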
2. One-class classification (outlier detection)
2(a) There is an old gradebook of mine at http://pages.stat.wisc.edu/~jgillett/451/data/midtermGrades.txt.
Use pd.read_table() to read it into a DataFrame.
Hint: pd.read_table() has many parameters. Check its documentation to find three parameters to:
- Read from the given URL
- Use the separator ‘\s+’, which means ‘one or more whitespace characters’
- Skip the first 12 rows, as they are a note to students and not part of the gradebook
# ... your code here ...
df = pd.read_table(filepath_or_buffer='https://pages.stat.wisc.edu/~jgillett/451/data/midtermGrades.txt',
                   sep=r'\s+', skiprows=12)
2(b) Use clf = mixture.GaussianMixture(n_components=1) to make a one-class Gaussian model to decide which 𝑥 = (Exam1, Exam2) are outliers:
- Set a matrix X to the first two columns, Exam1 and Exam2.
- These exams were worth 125 points each. Transform scores to percentages in [0,100].
Hint: I tried the MinMaxScaler() first, but it does the wrong thing if there aren’t scores of 0 and 125 in each column. So I just multiplied the whole matrix by 100 / 125.
- Fit your classifier to X.
Hint:
- The reference page for mixture.GaussianMixture includes a fit(X, y=None) method with the comment that y is ignored (as this is an unsupervised learning algorithm, there is no 𝑦) but is present for API consistency. So we can fit with just X.
- I got a warning about "KMeans … memory leak". You may ignore this warning if you see it; I still got satisfactory results.
- Print the center 𝜇 and covariance matrix 𝛴 from the two-variable 𝑁2(𝜇,𝛴) distribution you estimated.
# ... your code here ...
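A minimal sketch, assuming df from 2(a); the first two columns are taken positionally rather than by name.

X = df.iloc[:, 0:2].to_numpy() * 100 / 125   # rescale 125-point exam scores to percentages
clf = mixture.GaussianMixture(n_components=1)
clf.fit(X)
print(f'mu = {clf.means_}')
print(f'Sigma = {clf.covariances_}')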
2(c) Here I have given you code to make a contour plot of the negative log likelihood −ln 𝑓𝜇,𝛴(𝑥) for 𝑋 ∼ 𝑁2(𝜇,𝛴), provided you have set clf.
# make contour plot of log-likelihood of samples from clf.score_samples()
margin = 10
x = np.linspace(0 - margin, 100 + margin)
y = np.linspace(0 - margin, 100 + margin)
grid_x, grid_y = np.meshgrid(x, y)
two_column_grid_x_grid_y = np.array([grid_x.ravel(), grid_y.ravel()]).T
negative_log_pdf_values = -clf.score_samples(two_column_grid_x_grid_y)
grid_z = negative_log_pdf_values
grid_z = grid_z.reshape(grid_x.shape)
plt.contour(grid_x, grid_y, grid_z, levels=10) # X, Y, Z
plt.title('(Exam1, Exam2) pairs')
Paste my code into your code cell below and add more code:
- Add black 𝑥- and 𝑦-axes. Label them Exam1 and Exam2.
- Plot the data points in blue.
- Plot 𝜇 = clf.means_ as a big lime dot.
- Overplot (i.e. plot again) in red the 8 outliers determined by a threshold consisting of the 0.02 quantile of the pdf values 𝑓𝜇,𝛴(𝑥) for each 𝑥 in X.

Hint: clf.score_samples(X) gives log likelihood, so np.exp(clf.score_samples(X)) gives the required 𝑓𝜇,𝛴(𝑥) values.
# ... your code here ...
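A sketch of the additions, to be appended after the given contour-plot code; it assumes X and clf from 2(b). The strict comparison against the 0.02 quantile is an assumption and may need adjusting to flag exactly 8 points.

plt.axhline(y=0, color='black')               # black x-axis
plt.axvline(x=0, color='black')               # black y-axis
plt.xlabel('Exam1')
plt.ylabel('Exam2')
plt.plot(X[:, 0], X[:, 1], 'b.')              # data points in blue
plt.plot(clf.means_[0, 0], clf.means_[0, 1], 'o', color='lime', markersize=15)  # mu
pdf = np.exp(clf.score_samples(X))            # f_{mu,Sigma}(x) for each x in X
threshold = np.quantile(pdf, 0.02)            # 0.02 quantile of the pdf values
outlier = (pdf < threshold)
plt.plot(X[outlier, 0], X[outlier, 1], 'r.')  # overplot the outliers in red
plt.show()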
What characterizes 7 of these 8 outliers? Write your answer in a markdown cell.
# ... your English text in a Markdown cell here ...
2(d) Write a little code to report whether, by the 0.02 quantile criterion, 𝑥= (Exam1=50, Exam2=100) is an outlier.
Hint: Compare 𝑓𝜇,𝛴(𝑥) to your threshold
# ... your code here ...
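A minimal sketch, assuming clf and threshold from the 2(c) sketch above.

x_new = np.array([[50, 100]])                        # (Exam1=50, Exam2=100)
f_new = np.exp(clf.score_samples(x_new))[0]          # pdf value at x_new
print(f'f(x)={f_new:.6g}, threshold={threshold:.6g}, outlier: {f_new < threshold}')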
3. Explore the fact that accuracy can be misleading for imbalanced data.
Here I make a fake imbalanced data set by randomly sampling 𝑦 from a distribution with 𝑃(𝑦=0)=0.980 and 𝑃(𝑦=1)=0.020.
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, weights=[0.980, 0.020],
n_clusters_per_class=1, flip_y=0.01, random_state=0)
print(f'np.bincount(y)={np.bincount(y)}; we expect about 980 zeros and 20 ones.')
print(f'np.mean(y)={np.mean(y)}; we expect the proportion of ones to be about 0.020.')
np.bincount(y)=[973 27]; we expect about 980 zeros and 20 ones.
np.mean(y)=0.027; we expect the proportion of ones to be about 0.020.
Here I split the data into 50% training and 50% testing data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
random_state=0, stratify=y)
print(f'np.bincount(y_train)={np.bincount(y_train)}')
print(f'np.mean(y_train)={np.mean(y_train)}.')
print(f'np.bincount(y_test)={np.bincount(y_test)}.')
print(f'np.mean(y_test)={np.mean(y_test)}.')
np.bincount(y_train)=[486 14]
np.mean(y_train)=0.028.
np.bincount(y_test)=[487 13].
np.mean(y_test)=0.026.
3a. Train and assess a gradient boosting model.
- Train on the training data.
- Use 100 trees of maximum depth 1 and learning rate 𝛼=0.25.
- Use random_state=0 (so that teacher, TAs, and students have a chance of getting the same results).
- Display the accuracy, precision, recall, and AUC on the test data, each with a labeled print statement using 3 decimal places so the reader can easily find each metric.
- Make an ROC curve from your classifier and the test data.
# ... your code here ...
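A minimal sketch of the train/assess block, assuming the X_train, X_test, y_train, y_test from the split above and the imports at the top of the notebook.

gbm = GradientBoostingClassifier(n_estimators=100, max_depth=1, learning_rate=0.25,
                                 random_state=0)
gbm.fit(X_train, y_train)
y_hat = gbm.predict(X_test)
print(f'accuracy:  {accuracy_score(y_test, y_hat):.3f}')
print(f'precision: {precision_score(y_test, y_hat):.3f}')
print(f'recall:    {recall_score(y_test, y_hat):.3f}')
print(f'AUC:       {roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1]):.3f}')
RocCurveDisplay.from_estimator(gbm, X_test, y_test)  # ROC curve from classifier and test data
plt.show()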
Note the high accuracy but lousy precision, recall, and AUC.
Note that since the data have about 98% 𝑦=0, we could get about 98% accuracy by just always predicting 𝑦^=0. High accuracy alone is not necessarily helpful.
3b. Now oversample the data to get a balanced data set.
- Use RandomOverSampler(random_state=0) to oversample and get a balanced data set.
- Repeat my train_test_split() block from above.
- Repeat your train/assess block from above.
# ... your code here ...
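A minimal sketch: oversample first, then repeat the split and the 3(a) block on the resampled data.

ros = RandomOverSampler(random_state=0)
X_over, y_over = ros.fit_resample(X, y)       # balanced: equal counts of y=0 and y=1
X_train, X_test, y_train, y_test = train_test_split(X_over, y_over, test_size=.5,
                                                    random_state=0, stratify=y_over)
# ... then retrain and re-assess exactly as in 3(a) ...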
Note that we traded a little accuracy for much improved precision, recall, and AUC.
If you do classification in your project and report accuracy, please also report the proportions of 𝑦=0 and 𝑦=1 in your test data so that we get insight into whether your model improves upon always guessing 𝑦^=0 or always guessing 𝑦^=1.