# CSC 411 Machine Learning & Data Mining ASSIGNMENT # 1 Logistic Regression

\$40.00

## Description

1 Logistic Regression (30 points)
Suppose you are given a training set with N training examples, D = {(x
(1), t(1)), . . . ,(x
(N)
, t(N)
)},
where each input x
(i)
is a D-dimensional vector x
(i) = (x
(i)
1
, . . . , x
(i)
D )
T and the targets are binary,
i.e., t
(i) ∈ {0, 1}. Assume a model such that for each example p(t = 1|x
(i)
, w, w0) takes the form of a
logistic function:
p(t = 1|x
(i)
, w, w0) = σ(wT x
(i) + w0) = 1
1 + exp 

PD
d=1 wdx
(i)
d − w0

The likelihood function is defined as p(t
(1), t(2), . . . , t(N)
|x
(1)
, x
(2)
, . . . , x
(N)
, w, w0)
Further, assume that for regularization purposes, a Gaussian prior is placed on the weights w =
(w1, . . . , wD)
T
such that p(w) = N (w|0, α−1
I).
1. Define the loss function to be the negative loss posterior over the weights. Write the loss
function as simplified as possible (show all your derivations). In order to do so, assume that
the data is i.i.d (i.e., every example is drawn independently and its identically distributed).
2. Derive what the gradient of the loss is with respect to wi and w0, i.e., ∂Loss
∂wi
,
∂Loss
∂w0
.
2 Decision Trees (16 points)
Assume you are given data consisting of a training set of 5 examples (Table 1) and a test set of 3
examples (Table 2).
1. Using the training data, construct a decision tree for the binary classification of customers of
the restaurant “Mama’s Pasta” into ‘Satisfied’ or ‘Unsatisfied’. Use the Information Gain (IG)
as the decision criterion to select which attribute to split on. Show your calculations for the IG
for all possible attributes for every split.
2. Now use the decision tree you have created to predict whether each of the test users will be
satisfied or not after their visit to “Mama’s Pasta”.
Person ID Overcooked pasta? Waiting time Rude waiter? Satisfied?
1 Yes Long No Yes
2 No Short Yes Yes
3 Yes Long Yes No
4 No Long Yes Yes
5 Yes Short Yes No
Table 1: Training Data
Person ID Overcooked pasta? Waiting time Rude waiter?
6 No Short No
7 Yes Long Yes
8 Yes Short No
Table 2: Test Data
3 Logistic Regression vs. KNN (54 points)
In this section you will compare the performance and characteristics of different classifiers, namely
k-Nearest Neighbors and Logistic Regression. You will extend the provided code and experiment
with these extensions. Note that you should understand the code first instead of using it as a black
box. Python1 version of the code has been provided.
The data you will be working with are hand-written digits, 2s and 8s, represented as 28×28 pixel
arrays. There are two training sets: mnist train, which contains 100 examples of each class,
and mnist train small, which contains 5 examples of each class. There is also a validation set
mnist valid (with 25 examples each class) that you should use for model selection, and a test set
mnist test.
Code for visualizing the datasets has been included in plot digits. check grad can be used for
data.
2.1 k-Nearest Neighbors (17 points)
Implement the kNN function inside run knn, you can use the utility function from l2 distance
to compute distance between different data points.
Write a separate script that runs kNN for different values of k ∈ {1, 3, 5, 7, 9} and plot the classification rate on the validation set (number of correctly predicted cases, divided by total number of data
points) as a function of k.
1. Comment on the performance of the classifier and argue which value of k you would choose.
What is the classification rate on test set of your chosen value k

? Also compute the rate for
1You should use Python 2.7 with both the Numpy and Matplotlib packages installed.
k
∗ + 2 and k
∗ − 2 on test set.
2. Does the test performance for these values of k correspond to the validation performance?2
Why or why not?
2.2 Logistic regression (17 points)
Look through the code in logistic regression template and logistic. Complete the implementation of logistic regression by providing the missing part of those two files.
Write a separate script or use logistic regression template so you can experiment with the
hyperparameters for the learning rate, the number of iterations, and the way in which you initialize
the weights.
As a practice guide, before training, you should use check grad to make sure your implementation
of gradient computation is correct; during training, if you get Nan/Inf errors, you may try to reduce
your learning rate or initialize with smaller weights. If you have a smaller learning rate, your model
will take longer to converge.
1. Report which hyperparameter settings you found worked the best and the final cross entropy
and classification error on the training (mnist train), validation and test sets. Note that you
should only compute the test error once you have selected your best hyperparameter settings
using the validation set.
2. using the your chosen hyperparameter, report how the loss (cross entropy) changes during
training. You need to submit 2 plots, one for each of mnist train and mnist train small.
In each plot show two curves one for the training set and one for the validation set. Comment
on your results. Run your code several times and observe if the results change, why or why
not?
2.3 Regularized logistic regression (20 points)
Implement the penalized logistic regression model you derived in Section 1 by modifying logistic
to include a regularizer (implement the new function logistic pen). You will need to change the
the C(α) term, since its derivative is 0 w.r.t. the weights and bias. Use check grad to verify the
Write a scripts to do experiments with different hyperparameters including different values of the
penalty parameter α. Try α ∈ {0.001, 0.01, 0.1, 1.0}. At this stage you should not be evaluating on
the test set as you will do so once you have chosen your best α.
1. Similar to Section 2.2, report the best hyperparameters and corresponding test set error. You
will also need to submit two plots for training on mnist train and mnist train small,
2
In general you shouldn’t peek at the test set multiple times, but for the purposes of this question it can be an illustrative exercise.
each should also include two curves of how the loss (negative log posterior) changes, one for
training set and the other for validation set.
2. Given other hyperprameters fixed, how does the loss (negative log posterior) and classification error change when you increase α? Do they go up, down, first up and then down, or
down and then up? Explain why you think they behave this way. Which is the best value of