Description
Linear Models for Handwritten Digits Classification: In this assignment, you will implement the binary logistic regression model and multi-class logistic regression model on a partial
dataset from MNIST. In this classification task, the model will take a 16 ×16 image of handwritten
digits as inputs and classify the image into different classes. For the binary case, the classes are 1
and 2 while for the multi-class case, the classes are 0, 1, and 2. The “data” fold contains the dataset
which has already been split into a training set and a testing set. All data examples are saved in
dictionary-like objects using “npz” file. For each data sample, the dictionary key ‘x’ indicates its
raw features, which are represented by a 256-dimensional vector where the values between [−1, 1]
indicate grayscale pixel values for a 16 × 16 image. In addition, the key ’y’ is the label for a data
example, which can be 0, 1, or 2. The “code” fold provides the starting code. You must implement
the models using the starting code.
1. Data Preprocessing [15 points]: In this problem, you need to finish “code/DataReader.py”.
(a) Explain what the function train valid split does and why we need this step.
(b) Before testing, is it correct to re-train the model on the whole training set? Explain
your answer.
1
(c) In this assignment, we use two hand-crafted features:
The first feature is a measure of symmetry. For a 16 × 16 image x, it is defined as
Fsymmetry = −
P
pixel |x − flip(x)|
256
,
where 256 is the number of pixels and flip(·) means left and right flipping.
The second feature is a measure of intensity. For a 16 × 16 image x, it is defined as
Fintensity =
P
pixel x
256
,
which is simply the average of pixel values.
Implement them in the function prepare X.
(d) In the function prepare X, there is a third feature which is always 1. Explain why we
need it.
(e) The function prepare y is already finished. Note that the returned indices stores the
indices for data from class 1 and 2. Only use these two classes for binary classification
and convert the labels to +1 and -1 if necessary.
(f) Test your code in “code/main.py” and visualize the training data from class 1 and 2 by
implementing the function visualize features. The visualization should not include the
third feature. Therefore it is a 2-D scatter plot. Include the figure in your submission.
2. Cross-entropy loss [20 points]: In logistic regression, we use the cross-entropy loss.
(a) Write the loss function E(w) for one training data sample (x, y). Note that the binary
labels are 1 and −1.
(b) Compute the gradient ∇E(w). Please provide intermediate steps of derivation.
(c) Once the optimal w is obtained, it can be used to make predictions as follows:
Predicted class of x =
(
1 if θ(w
T x) ≥ 0.5
−1 if θ(w
T x) < 0.5
where the function θ(z) = 1
1+e−z looks like
However, this is not the most efficient way since the decision boundary is linear. Why?
Explain it. When will we need to use the sigmoid function in prediction?
(d) Is the decision boundary still linear if the prediction rule is changed to the following?
Justify briefly.
Predicted label of x =
(
1 if θ(w
T x) ≥ 0.9
−1 if θ(w
T x) < 0.9
2
(e) In light of your answers to the above two questions, what is the essential property of
logistic regression that results in the linear decision boundary?
3. Sigmoid logistic regression [25 points]: In this problem, you need to finish “code/LogisticRegression.py”.
Please follow the instructions in the starting code. Please use data from class 1
and 2 for the binary classification.
(a) Based on (b) in the last problem, implement the function gradient.
(b) There are different ways to train a logistic regression model. In this assignment, you
need to implement batch gradient descent, stochastic gradient descent and mini-batch
gradient descent in the functions f it BGD, f it SGD and f it miniBGD, respectively.
Note that: In batch gradient descent, each model update is based on all training samples.
In stochastic gradient descent, each model update is based on one training sample. In
mini-batch gradient descent, you split your training data into (roughly equal-sized) minibatches, and each model update is based on training samples in each mini-batch. Thus,
BGD and SDG are actually special cases of mini-BGD. An epoch is completed when
each training sample is used to update the model exactly once. In SDG and mini-BGD,
you might want to reshuffle your samples before each epoch.
(c) Implement the functions predict and score for prediction and evaluation, respectively.
Additionally, please implement the function predict proba which outputs the probabilities of both classes.
(d) Test your code in “code/main.py” and visualize the results after training by using the
function visualize results. Include the figure in your submission.
(e) Implement the testing process and report the test accuracy of your best logistic regression
model.
4. Softmax logistic regression [20 points]: In this problem, you need to finish “code/LRM.py”.
Please follow the instructions in the starting code.
(a) Based on the course notes, implement the function gradient.
(b) In this assignment, you only need to implement mini-batch gradient descent in the
function f it miniBGD.
(c) Implement the functions predict and score for prediction and evaluation, respectively.
(d) Test your code in “code/main.py” and visualize the results after training by using the
function visualize results multi. Include the figure in your submission.
(e) Implement the testing process and report the test accuracy of your best logistic regression
model.
5. Softmax logistic vs Sigmoid logistic [20 points]: In this problem, you need to experimentally
compare these two methods. Please follow the instructions in the starting code. Use
data examples from class 1 and 2 for classification.
(a) Train the softmax logistic classifier and the sigmoid logistic classifier using the same
data until convergence. Compare these two classifiers and report your observations and
insights.
(b) Explore the training of these two classifiers and monitor the graidents/weights. How can
we set the learning rates so that w1 − w2 = w holds for all training steps?
3