## Description

1 Introduction

In this assignment, we will extend the first programming assignment in solving the problem of handwritten

digit classification. In particular, your task is to implement Logistic Regression and use the Support Vector

Machine tool in sklearn.svm.SVM to classify hand-written digit images and compare the performance of

these methods.

To get started with the exercise, you will need to download the supporting files and unzip its contents to

the directory you want to complete this assignment.

1.1 Datasets

In this assignment, we still use the same data set of first programming assignment – MNIST. In the script

file provided to you, we have implemented a function preprocess() with preprocessing steps (apply feature

selection and feature normalization) and divide data set into 3 parts: training set, validation set and testing

set.

2 Your tasks

• Implement Logistic Regression and give the prediction results.

• Use Support Vector Machine (SVM) toolbox to perform classification.

• Write a report to explain the experimental results with these 2 methods.

• Extra credit: Implement the direct minimization of multi-class Logistic Regression (using softmax

function).

1

2.1 Logistic Regression

Consider x ∈ R

D as an input vector. We want to classify x into correct class C1 or C2 (denoted as a random

variable y). In Logistic Regression, the posterior probability of class C1 can be written as follow:

P(y = C1|x) = σ(wT x + w0)

where w ∈ R

D is the weight vector.

For simplicity, we will denote x = [1, x1, x2, · · · , xD] and w = [w0, w1, w2, · · · , wD]. With this new notation,

the posterior probability of class C1 can be rewritten as follow:

P(y = C1|x) = σ(wT x) (1)

And posterior probability of class C2 is:

P(y = C2|x) = 1 − P(y = C1|x)

We now consider the data set {x1, x2, · · · , xN } and corresponding label {y1, y2, · · · , yN } where

yi =

1 if xi ∈ C1

0 if xi ∈ C2

for i = 1, 2, · · · , N.

With this data set, the likelihood function can be written as follow:

p(y|w) = Y

N

n=1

θ

yn

n

(1 − θn)

1−yn

where θn = σ(wT xn) for n = 1, 2, · · · , N.

We also define the error function by taking the negative logarithm of the log likelihood, which gives the

cross-entropy error function of the form:

E(w) = −

1

N

ln p(y|w) = −

1

N

X

N

n=1

{yn ln θn + (1 − yn) ln(1 − θn)} (2)

Note that this function is different from the squared loss function that we have used for Neural Networks

and Perceptrons.

The gradient of error function with respect to w can be obtained as follow:

∇E(w) = 1

N

X

N

n=1

(θn − yn)xn (3)

Up to this point, we can use again – gradient descent – to find the optimal weight wb to minimize the error

function with the formula:

wnew = wold − η∇E(wold) (4)

2.1.1 Implementation

You are asked to implement Logistic Regression to classify hand-written digit images into correct corresponding labels. The data is the same as the one you used for the first programming assignment. Since the

labels associated with each digit can take one out of 10 possible values (multiple classes), we cannot directly

use a binary logistic regression classifier. Instead, we employ the one-vs-all strategy. In particular, you have

to build 10 binary-classifiers (one for each class) to distinguish that class from all other classes. In order to

implement Logistic Regression, you have to complete function blrObjFunction() provided in the base code

(script.py). The input of blrObjFunction.m includes 3 parameters:

2

• X is a data matrix where each row contains a feature vector in original coordinate (not including the

bias 1 at the beginning of vector). In other words, X ∈ R

N×D. So you have to add the bias into

each feature vector inside this function. In order to guarantee the consistency in the code and utilize

automatic grading, please add the bias at the beginning of feature vector instead of the end.

• wk is a column vector representing the parameters of Logistic Regression. Size of wk is (D + 1) × 1.

• yk is a column vector representing the labels of corresponding feature vectors in data matrix X. Each

entry in this vector is either 1 or 0 to represent whether the feature vector belongs to a class Ck or not

(k = 0, 1, · · · , K − 1). Size of yk is N × 1 where N is the number of rows of X. The creation of yk is

already done in the base code.

Function blrObjFunction() has 2 outputs:

• error is a scalar value which is the result of computing equation (2)

• error grad is a column vector of size (D + 1) × 1 which represents the gradient of error function

obtained by using equation (3).

For prediction using Logistic Regression, given 10 weights vector of 10 classes, we need to classify a feature

vector into a certain class. In order to do so, given a feature vector x, we need to compute the posterior

probability P(y = Ck|x) and the decision rule is to assign x to class Ck that maximizes P(y = Ck|x). In

particular, you have to complete the function blrPredict() which returns the predicted label for each feature

vector. Concretely, the input of blrPredict() includes 2 parameters:

• Similar to function blrObjFunction(), X is also a data matrix where each row contains a feature vector

in original coordinate (not including the bias 1 at the beginning of vector). In other words, X has size

N × D. In order to guarantee the consistency in the code and utilize automatic grading, please add

the bias at the beginning of feature vector instead of the end.

• W is a matrix where each column is a weight vector (wk) of classifier for digit k. Concretely, W has

size (D + 1) × K where K = 10 is the number of classifiers.

The output of function blrPredict() is a column vector label which has size N × 1.

2.1.2 For Extra Credit – Direct Multi-class Logistic Regression

In this part, you are asked to implement multi-class Logistic Regression. Traditionally, Logistic Regression

is used for binary classification. However, Logistic Regression can also be extended to solve the multi-class

classification. With this method, we don’t need to build 10 classifiers like before. Instead, we now only need

to build 1 classifier that can classify 10 classes at the same time.

For multi-class Logistic Regression, the posterior probabilities are given by a softmax transformation of

linear functions of the feature variables, so that

P(y = Ck|x) = exp(wT

k x)

P

j

exp(wT

j x)

(5)

Now we write down the likelihood function. This is most easily done using the 1-of-K coding scheme in

which the target vector yn for a feature vector xn belonging to class Ck is a binary vector with all elements

zero except for element k, which equals one. The likelihood function is then given by

P(Y|w1, · · · , wK) = Y

N

n=1

Y

K

k=1

P(y = Ck|xn)

ynk =

Y

N

n=1

Y

K

k=1

θ

ynk

nk (6)

where θnk is given by (5) and Y is an N × K matrix (obtained using 1-of-K encoding) of target variables

with elements ynk. Taking the negative logarithm then gives

E(w1, · · · , wK) = − ln P(Y|w1, · · · , wK) = −

X

N

n=1

X

K

k=1

ynk ln θnk (7)

3

which is known as the cross-entropy error function for the multi-class classification problem.

We now take the gradient of the error function with respect to one of the parameter vectors wk . Making

use of the result for the derivatives of the softmax function, we obtain:

∂E(w1, · · · , wK)

∂wk

=

X

N

n=1

(θnk − ynk)xn (8)

then we could use the following updating function to get the optimal parameter vector w iteratively:

wnew

k ← wold

k − η

∂E(w1, · · · , wK)

∂wk

(9)

Note: This part is for the extra 20 credits.

2.2 Support Vector Machines

In this part of assignment, you are asked to use the Support Vector Machine tool in sklearn.svm.SVM to

perform classification on our data set. The details about the tool are provided here: http://scikit-learn.

org/stable/modules/generated/sklearn.svm.SVC.html.

Your task is to fill the code in Support Vector Machine section of script.py to learn the SVM model and

compute accuracy of prediction with respect to training data, validation data and testing using the following

parameters:

• Using linear kernel (all other parameters are kept default).

• Using radial basis function with value of gamma setting to 1 (all other parameters are kept default).

• Using radial basis function with value of gamma setting to default (all other parameters are kept

default).

• Using radial basis function with value of gamma setting to default and varying value of C (1, 10, 20, 30, · · · , 100)

and plot the graph of accuracy with respect to values of C in the report.

3 Submission

You are required to submit a single file called proj3.zip using UBLearns.

File proj3.zip must contain 3 files: report, params.pickle and script.py. The params.pickle file should contain

the weight matrix,W, learnt for the logistic regression

• Submit your report in a pdf format. Please indicate the team members, group number, and your course

number on the top of the report.

• The code file should contain all implemented functions. Please do not change the name of the file.

Using UBLearns Submission: Continue using the groups that you were in for programming

assignment 2. You should submit one solution per group through the groups page. If you want to

change the group, contact the instructors.

Project report: The hard-copy of report will be collected in class at due date. Your report should include

the experimental results you have performed using Logistic Regression and Support Vector Machine.

4

4 Grading scheme

• Implementation:

X blrObjFunction(): 20 points

X blrPredict(): 20 points

X script.py: 20 points (your code in SVM section)

• Project report: 30 points

• Accuracy of classification methods: 10 points

• Extra Credit: 20 points.

5