## Description

1 Introduction

In this assignment, your task is to implement a Multilayer Perceptron Neural Network and evaluate its

performance in classifying handwritten digits. After completing this assignment, you should be able to

understand:

• How Neural Network works and use Feed Forward, Back Propagation to implement Neural Network?

• How to setup a Machine Learning experiment on real data?

• How regularization plays a role in the bias-variance tradeoff?

To get started with the exercise, you will need to download the supporting files and unzip its contents to

the directory you want to complete this assignment.

Warning: In this project, you will have to handle many computing intensive tasks such as training

a neural network. Our suggestion is to use the CSE server Metallica (this server is dedicated to intensive

computing tasks) to run your computation. In addition, training such a big dataset will take a very long

time, maybe many hours or even days to complete. Therefore, we suggest that you should start doing

this project as soon as possible so that the computer will have time to do heavy computational jobs.

1.1 File included in this exercise

• mnist all.mat: original dataset from MNIST. In this file, there are 10 matrices for testing set and 10

matrices for training set, which corresponding to 10 digits. You will have to split the training data

into training and validation data.

• nnScript.py: Python script for this programming project. Contains function definitions –

– preprocess(): performs some preprocess tasks, and output the preprocessed train, validation and

test data with their corresponding labels. You need to make changes to this function.

– sigmoid(): compute sigmoid function. The input can be a scalar value, a vector or a matrix. You

need to make changes to this function.

– nnObjFunction(): compute the error function of Neural Network. You need to make changes to

this function.

– nnPredict(): predicts the label of data given the parameters of Neural Network. You need to make

changes to this function.

– initializeWeights(): return the random weights for Neural Network given the number of unit in

the input layer and output layer.

1

1.2 Datasets

The MNIST dataset [1] consists of a training set of 60000 examples and test set of 10000 examples. All

digits have been size-normalized and centered in a fixed image of 28 × 28 size. In original dataset, each pixel

in the image is represented by an integer between 0 and 255, where 0 is black, 255 is white and anything

between represents different shade of gray.

In many research papers, the official training set of 60000 examples is divided into an actual training set of

50000 examples and validation set of 10000 examples. So we suggest that you should follow this convention.

2 Your tasks

• Implement Neural Network (forward pass and back propagation)

• Incorporate regularization on the weights (λ)

• Use validation set to tune hyper-parameters for Neural Network (number of units in the hidden layer

and λ).

• Write a report to explain the experimental results.

3 Some practical tips in implementation

3.1 Preprocessing

The preprocessing needs to do three things:

1. Stack all training matrices into one 60000 × 784 matrix. Do the same for test matrices.

2. Create a 60000 length vector with true labels (digits) for each training example. Same for test data.

3. Normalize the training matrix and test matrix so that the values are between 0 and 1.

4. Randomly split the 60000 × 784 normalized matrix into two matrices: training matrix (50000 × 784)

and validation matrix (10000 × 784). Make sure you split the true labels vector into two parts as well.

5. Feature selection – see next subsection.

3.2 Feature selection

In the dataset, one can observe that there are many features which values are exactly the same for all data

points in the training set. With those features, the classification models cannot gain any more information

about the difference (or variation) between data points. Therefore, we can ignore those features in the

pre-processing step.

Later on in this course, you will learn more sophisticated models to reduce the dimension of dataset (but

not for this assignment).

3.3 Neural Network

3.3.1 Neural Network Representation

Neural network can be graphically represented as in Figure 1.

As observed in the Figure 1, there are totally 3 layers in the neural network:

• The first layer comprises of (d + 1) units, each represents a feature of image (there is one extra unit

representing the bias).

2

Figure 1: Neural network

• The second layer in neural network is called the hidden units. In this document, we denote m + 1

as the number of hidden units in hidden layer. There is an additional bias node at the hidden layer

as well. Hidden units can be considered as the learned features extracted from the original data set.

Since number of hidden units will represent the dimension of learned features in neural network, it’s

our choice to choose an appropriate number of hidden units. Too many hidden units may lead to the

slow training phase while too few hidden units may cause the the under-fitting problem.

• The third layer is also called the output layer. The value of l

th unit in the output layer represents

the probability of a certain hand-written image belongs to digit l. Since we have 10 possible digits,

there are 10 units in the output layer. In this document, we denote k as the number of output units

in output layer.

The parameters in Neural Network model are the weights associated with the hidden layer units and the

output layers units. In our standard Neural Network with 3 layers (input, hidden, output), in order to

represent the model parameters, we use 2 matrices:

• W(1) ∈ R

m×(d+1) is the weight matrix of connections from input layer to hidden layer. Each row in

this matrix corresponds to the weight vector at each hidden layer unit.

• W(2) ∈ R

k×(m+1) is the weight matrix of connections from hidden layer to output layer. Each row in

this matrix corresponds to the weight vector at each output layer unit.

We also further assume that there are n training samples when performing learning task of Neural Network.

In the next section, we will explain how to perform learning in Neural Network.

3.3.2 Feedforward Propagation

In Feedforward Propagation, given parameters of Neural Network and a feature vector x, we want to compute

the probability that this feature vector belongs to a particular digit.

Suppose that we have totally m hidden units. Let aj for 1 ≤ j ≤ m be the linear combination of input

data and let zj be the output from the hidden unit j after applying an activation function (in this exercise,

we use sigmoid as an activation function). For each hidden unit j (j = 1, 2, · · · , m), we can compute its

value as follow:

aj =

X

d+1

i=1

w

(1)

ji xi (1)

3

zj = σ(aj ) = 1

1 + exp(−aj )

(2)

where w

(1)

ji = W(1)[j][i] is the weight of connection from unit i in input layer to unit j in hidden layer. Note

that we do not compute the output for the bias hidden node (m + 1); zm+1 is directly set to 1.

The third layer in neural network is called the output layer where the learned features in hidden units

are linearly combined and a sigmoid function is applied to produce the output. Since in this assignment,

we want to classify a hand-written digit image to its corresponding class, we can use the one-vs-all binary

classification in which each output unit l (l = 1, 2, · · · , k) in neural network represents the probability of an

image belongs to a particular digit. For this reason, the total number of output units is k = 10. Concretely,

for each output unit l (l = 1, 2, · · · , k), we can compute its value as follow:

bl =

mX

+1

j=1

w

(2)

lj zj (3)

ol = σ(bl) = 1

1 + exp(−bl)

(4)

Now we have finished the Feedforward pass.

3.3.3 Error function and Backpropagation

The error function in this case is the squared loss error function. For the p

th input example, the error can

be expressed as:

Jp(W(1), W(2)) = 1

2

X

k

l=1

(ypl − opl)

2

(5)

where ypl indicates the l

th target value in 1-of-K coding scheme of input data p and opl is the output at l

th

output node for the p

th data example (See (4)).

The total error for the entire training data is simply an average over all training examples:

J(W(1), W(2)) = 1

n

Xn

p=1

Jp(W(1), W(2)) = 1

n

Xn

p=1

1

2

X

k

l=1

(ypl − opl)

2

(6)

One way to learn the model parameters in neural networks is to initialize the weights to some random

numbers and compute the output value (feed-forward), then compute the error in prediction, transmit this

error backward and update the weights accordingly (error backpropagation).

The feed-forward step can be computed directly using formula (1), (2), (3) and (4).

On the other hand, the error backpropagation step requires computing the derivative of error function

with respect to the weight.

Consider the derivative of error function with respect to the weight from the hidden unit j to output

unit l where j = 1, 2, · · · , m + 1 and l = 1, · · · , k:

∂Jp

∂w(2)

lj

=

∂Jp

∂ol

∂ol

∂bl

∂bl

∂w(2)

lj

(7)

= −δlzj (8)

where

δl =

∂Jp

∂ol

∂ol

∂bl

= (yl − ol)(1 − ol)ol (9)

Note that we are dropping the subscript p for simplicity.

4

On the other hand, the derivative of error function with respect to the weight from the input unit i to

output unit j where i = 1, 2, · · · , d + 1 and j = 1, · · · , m can be computed as follow:

∂Jp

∂w(1)

ji

=

X

k

l=1

∂Jp

∂ol

∂ol

∂bl

∂bl

∂zj

∂zj

∂aj

∂aj

∂w(1)

ji

(10)

= −

X

k

l=1

δlw

(2)

lj (1 − zj )zjxi (11)

= −(1 − zj )zj (

X

k

l=1

δlw

(2)

lj )xi (12)

where, δl

is calculated using (9). Note that we do not compute the gradient for the weights at the bias

hidden node.

After finishing computing the derivative of error function with respect to weight of each connection in

neural network, we now can write the formula for the gradient of error function:

∇J(W(1), W(2)) = 1

n

Xn

p=1

∇Jp(W(1), W(2)) (13)

We again can use the gradient descent to update each weight (denoted in general as w) with the following

rule:

w

new = w

old − η∇J(w

old) (14)

3.3.4 Regularization in Neural Network

In order to avoid overfitting problem (the learning model is best fit with the training data but give poor

generalization when test with validation data), we can add a regularization term into our error function to

control the magnitude of parameters in Neural Network. Therefore, our objective function can be rewritten

as follow:

Je(W(1), W(2)) = J(W(1), W(2)) + λ

2n

Xm

j=1

X

d+1

i=1

(w

(1)

ji )

2 +

X

k

l=1

mX

+1

j=1

(w

(2)

lj )

2

(15)

where λ is the regularization coefficient.

With this new objective function, the partial derivative of new objective function with respect to weight

from hidden layer to output layer can be calculated as follow:

∂Je

∂w(2)

lj

=

1

n

Xn

p=1

∂Jp

∂w(2)

lj

+ λw(2)

lj !

(16)

Similarly, the partial derivative of new objective function with respect to weight from input layer to hidden

layer can be calculated as follow:

∂Je

∂w(1)

ji

=

1

n

Xn

p=1

∂Jp

∂w(1)

ji

+ λw(1)

ji !

(17)

With this new formulas for computing objective function (15) and its partial derivative with respect to

weights (16) (17) , we can again use gradient descent to find the minimum of objective function.

3.3.5 Python implementation of Neural Network

In the supporting files, we have provided the base code for you to complete. In particular, you have to

complete the following function in Python:

• sigmoid: compute sigmoid function. The input can be a scalar value, a vector or a matrix.

5

• nnObjFunction: compute the objective function of Neural Network with regularization and the gradient

of objective function.

• nnPredict: predicts the label of data given the parameters of Neural Network.

Details of how to implement the required functions is explained in Python code.

Optimization: In general, the learning phase of Neural Network consists of 2 tasks. First task is

to compute the value and gradient of error function given Neural Network parameters. Second task

is to optimize the error function given the value and gradient of that error function. As explained

earlier, we can use gradient descent to perform the optimization problem. In this assignment, you

have to use the Python scipy function: scipy.optimize.minimize (using the option method=’CG’

for conjugate gradient descent), which performs the conjugate gradient descent algorithm to perform

optimization task. In principle, conjugate gradient descent is similar to gradient descent but it chooses

a more sophisticated learning rate η in each iteration so that it will converge faster than gradient

descent. Details of how to use minimize are provided here: http://docs.scipy.org/doc/scipy-0.

14.0/reference/generated/scipy.optimize.minimize.html.

We use regularization in Neural Network to avoid overfitting problem (more about this will be discussed

in class). You are encouraged to change different value of λ to see its effect in prediction accuracy in

validation set. Your report should include diagrams to explain the relation between λ and performance of

Neural Network. Moreover, by plotting the value of λ with respect to the accuracy of Neural Network, you

should explain in your report how to choose an appropriate hyper-parameter λ to avoid both underfitting

and overfitting problem. You can vary λ from 0 (no regularization) to 1 in increments of 0.1 or 0.2.

You are also encouraged to try different number hidden units to see its effect to the performance of Neural

Network. Since training Neural Network is very slow, especially when the number hidden units in Neural

Network is large. You should try with small hidden units and gradually increase the size and see how it

effects the training time. Your report should include some diagrams to explain relation between number of

hidden units and training time. Recommended values: 4, 8, 12, 16, 20.

4 Submission

You are required to submit a single file called proj1.zip using UBLearns.

File proj1.zip must contain 2 folders: report and code.

• Folder report contains your report file (in pdf format). Please indicate your group number, the

team members and your course number on the top of the report.

• Folder code must contains the following updated files: nnScript.py and params.pickle1

. File params.pickle

contains the learned parameters of Neural Network. Concretely, file params.pickle must contain the

following variables: optimal n hidden (number of units in hidden layer), w1 (matrix of weight W(1)

as mentioned in section 3.2.1), w2 (matrix of weight W(2) as mentioned in section 3.2.1), optimal λ

(regularization coeffient λ as mentioned in section 3.2.4).2

Using UBLearns Submission: In the groups page of the UBLearns website you will see groups

called “Programming Assignment 1 Group x”. Please choose any available group number for your group

and join the group. All project group members must join the same group. Please do not join any other

group on UBLearns that you are not part of. You should submit one solution per group through the

groups page.

1Check this to learn how to pickle objects in Python: https://wiki.python.org/moin/UsingPickle

2

If you want to write more supporting functions to complete the required functions, you should include these supporting

functions and a README file which explains your supporting functions.

6

Project report: The hard-copy of report will be collected in class at due date. Your report should include

the following:

• Explanation of how to choose the hyper-parameters for Neural Network (number of hidden units,

regularization term λ).

5 Grading scheme

• Succcessfully implement Neural Network: 60 points (preprocess() [10 points], sigmoid() [10 points],

nnObjFunction() [30 points], nnPredict() [10 points]).

• Project report: 30 points

– Explanation with supporting figures of how to choose the hyper-parameter for Neural Network:

30 points

• Accuracy of classification method on the test data: 10 points

References

[1] LeCun, Yann; Corinna Cortes, Christopher J.C. Burges. “MNIST handwritten digit database”.

[2] Bishop, Christopher M. “Pattern recognition and machine learning (information science and statistics).”

(2007).

7