Description
1 Overview
Deep learning has quickly become one of the most applied machine learning techniques
in computer vision. Convolutional neural networks have been applied to many different
computer vision problems such as image classification, recognition, and segmentation with
great success. In this assignment, you will first implement a fully connected feed forward
neural network for handwritten character classification. Then in the second part, you will
implement a system to locate characters in an image, which you can then classify with your
deep network. The end result will be a system that, given an image of handwritten text,
will output the text contained in the image.
1.1 Basic Use
Here we will give a brief overview of the math for a single hidden layer feed forward network.
For a more detailed look at the math and derivation, please see the class slides.
Figure 2: Example of a single hidden layer network
A fully-connected network, f, for classification, applies a series of linear and non-linear
functions to an input data vector x of size N × 1 to produce an output vector f(x) of size
C × 1, where each element i of the output vector represents the probability of x belonging
to the class i. Since the data samples are of dimensionality N, this means the input layer
has N input units. To compute the value of the output units, we must first compute the
values of all the hidden layers. The first hidden layer pre-activation $a^{(1)}(x)$ is given by

$$a^{(1)}(x) = W^{(1)}x + b^{(1)}$$

Then the post-activation values of the first hidden layer $h^{(1)}(x)$ are computed by applying a non-linear activation function $g$ to the pre-activation values

$$h^{(1)}(x) = g(a^{(1)}(x)) = g(W^{(1)}x + b^{(1)})$$
The pre- and post-activations of subsequent hidden layers (1 < t ≤ T) are given by:

$$a^{(t)}(x) = W^{(t)}h^{(t-1)}(x) + b^{(t)}$$

$$h^{(t)}(x) = g(a^{(t)}(x))$$
The output layer pre-activations $a^{(T)}(x)$ are computed in a similar way

$$a^{(T)}(x) = W^{(T)}h^{(T-1)}(x) + b^{(T)}$$

and finally the post-activation values of the output layer are computed with

$$f(x) = o(a^{(T)}(x)) = o(W^{(T)}h^{(T-1)}(x) + b^{(T)})$$
where o is the output activation function.
For this assignment, we will be using the sigmoid activation function for the hidden layers, so:

$$g(y) = \frac{1}{1 + \exp(-y)}$$

where, when $g$ is applied to a vector, it is applied element-wise across the vector.
Since we are using this network for classification, a common output activation function to use is the softmax function. This allows us to map the real-valued activations $a^{(T)}(x)$ into a probability distribution (a vector of positive numbers that sum to 1). Letting $y_i$ denote the $i$-th element of the vector $y$, the softmax function is defined as:

$$o_i(y) = \frac{\exp(y_i)}{\sum_j \exp(y_j)}$$
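For reference, both activation functions are one-liners in MATLAB. A minimal sketch (the max-subtraction in the softmax is a standard numerical-stability trick, not part of the definition above):

    g = @(y) 1 ./ (1 + exp(-y));                             % element-wise sigmoid
    softmax = @(y) exp(y - max(y)) ./ sum(exp(y - max(y)));  % positive, sums to 1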
Q1.1.1 “Theory” [5 points] In training deep networks, the ReLU activation function is
generally preferred to the sigmoid activation function. Why might this be the case?
Q1.1.2 Theory [5 points] All types of deep networks use non-linear activation functions
for their hidden layers. Suppose we have a neural network with input dimension N and
output dimension C and T hidden layers. Prove that if we have a linear activation function
g, then the number of hidden layers has no effect on the representation capability of the
network (i.e., that the set of functions that can be represented by a T layer network is
exactly the same as the set that can be represented by a T′ ≠ T layer network).
2 Implement a Fully Connected Network
In this section, you will implement all of the functions needed to initialize, train, evaluate,
and use the network. Throughout this assignment, the network will be represented by its
parameters W and b, the weights and biases of the network.
2.1 Network Initialization
In order to use a deep network, we must first initialize the weights and biases in the network.
This is typically done with a random initialization, or initializing the weights from some
other training procedure (which we will see in Section 3). If you are interested in issues
relating to initialization, [2] is a good paper to look at.
Q2.1.1 Theory [5 points] Why is it not a good idea to initialize a network with all zeros?
How about all ones, or some other constant value? (Hint: Consider what the gradients from
backpropagation will look like.)
Q2.1.2 Code [5 points] Implement [W, b] = InitializeNetwork(layers). This function should take as input the sizes of the layers for your neural network. The layers parameter will be a vector of at least 3 integers. The first element denotes the size of the data
layer N, the last element denotes the size of the output layer/number of classes C, and the
intermediate elements denote the size of the hidden layers. You should return W and b which
are both cell arrays containing length(layers)-1 elements. W should contain at least two
weight matrices of appropriate size, and b should contain at least two bias vectors.
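One reasonable strategy, sketched below, is to draw small Gaussian weights scaled by the layer sizes (in the spirit of [2]) and use zero biases. This is an illustration of the interface, not the required answer to Q2.1.3:

    function [W, b] = InitializeNetwork(layers)
        % layers(1) = N, layers(end) = C; W{i} maps layer i to layer i+1.
        num = length(layers) - 1;
        W = cell(num, 1);
        b = cell(num, 1);
        for i = 1:num
            W{i} = randn(layers(i+1), layers(i)) * sqrt(2 / (layers(i) + layers(i+1)));
            b{i} = zeros(layers(i+1), 1);
        end
    end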
Q2.1.3 Writeup [5 points] Describe the initialization you implemented in Q2.1.2 and
any reasoning behind why you chose that strategy.
2.2 Forward Propagation
Section 1 has the math for forward propagation. The loss function generally used for classification is the cross-entropy loss.
$$L_f(D) = -\sum_{(x,y) \in D} \log\left(y \cdot f(x)\right)$$

Here $D$ is the full training dataset of data samples $x$ ($N \times 1$ vectors) and labels $y$ ($C \times 1$ one-hot vectors).
Q2.2.1 Code [10 points] Implement [out, act_a, act_h] = Forward(W, b, X), which runs
forward propagation on an input data sample X, using the network defined by the parameters
W and b. Both W and b are cell arrays containing the network parameters as initialized by
InitializeNetwork(..). X is a vector of dimensions N × 1. This function should return
the final softmax output of the network in the variable out, which is of size C × 1. The
function must also return the cell arrays act_a and act_h, which contain the pre- and
post-activations of the network on this input sample.
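A minimal sketch of the forward pass under the conventions of Section 1 (sigmoid at every hidden layer, softmax at the output). Whether act_h should hold the softmax output at the last layer is a convention choice; this sketch assumes it does:

    function [out, act_a, act_h] = Forward(W, b, X)
        T = numel(W);
        act_a = cell(T, 1);
        act_h = cell(T, 1);
        h = X;
        for t = 1:T
            act_a{t} = W{t} * h + b{t};
            if t < T
                h = 1 ./ (1 + exp(-act_a{t}));        % sigmoid hidden layer
            else
                e = exp(act_a{t} - max(act_a{t}));
                h = e / sum(e);                       % softmax output layer
            end
            act_h{t} = h;
        end
        out = act_h{T};
    end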
Q2.2.2 Code [5 points] Implement [outputs] = Classify(W, b, data). This function should accept the network parameters W and b, as well as a D × N matrix of data
samples. outputs should contain the softmax output of the network for each of the data
samples.
Q2.2.3 Code [10 points] Implement [accuracy, loss] = ComputeAccuracyAndLoss(W,
b, data, labels). This function should compute the accuracy of the network with respect to the provided data matrix of size D × N and labels matrix of size D × C. You
should also compute and return the average cross-entropy loss on the dataset.
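A sketch of one way to compute both quantities, assuming data stores one sample per row and labels is one-hot, as described above:

    function [accuracy, loss] = ComputeAccuracyAndLoss(W, b, data, labels)
        D = size(data, 1);
        correct = 0;
        loss = 0;
        for i = 1:D
            out = Forward(W, b, data(i, :)');       % C x 1 softmax output
            [~, pred] = max(out);
            [~, truth] = max(labels(i, :));
            correct = correct + (pred == truth);
            loss = loss - log(labels(i, :) * out);  % cross-entropy: -log(y . f(x))
        end
        accuracy = correct / D;
        loss = loss / D;                            % average over the dataset
    end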
2.3 Backwards Propagation
Gradient descent is an iterative optimization algorithm used to find a local optimum. To find a local minimum, we start at a point on the function and move in the direction of the negative gradient (steepest descent) until some stopping criterion is met.
The update equations for a general weight $W^{(t)}_{ij}$ and bias $b^{(t)}_i$ are

$$W^{(t)}_{ij} = W^{(t)}_{ij} - \alpha \frac{\partial L_f}{\partial W^{(t)}_{ij}}(x) \qquad b^{(t)}_i = b^{(t)}_i - \alpha \frac{\partial L_f}{\partial b^{(t)}_i}(x)$$

where $\alpha$ is the learning rate. Please refer to the class slides for more details on how to derive the gradients. Note that here we are using the softmax loss (which is different from the derivations we saw in class).
Q2.3.1 Code [30 points] Implement [grad_W, grad_b] = Backward(W, b, X, Y, act_h, act_a), which runs the back propagation algorithm on a single input example X and the ground truth vector Y. The function takes as input the network parameters W and b and the network pre- and post-activations act_a and act_h from calling Forward on X. The function should return the computed gradient updates for the network parameters. grad_W and grad_b are both cell arrays identical in shape to those produced by InitializeNetwork(..), but containing the gradient updates to make to each of the network parameters.
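A sketch of the backward pass, assuming the sigmoid/softmax setup above and that act_h{end} holds the softmax output (as in the Forward sketch). With cross-entropy loss, the output-layer error simplifies to out − Y; act_a goes unused here because the sigmoid derivative can be computed from the post-activations:

    function [grad_W, grad_b] = Backward(W, b, X, Y, act_h, act_a)
        T = numel(W);
        grad_W = cell(T, 1);
        grad_b = cell(T, 1);
        delta = act_h{T} - Y;                 % softmax + cross-entropy gradient
        for t = T:-1:1
            if t > 1
                prev = act_h{t-1};
            else
                prev = X;
            end
            grad_W{t} = delta * prev';
            grad_b{t} = delta;
            if t > 1
                % back-propagate through the sigmoid: g'(a) = h .* (1 - h)
                delta = (W{t}' * delta) .* prev .* (1 - prev);
            end
        end
    end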
Q2.3.2 Code [5 points] Implement
[W, b] = UpdateParameters(W, b, grad_W, grad_b, learning_rate), which computes and returns the updated network parameters W and b. The function is given the old parameters, the parameter gradients grad_W and grad_b returned by Backward(..), and the supplied learning rate.
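This function is a direct transcription of the update equations above; a minimal sketch:

    function [W, b] = UpdateParameters(W, b, grad_W, grad_b, learning_rate)
        for t = 1:numel(W)
            W{t} = W{t} - learning_rate * grad_W{t};
            b{t} = b{t} - learning_rate * grad_b{t};
        end
    end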
2.4 Training Loop
When training a neural network using Gradient Descent, there are two options one can
take for the updates, batch or stochastic updates. In batch gradient descent, we compute
the gradient for all the examples in the training set and then average the gradients before
updating the weights. In stochastic, we compute the gradient with one sample, update
the weights with that gradient and carry on with the next sample. Each update is called
a iteration, one complete pass through the entire dataset is called an epoch. In case of
stochastic gradient descent, number of iterations in an epoch equals the number of training
samples.
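For concreteness, here is a sketch of the one-epoch training function asked for in Q2.4.2 below, using stochastic updates and the functions above (shuffling the sample order each epoch is optional but common):

    function [W, b] = Train(W, b, train_data, train_label, learning_rate)
        for i = randperm(size(train_data, 1))
            X = train_data(i, :)';
            Y = train_label(i, :)';
            [~, act_a, act_h] = Forward(W, b, X);
            [grad_W, grad_b] = Backward(W, b, X, Y, act_h, act_a);
            [W, b] = UpdateParameters(W, b, grad_W, grad_b, learning_rate);
        end
    end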
Q2.4.1 Theory [5 points] Give pros and cons for both stochastic and batch gradient
descent. In general, which one is faster to train in terms of number of epochs? Which one
is faster in terms of number of iterations?
Q2.4.2 Code [10 points] Implement
[W, b] = Train(W, b, train_data, train_label, learning_rate), which trains the network for one epoch on train_data and ground-truth train_label. The function should
return the updated network parameters.
2.5 Gradient Checker [25 points]
Often, when implementing new layers in popular packages like Caffe or Torch, you will have to implement the forward and backward pass for that layer. While the forward pass is fairly easy, a lot of things can go wrong in the backward pass. One common way to check whether the backward pass is correct is to write a gradient checker. A good explanation of gradient checking can be found at http://ufldl.stanford.edu/tutorial/supervised/DebuggingGradientChecking/
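The idea, sketched for a single weight and a single sample X with one-hot label Y: compare the analytic gradient from Backward against a central finite difference of the loss. The indices t, i, j below are placeholders for whichever weight you choose to check:

    eps = 1e-4;
    t = 1; i = 1; j = 1;                       % which weight to check
    W_plus = W;  W_plus{t}(i, j)  = W_plus{t}(i, j)  + eps;
    W_minus = W; W_minus{t}(i, j) = W_minus{t}(i, j) - eps;
    loss_plus  = -log(Y' * Forward(W_plus,  b, X));
    loss_minus = -log(Y' * Forward(W_minus, b, X));
    numeric = (loss_plus - loss_minus) / (2 * eps);
    [out, act_a, act_h] = Forward(W, b, X);
    [grad_W, ~] = Backward(W, b, X, Y, act_h, act_a);
    fprintf('numeric %g vs analytic %g\n', numeric, grad_W{t}(i, j));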
Q2.5.1 Code [25 points] Implement a script checkGradient.m that checks gradients of
the loss with respect to a few random weights in each layer. Submit the code.
3 Training Models
We now have all the code required to train and use a deep network. In this section, you will
train and evaluate different models which will then be used in Section 4 for parsing text in
images.
3.1 From Scratch
Often, when you encounter new problems in deep learning, you will need to train a new
network from scratch. This means that you will randomly initialize the weights in some way,
and run multiple iterations of back propagation. The Train.m function you implemented
runs one epoch of training, passing over all data samples once. To fully train a deep network
you typically must do many epochs.
Once a deep network is trained, one way to gain some insight into what it is doing is to
visualize the learned weights. With convolutional neural networks as seen in class, each of
the learned filters can be visualized as an image. This can also be done with a fully connected network, as you have implemented here. Since our input images are 32 × 32 images,
unrolled into one 1024 dimensional vector that gets multiplied by W(1), each row of W(1)
can be seen as a weight image. Reshaping each row into a 32 ×32 image can give us an idea
of what types of images each unit in the hidden layer has a high response to.
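A sketch of that visualization, assuming W{1} is H × 1024 with one row per hidden unit (each tile is rescaled to [0, 1] so montage can display it):

    H = size(W{1}, 1);
    tiles = zeros(32, 32, 1, H);
    for i = 1:H
        w = reshape(W{1}(i, :), 32, 32);
        tiles(:, :, 1, i) = (w - min(w(:))) / (max(w(:)) - min(w(:)));
    end
    montage(tiles);                            % one 32x32 image per hidden unit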
We have provided you three data .mat files to use for this section. The training data in
nist26_train.mat contains 299 samples for each of the 26 upper-case letters of the alphabet. This is the set you should use for training your network. The cross-validation set in nist26_valid.mat contains 100 samples from each class, and should be used in the training
loop to see how the network is performing on data that it is not training on. This will help to spot overfitting. Finally, the test data in nist26_test.mat contains another 100 samples per class, and should be used for the final evaluation on your best model to see how well it will generalize to new, unseen data.

Figure 3: Sample weights from a network with 50 hidden units trained for 30 epochs. Your weights will likely look very different, as you are training with more hidden units.
Q3.1.1 Code [5 points] Use the provided train26.m to train a network from scratch.
Use a single hidden layer with 400 hidden units, and train for at least 30 epochs. Modify
the script to generate two plots: one showing the accuracy on both the training and
validation set over the epochs, and the other showing the cross-entropy loss. The x-axis
should represent the epoch number, while the y-axis represents the accuracy or loss. With
these settings, you should see an accuracy on the validation set of at least 75%.
Q3.1.2 Writeup [5 points] Use your modified training script to train two networks,
one with learning rate 0.01, and another with learning rate 0.001. Include all 4 plots in
your writeup. Comment on how the learning rates affect the training, and report the final
accuracy of the best network on the test set.
Q3.1.3 Writeup [5 points] Using the best network from the previous question, report
the accuracy and cross-entropy loss on the test set, and visualize the first layer weights that
your network learned (using reshape and montage). Compare these to the network weights
immediately after initialization. Include both visualizations in your writeup. Comment on
the learned weights. Do you notice any patterns?
Q3.1.4 Writeup [5 points] Visualize the confusion matrix for your best model as a
26 × 26 image (upscale the image so we can actually see it). Comment on the top two pairs
of classes that are most commonly confused.
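One way to build and display the matrix, assuming Classify returns one softmax row per sample; test_data and test_labels are hypothetical variable names:

    outputs = Classify(W, b, test_data);       % D x C softmax outputs
    [~, pred]  = max(outputs, [], 2);
    [~, truth] = max(test_labels, [], 2);
    conf = zeros(26, 26);                      % rows: true class, cols: predicted
    for i = 1:numel(pred)
        conf(truth(i), pred(i)) = conf(truth(i), pred(i)) + 1;
    end
    imagesc(conf); axis square; colorbar;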
Figure 4: Learned filters from AlexNet trained on the ImageNet dataset. AlexNet is a
convolutional neural network whose architecture and weights are often used as the basis for
new networks.
3.2 Fine Tuning
When training from scratch, a lot of epochs are often needed to learn anything meaningful.
One way to avoid this is to instead initialize the weights more intelligently. Various strategies have been used for initializing neural networks, such as unsupervised pretraining with
Auto-Encoders or Restricted Boltzmann Machines.
However, thanks to the explosion in popularity of deep learning for computer vision, it is often possible to instead initialize a network with weights from another deep network that was trained for a different purpose. This is because, whether we are doing image classification, segmentation, recognition, etc., most real images share common properties. Simply copying the weights from the other network to yours gives your network a head start, so your network does not need to learn these common weights from scratch all over again. This is also referred to as fine-tuning.
We have trained a network for recognizing capital letters using 800 hidden units, and
trained it for 60 epochs. The network parameters are stored in nist26_model_60iters.mat.
Using these weights, you will initialize and fine-tune a network for recognizing both capital
letters as well as numeric digits.
Q3.2.1 Code/Writeup [10 points] Make a copy of train26.m and name it finetune36.m.
Modify this script to load the data from nist36_*.mat, and train a network to classify both
written letters and numbers. Fine-tune (train) this network for 5 epochs with learning rate 0.01, and include plots of the accuracy and cross-entropy loss in your writeup.
Q3.2.2 Writeup [5 points] Once again, visualize the network’s first layer weights before
and after training. Comment on the differences you see. Also report the network’s accuracy
and loss on the test set.
Q3.2.3 Writeup [5 points] Visualize the confusion matrix for your best model as a
36 × 36 image (upscale the image so we can actually see it). Comment on the top two pairs
of classes that are most commonly confused. How has introducing more classes affected the
network?
4 Extract Text from Images
Figure 5: Sample image with handwritten characters annotated with boxes around each
character.
Now that you have a network that can recognize handwritten letters with reasonable accuracy, you can use it to parse the text in an image. Given an image with some text on it, our goal is to have a function that returns the actual text in the image. However, since your neural network expects a binary image with a single character, you will need to process the input image to extract each character. There are various approaches you can take, so feel free to use any strategy you like.
Here we outline one possible method (a rough MATLAB sketch of the first two steps follows the list):
1. Process the image (blur, threshold, etc...) to classify all pixels as being part of a
character or background.
2. Find connected groups of character pixels (see bwconncomp). Place a bounding box
around each connected component.
3. Group the letters based on which line of the text they are a part of, and sort each
group so that the letters are in the order they appear on the page.
4. Take each bounding box one at a time and resize it to 32 × 32, classify it with your
network, and report the characters in order (inserting spaces when it makes sense).
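A rough sketch of steps 1–2, with placeholder parameter values you would need to tune (the area threshold, in particular, is a guess):

    im_gray = rgb2gray(im);
    bw = im2bw(im_gray, graythresh(im_gray));  % 1 = background, 0 = ink
    cc = bwconncomp(~bw);                      % connected groups of character pixels
    stats = regionprops(cc, 'BoundingBox', 'Area');
    boxes = [];
    for i = 1:numel(stats)
        if stats(i).Area > 100                 % drop small noise components
            bb = stats(i).BoundingBox;         % [x y width height]
            boxes = [boxes; bb(1), bb(2), bb(1) + bb(3), bb(2) + bb(4)];
        end
    end
    % Steps 3-4 (grouping boxes into lines, then classifying each crop) are
    % left out; clustering on box vertical centers is one way to form lines.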
Since the network you trained likely does not have perfect accuracy, you can expect there
to be some errors in your final text parsing. Whichever method you choose to implement
for the character detection, you should be able to place a box on most of the characters in
the image. We have provided you with 01_list.jpg, 02_letters.jpg, 03_haiku.jpg, and 04_deep.jpg to test your implementation on.
Q4.1 Theory [5 points] The method outlined above is pretty simplistic, and makes
several assumptions. What are two big assumptions that the sample method makes? In
your writeup, include two example images where you expect the character detection to fail
(either miss valid letters, or respond to non-letters).
Q4.2 Code [25 points] Implement [lines, bw] = findLetters(im). Given an RGB
image, this function should return a cell array lines containing all of the located handwritten characters in the image, as well as a binary black-and-white version of the image im.
The lines cell array should contain one entry for each line of text that appears in the image. Each entry should be a matrix of size Li × 4, where Li represents the number of characters in the i-th line of text. Each row of the matrix should contain [x1, y1, x2, y2],
the positions of the top-left and bottom-right corners of the box. The processed image bw
should be only black and white (containing pixel values 0.0 and 1.0), with the characters in
black and the rest of the image in white. Also include a test script testFindLetters.m that
runs your function on all of the provided images (and any others you decided to include),
and displays a figure showing the bounding boxes around each letter.
Q4.3 Writeup [5 points] Run findLetters(..) on all of the provided sample images
in images/. Plot all of the located boxes on top of the image to show the accuracy of your
findLetters(..) function. Include all the result images in your writeup.
Q4.4 Code [15 points] Implement [text] = extractImageText(fname). This function
should accept an image path fname, and return the text contained in the image. This
function will load the image, find the character locations, classify each one with the network
you trained in Q3.2.1, and return the text contained in the image. Also include a test script
testExtractImageText.m that runs your function on all of the provided images (and any
others you decided to include), and outputs the retrieved text for each image.
Q4.5 Writeup [5 points] Run your extractImageText(..) on all of the provided
sample images in images/. Include the extracted text in your writeup.
5 Submission
5.1 Handin Zip Structure
– code/
– Backward.m (30 points)
– checkGradient.m (25 points)
– Classify.m (5 points)
– ComputeAccuracyAndLoss.m (10 points)
– extractImageText.m (10 points)
– findLetters.m (20 points)
– finetune36.m (5 points)
– Forward.m (10 points)
– InitializeNetwork.m (5 points)
– Train.m (10 points)
– testFindLetters.m (5 points)
– testExtractImageText.m (5 points)
– train26.m (5 points)
– UpdateParameters.m (5 points)
– Any other code or helper functions you used
5.2 Writeup Summary
Q1.1.1 Theory [5 points] ReLU vs. sigmoid.
Q1.1.2 Theory [5 points] Linear activation function.
Q2.1.1 Theory [5 points] Zero or constant initialization.
Q2.1.3 Writeup [5 points] Describe initialization strategy.
Q2.4.1 Theory [5 points] Batch vs. stochastic gradient descent.
Q3.1.2 Writeup [5 points] Training plots and comments on nist26 data.
Q3.1.3 Writeup [5 points] Network test set performance and weights visualization.
Q3.1.4 Writeup [5 points] Confusion matrix image visualization and comments.
Q3.2.1 Writeup [5 points] Training plot for fine-tuning.
Q3.2.2 Writeup [5 points] Fine-tuned weights visualization.
Q3.2.3 Writeup [5 points] Confusion matrix image for fine-tuned network.
Q4.1 Theory [5 points] Failure cases for sample character detection algorithm.
Q4.3 Writeup [5 points] Character detection results on each of the images in images/.
Q4.5 Writeup [5 points] Extracted text from each of the images in images/.
References
[1] P. J. Grother. NIST Special Database 19: Handprinted forms and characters database. https://www.nist.gov/srd/nist-special-database-19, 1995.

[2] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.