## Description

1 Overview

1.1 Introduction

In the last assignment, you used Bayesian networks to model real-world genetic inheritance networks. Your rival claims that this application to genetic inheritance underscores the limited applicability of graphical models, because one doesn't often find problems with network structures that clear.

To prove him wrong, you decide to apply the graphical model framework to the task of optical character recognition (OCR), a problem that is considerably messier than that of genetic inheritance. Your goal is to accept as input an image of text and output the text content itself.

The real-world applications of OCR are endless. Some examples include:

1. The Google Books project has scanned millions of printed books from libraries around the country. Using OCR, these book scans can be converted to text files, making them available for searching and downloading as eBooks.

2. There has long been research on OCR applied to handwritten documents. This research has been so successful that the US Postal Service can use OCR to pre-sort mail (based on the handwritten address on each envelope), and many ATMs now automatically read the checks you deposit so the funds are available sooner without the need for human intervention.

3. Research on OCR for real-world photos has made it possible for visually impaired individuals to read the text of street and building signs with the assistance of only a small camera. The camera captures an image of the sign, runs OCR on that image, and then uses text-to-speech synthesis to read the contents of the sign.

In this assignment, we will give you sets of images corresponding to handwritten characters in a word. Your task is to build a graphical model to recognize the character in each image as accurately as possible. This assignment is based on an assignment developed by Professor Andrew McCallum, Sameer Singh, and Michael Wick, from the University of Massachusetts, Amherst.

The full OCR system will be split across two assignments. In this one, you will construct a Markov network with a variety of different factors to gain familiarity with undirected graphical models and conditional random fields (CRFs) in particular. We provide code that will perform inference in the network you construct, so you can find the best character assignment for every image and evaluate the performance of the network. In Programming Assignment 7, you will extend the network to incorporate more features, use an inference engine that you built yourself, and learn the optimal factor weights from data.

CS228 Programming Assignment #3

1.2 Markov Networks for OCR

Suppose you are given n total character images (corresponding to a single word of length n). There are two sets of variables in the model we will build: I_i and C_i for i = 1, . . . , n. I_i is a vector-valued variable that corresponds to the actual pixel values of the ith image, and C_i is the character assignment to the ith image. Let K be the total number of possible character values. For this assignment, we only use lower case English letters, so K = 26. For example, C_3 = 5 means that the third character image contains the character 'e'.

It is important to note that:

• The I_i variables are always observed, while the C_i variables are never observed.

• We want to find the assignment to the C_i variables that correctly describes the images in the I_i variables.

In this assignment, we will use a Markov network to model the distribution over the C variables, given the images I. This is one example of a possible Markov network over an observed word of length 5:

In this example, we have factors φ^C_i(C_i, I_i) that represent how likely a given image is a particular character, and factors φ^P_i(C_i, C_{i+1}) that represent the interactions between adjacent pairs of characters. Given the images I, we can come up with a prediction C* by running MAP inference on the network to find the assignment C* = argmax_C P̃(C, I). For this particular example, we would hope that the factors we have chosen allow us to recover the word queue as the MAP assignment C*.
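To make the MAP objective concrete, here is a tiny brute-force Python sketch. The alphabet, factor values, and word length here are all made up for illustration; in the assignment, K = 26 and the provided engine performs inference, not enumeration.

```python
# Toy MAP inference by enumeration over a hypothetical 3-letter alphabet.
from itertools import product

ALPHABET = ["a", "q", "u"]  # illustrative stand-in for the 26-letter alphabet

# Singleton scores phi^C_i(C_i, I_i) for a 2-character word (images are observed,
# so each factor reduces to a table over the character variable alone):
phi_C = [
    {"a": 1.0, "q": 4.0, "u": 1.0},   # image 1 strongly resembles 'q'
    {"a": 2.0, "q": 0.5, "u": 2.0},   # image 2 is ambiguous between 'a' and 'u'
]
# Pairwise scores phi^P(C_1, C_2) that favor the pair 'qu':
phi_P = {(c1, c2): (10.0 if (c1, c2) == ("q", "u") else 1.0)
         for c1 in ALPHABET for c2 in ALPHABET}

def unnormalized_score(assignment):
    """P~(C, I): the product of all factor values for one assignment."""
    c1, c2 = assignment
    return phi_C[0][c1] * phi_C[1][c2] * phi_P[(c1, c2)]

# C* = argmax_C P~(C, I): enumerate every assignment and keep the best.
c_star = max(product(ALPHABET, repeat=2), key=unnormalized_score)
print(c_star)  # the pairwise factor resolves the 'a'/'u' ambiguity to 'qu'
```

Note that the second singleton factor alone cannot choose between 'a' and 'u'; the pairwise factor is what tips the MAP assignment toward 'qu'.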

Even with just these two types of factors, we can see that Markov networks allow us to use powerful models that not only take into account which character each image resembles, but also higher-order dependencies between characters (e.g., a 'q' is likely to be followed by a 'u'). We will elaborate more on these factors and explore alternative Markov networks later in this assignment.

Since I is always observed, Markov networks like these can be seen as modeling the conditional distribution P(C|I) in lieu of the joint distribution P(C, I). They are therefore also known as conditional random fields (CRFs).


2 The Basic OCR Network

2.1 Provided Data

We provide you with all the data you need in the three files PA3Data.mat, PA3Models.mat, and PA3SampleCases.mat. PA3Data.mat contains the following data structure:

• allWords: This is a cell array of every word from the dataset. In particular, allWords{i} returns a struct array containing the data for the ith word in the dataset. That struct array has one entry per character in the word, each with the following values:

  – groundTruth: A value 1 through 26 that provides the true value for that character.

  – img: The image pixel values for that character. It is a 16×8 matrix, where each entry is a pixel: 1 for black, 0 for white.

PA3Models.mat contains:

• imageModel: This is a struct containing the following values:

  – K: The size of the alphabet, i.e., the number of possible values for each C_i. Set to 26.

  – params: The parameters used to calculate singleton factors (see next section).

• pairwiseModel: This is a 26 × 26 matrix that will be used to construct pairwise factors later on in the assignment.

• tripletList: This is a struct array of 2000 triplets, used to construct triplet factors later on.

Lastly, PA3SampleCases.mat contains the input and correct output for each of the 7 sample cases that we test for in the submit script.

2.2 Singleton OCR Features

Before you begin to add more complex features to the model, it is important to establish a working baseline that is as simple as possible. That way, you can meaningfully evaluate the effectiveness of every change and improvement you make.

Your baseline will be a model that contains only the factors for singleton OCR features, φ^C. There are n such factors, one for every character image. The scope of φ^C_i, the factor for the ith image, is {C_i, I_i}; we call them singleton factors because I_i is always observed, so these factors essentially operate on single characters. This model looks like:

Using logistic regression on a separate dataset of OCR images, we have pretrained a model for you to compute a score for each image/character pair. (Don't worry if you haven't heard about logistic regression: the important part is that we have a function that scores the match between each individual image/character pair.) This model is implemented by ComputeImageFactor(image, imageModel), which returns a vector of K values. The ith entry in that vector is the score for character i on the provided image. You should write code to use this function to create the OCR factors φ^C. In particular, you should implement:

• ComputeSingletonFactors.m: This function takes a list of images and the provided image model. You should fill out the code to generate one factor for every image provided. The factor values should be given by the vector returned from ComputeImageFactor on each image.
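The construction can be sketched as follows. This is an illustrative Python sketch, not the MATLAB solution: the factor representation (a dict with var/card/val fields) mirrors the course's factor struct, and compute_image_factor is a dummy stand-in for the provided ComputeImageFactor.

```python
# Illustrative sketch of building one singleton factor per image.
K = 26  # alphabet size

def compute_image_factor(image, image_model):
    """Dummy stand-in scorer: returns a length-K vector of scores for one image.
    The real ComputeImageFactor uses the pretrained logistic-regression params."""
    return [float(sum(image) + c) for c in range(K)]

def compute_singleton_factors(images, image_model):
    """One factor per image: scope {C_i}, values given by the image's scores."""
    factors = []
    for i, img in enumerate(images):
        factors.append({
            "var": [i],      # the character variable C_i
            "card": [K],     # C_i takes K = 26 values
            "val": compute_image_factor(img, image_model),
        })
    return factors

images = [(0, 1, 0), (1, 1, 1)]  # toy stand-ins for 16x8 binary pixel matrices
factors = compute_singleton_factors(images, None)
print(len(factors))  # one factor per image
```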

2.3 Performing Inference in the OCR Network

The Markov network which using only the singleton factors from the previous section is an

entirely valid, though simplistic, Markov network. To establish our baseline performance metrics,

we need to perform inference in the network. In particular, we need to compute values for

C1, . . . , Cn that maximize their joint probability. Since we only have singleton factors right

now, this is straightforward – in fact, adjacent characters have no eect on each other, so it

suces to nd the MAP assignment to each of C1, . . . , Cn independently. But to prepare for the

introduction of more complex factors in the latter part of this assignment, we will use a more

general mechanism.

In the next programming assignment, it will be up to you to implement efficient inference in a Markov network. For this assignment, however, we have provided you the function RunInference.m, which calls a binary (C++) implementation of an inference engine to find the MAP assignment given a set of factors. We have also provided the BuildOCRNetwork.m function, which strings together all of the factors for a single word. Since you only have singleton factors right now, you can ignore the pairwiseModel and tripletList parameters for now.

Fill in the following function:

• ComputeWordPredictions.m: This function accepts the cell array of word images (i.e., allWords) and returns a cell array such that wordPredictions{i} is an array of the predicted (MAP) assignments to the ith word's characters. For example, if I believe the first word is 'cat', then I have wordPredictions{1} = [3 1 20]. The RunInference.m function will be very useful here.

ComputeWordPredictions is used in conjunction with the ScorePredictions function, which we have provided. Given the cell array allWords and the predictions computed above, this function returns the character and word level accuracies. The character level accuracy is simply the number of correctly identified characters divided by the total number of characters. The word accuracy is the number of words in which every character is correctly identified divided by the total number of words.
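The two metrics can be written down in a few lines. This is a minimal Python sketch of the definitions above (ScorePredictions itself is provided in MATLAB); words are represented as lists of 1–26 character codes.

```python
# Character-level and word-level accuracy, as defined in the text.
def score_predictions(ground_truth, predictions):
    """Both arguments: lists of words, each word a list of 1..26 codes."""
    total_chars = correct_chars = correct_words = 0
    for truth, pred in zip(ground_truth, predictions):
        matches = sum(t == p for t, p in zip(truth, pred))
        total_chars += len(truth)
        correct_chars += matches
        correct_words += (matches == len(truth))  # word counts only if fully correct
    return correct_chars / total_chars, correct_words / len(ground_truth)

# 'cat' predicted perfectly; a second word gets one character wrong:
truth = [[3, 1, 20], [4, 15, 7]]
pred  = [[3, 1, 20], [4, 15, 8]]
char_acc, word_acc = score_predictions(truth, pred)
print(char_acc, word_acc)  # 5 of 6 characters right, 1 of 2 words right
```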

You should now be able to run:

> singletonPredictions = ComputeWordPredictions(allWords, imageModel, pairwiseModel, tripletList);
> [charAcc, wordAcc] = ScorePredictions(allWords, singletonPredictions);

to compute the character and word accuracies of the singleton model. You should see values of 76.7% and 22.0%, respectively.


3 Extending the Network

You now have a model that performs reasonably well on the OCR task, but there is a good deal of room for improvement. One issue with the singleton OCR factors is that they do not consider any interactions between multiple character assignments. This is, after all, the point of the Markov network: it gives us a structured way to represent these interactions. We will consider two sorts of interactions. The first are linguistic interactions that we will call the pairwise and triplet factors, φ^P and φ^T. The second are image interactions that we will call similarity factors, φ^S.

3.1 Pairwise Factors

We begin with the pairwise factors. The intuition behind these factors is as follows. Suppose I am a sloppy writer, and my u's are so round that they look the same as a's. In isolation, then, it can be difficult for the OCR factors to distinguish between these two characters. But suppose that the OCR factor for the previous character assigns a very high score to the letter 'q'; i.e., we are fairly certain that the previous letter is a 'q'. In that case, the 'u' vs. 'a' ambiguity disappears. The odds of seeing 'qu' in English text are so much higher than seeing 'qa' that I can be nearly certain that the letter is a 'u' even if the OCR factor assigns them nearly equal scores.

More formally, suppose we have a word with n characters, C_1, . . . , C_n. We will introduce n−1 factors, φ^P_i(C_i, C_{i+1}) for i = 1, . . . , n − 1, giving us the first model we saw in the assignment:

What values should we use for these factors? We consider two possibilities. First, implement:

• ComputeEqualPairwiseFactors.m: This function (and the following one) accepts an array of the images in one word and the value K, the size of the alphabet (for our purposes, it is 26). You should compute the n−1 pairwise factors described above. For this function, you should assign a value of 1 to every single possible factor value.

Extend BuildOCRNetwork.m to include these factors as well as the singleton OCR factors from before. Use ComputeWordPredictions and ScorePredictions to obtain the MAP assignment and compute character and word accuracies as you did before. What values do you get? Why is this?

That didn't help very much. Instead of having uniform factors, we want to assign higher probabilities to common pairs of letters, and lower probabilities to rare ones. In PA3Models.mat, we have provided the pairwiseModel matrix, where pairwiseModel(i,j) is proportional to the probability that the jth letter appears after seeing the ith letter (these probabilities were extracted from a separate corpus of English text). With this, implement:


• ComputePairwiseFactors.m: This function is just like ComputeEqualPairwiseFactors, but it also uses the pairwiseModel we provided. For the factor values, an assignment of character i to the first character and character j to the second should have a score of pairwiseModel(i,j). Hint: Every pairwise factor in the graph will have the same values, so you should consider building the val field once at the beginning and then assigning it to each pair.

Modify BuildOCRNetwork.m to use the original singleton factors and these new, improved pairwise factors. Compute the character and word accuracies as before. You should get values of 79.2% and 26.0%.
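The hint about sharing the val field can be sketched as follows. This is an illustrative Python sketch (the assignment is MATLAB): the dummy pairwise_model stands in for the provided 26×26 pairwiseModel matrix, and the flattened val ordering (first variable cycling fastest) is the convention assumed here.

```python
# Sketch of pairwise-factor construction with a shared val table.
K = 26

# Dummy scores standing in for pairwiseModel, which is proportional to
# P(next letter = j | current letter = i):
pairwise_model = [[1.0] * K for _ in range(K)]
pairwise_model[16][20] = 50.0  # make 'q' (17th letter) -> 'u' (21st) common

def compute_pairwise_factors(n_chars, pairwise_model):
    """n-1 factors over (C_i, C_{i+1}); every factor reuses the same val."""
    # Per the hint: build the flattened val once, with the first variable
    # cycling fastest, then assign it to each adjacent pair.
    val = [pairwise_model[i][j] for j in range(K) for i in range(K)]
    return [{"var": [i, i + 1], "card": [K, K], "val": val}
            for i in range(n_chars - 1)]

factors = compute_pairwise_factors(5, pairwise_model)
print(len(factors))  # a 5-character word yields 4 pairwise factors
```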

3.2 Triplet Factors

Triplet factors serve a similar purpose as pairwise factors: we want to encourage common triplets,

such as ing, while discouraging uncommon ones. Adding in triplet factors to our model results

in a network like this:

Formally speaking, we will introduce an additional set of n−2 factors where the ith factor is φ^T_i(C_i, C_{i+1}, C_{i+2}) for i = 1, . . . , n − 2. Note that edges in a Markov network do not correspond directly to factors, in the sense that there are no hyperedges that connect triplets of nodes at once. Instead, our set of factors needs to factorize over the Markov network, which means that the scope of each factor needs to be a clique in the Markov network. As such, we have added extra edges between every C_i and C_{i+2} in the above network.

There are 26^3 = 17,576 possible states in a factor over triplets, making this factor large and unwieldy. Instead of assigning a different value to each of them, we will initialize all values to 1, and only change the values of the top 2000 triplets. We will set φ^T_i(C_i = a, C_{i+1} = b, C_{i+2} = c) to be proportional to the probability that we see the letter c after seeing a and then b in succession, and normalize the factor such that the value of the 2000th most common triplet is 1. These normalized frequencies can be found in tripletList (in PA3Models.mat).
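The sparse fill-in can be sketched like this. This is an illustrative Python sketch: the tiny triplet_list here (tuples of 1-based letter codes plus a frequency) is a hypothetical stand-in for the provided 2000-entry struct array, with the rarest listed triplet playing the role of the 2000th most common one.

```python
# Sketch of sparse triplet-factor values: default 1, top triplets normalized
# so the last (rarest) listed triplet gets value 1.
K = 26

triplet_list = [
    (9, 14, 7, 0.012),   # 'i','n','g'
    (20, 8, 5, 0.010),   # 't','h','e'
    (1, 14, 4, 0.008),   # 'a','n','d' (rarest listed; normalizes to 1)
]

def triplet_val(triplet_list):
    """Flattened val over 26^3 states, first variable cycling fastest."""
    val = [1.0] * (K ** 3)
    min_freq = min(f for (_, _, _, f) in triplet_list)
    for a, b, c, f in triplet_list:
        # Convert the 1-based assignment (a, b, c) to a linear index.
        idx = (a - 1) + (b - 1) * K + (c - 1) * K * K
        val[idx] = f / min_freq
    return val

val = triplet_val(triplet_list)
print(len(val))  # 17576 states, almost all of them still 1
```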

There are two reasons for only selecting the top 2000. Firstly, having sparse factors like this makes inference faster (if the inference engine exploits it); secondly, rare triplets (e.g., zxg) are unlikely to be accurately represented in the corpus of English words that we extracted the triplet frequencies from, so we would prefer not to rely on their measured frequencies (in particular, rare triplets are likely to be several orders of magnitude less frequent than their common counterparts, which can severely throw off the graphical model in some scenarios). In Programming Assignment 7, we will also investigate how this kind of sparsity can help us gain better performance by reducing the extent of overfitting when learning parameters from data.

Fill in the following function:


• ComputeTripletFactors.m: This function is just like ComputePairwiseFactors, except that it uses tripletList to compute factors over triplets.

Modify BuildOCRNetwork appropriately, and compute the character and word accuracies as before. You should get values of 80.0% and 34.0%. You will also notice an appreciable slow-down in the time it takes to calculate the factors and perform inference.

3.3 Image Similarity Factors

In the previous section, we exploited our knowledge about English text and relationships between adjacent characters. In this section, we will look at longer range factors based on relationships between character images. In particular, we would like to exploit the following insight: two images that look similar should be likely to be assigned the same character value, even if we can't say what that character value should be.

Formally, suppose we choose two characters i and j (i ≠ j) from a word with images I_i and I_j. We will define a similarity factor φ^S_ij(C_i, C_j) over their character assignments. We want to select factor values such that the following is true: If I_i and I_j are similar (in a way we must specify), then φ^S_ij(C_i, C_j) will take on a large value for C_i = C_j and a small value for C_i ≠ C_j. If I_i and I_j are not similar, then the reverse is true: φ^S_ij(C_i, C_j) should take on a larger value when C_i ≠ C_j.

We provide a function ImageSimilarity.m that computes the "similarity score" between two images. In our basic implementation, ImageSimilarity(I_i, I_j) returns 5 if the cosine distance between the two images I_i and I_j is more than a preset threshold, and 1 if it is not. We set φ^S_ij(C_i = C_j) = ImageSimilarity(I_i, I_j), and φ^S_ij(C_i ≠ C_j) = 1. Feel free to experiment with more advanced image similarity features to maximize performance!
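One way this factor could look, sketched in Python: image_similarity is a simplified stand-in for ImageSimilarity.m (a cosine comparison with a made-up threshold), images are flattened to toy vectors, and the val ordering (first variable fastest) is the convention assumed here.

```python
# Sketch of one similarity factor between images i and j.
import math

K = 26

def image_similarity(img_i, img_j, threshold=0.8):
    """Return 5 if the two binary image vectors are similar enough, else 1."""
    dot = sum(a * b for a, b in zip(img_i, img_j))
    norm = math.sqrt(sum(a * a for a in img_i)) * math.sqrt(sum(b * b for b in img_j))
    cosine_sim = dot / norm if norm else 0.0
    return 5 if cosine_sim > threshold else 1

def compute_similarity_factor(images, i, j):
    """Factor over (C_i, C_j): similarity score when C_i = C_j, 1 otherwise."""
    s = image_similarity(images[i], images[j])
    val = [s if ci == cj else 1 for cj in range(K) for ci in range(K)]
    return {"var": [i, j], "card": [K, K], "val": val}

# Toy flattened "images" standing in for 16x8 pixel matrices:
images = [(1, 1, 0, 0), (1, 1, 0, 1), (0, 0, 1, 1)]
f = compute_similarity_factor(images, 0, 1)
print(f["val"][0], f["val"][1])  # diagonal entry (C_i = C_j) vs. off-diagonal
```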

Given this, you should implement the following functions:

• ComputeSimilarityFactor.m: This function accepts a list of all the images in a word and two indices, i and j. It should return one factor: the similarity factor between the ith and jth images.

• ComputeAllSimilarityFactors.m: This function should compute a list of every similarity factor for the images in a given word. That is, you should use ComputeSimilarityFactor for every i, j pair (i ≠ j) in the word and return the resulting list of factors.

At your own risk, use BuildOCRNetwork to build a network using all the similarity factors. We recommend that you save your files first, and only experiment on words that are no more than 5 characters in length. (The first word in the dataset has 9 characters, for a full joint distribution of more than 5 trillion numbers. If you are feeling brave and adventurous, try running this code on the dataset, and look at inf.log to see how much memory the inference engine estimates it will need!)

Clearly, adding every similarity factor is a bad idea from a computational point of view: inference takes forever, and we're only running inference on single words. Why is this the case? Think about what the network would have to look like (and which edges we would need to draw) to support a similarity factor between every pair of variables.

This does not mean, however, that the similarity factors are a bad idea in general; we simply need to be more selective in adding them to the model. To do so, you should implement the final function:


• ChooseTopSimilarityFactors.m: This function should take in an array of all the similarity factors and a parameter F, and return the top F factors based on their similarity score. Ties may be broken arbitrarily.
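The selection step is a simple sort-and-slice. In this illustrative Python sketch (the assignment is MATLAB), each factor is a dict, and we assume its similarity score is the value it assigns when C_i = C_j, i.e. the first entry of its val table; how the score is stored in the real factor struct may differ.

```python
# Sketch of keeping only the F highest-scoring similarity factors.
def choose_top_similarity_factors(factors, F):
    """Return the F factors with the highest similarity score; ties arbitrary."""
    return sorted(factors, key=lambda f: f["val"][0], reverse=True)[:F]

# Toy factors whose first val entry doubles as the similarity score:
factors = [
    {"var": [0, 1], "val": [5, 1]},
    {"var": [0, 2], "val": [1, 1]},
    {"var": [1, 2], "val": [5, 1]},
]
top = choose_top_similarity_factors(factors, 2)
print([f["var"] for f in top])  # the two factors with score 5 survive
```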

Modify BuildOCRNetwork appropriately, choosing only the top 2 similarity factors, and compute the character and word accuracies as before. You should get values of 81.9% and 37.0% – a significant improvement over our starting word accuracy of 22.0%! Play around with the similarity factors and triplet factors – can you further improve on this level of accuracy?

4 Summary

Hooray! In this assignment, you have successfully used Markov networks (and more specically,

conditional random elds) to do optical character recognition. We saw how Markov networks

allow us to model interactions between characters, leveraging linguistic models (e.g., frequencies

of pairs and triplets of characters) as well as long-range image similarities for better performance.

These are examples of how probabilistic graphical models can capture complex interactions in a

natural way. We also explored the trade-os between complexity, accuracy, and computational

eciency, themes that we will return to repeatedly over the rest of the course.