## Description

1 Overview

In Programming Assignment 3, you learned how to construct a Markov network to perform the

task of optical character recognition (OCR). The values in each factor, however, were provided

for you. This assignment returns to the same OCR task, but this time you will write the code

to learn the factor values automatically from data.

This assignment is divided into two parts. In the rst part, we use a small (but real) parameter

learning task to introduce a handful of key concepts applicable to nearly any machine learning

problem:

• Stochastic gradient descent for parameter learning

• Training, validation, and test data

• Regularization

In the second part of the assignment, you will apply these techniques to the OCR task. In

particular, you will construct a Markov network similar to the one from PA3 and use stochastic

gradient descent to learn the parameters of the network from a set of training data.

Be sure to get a head start on this assignment! The rst part introduces a set of new concepts,

and the second is rather intricate. For the adventurous, we have included an extra credit section,

where you will be graded based on how well your OCR system performs. Good luck!

2 Machine Learning Primer

2.1 Stochastic Gradient Descent For Logistic Regression1

Consider the following task: given a dataset of images of either the letter p or q and the corresponding labels for the images, we wish to build a classier that will allow us to predict the

labels of new unseen images as accurately as possible. More concretely, we have a dataset in

which each data sample consists of 129 features (16 × 8 pixels + 1 bias term), X =X1, . . . , X129,

derived from the images, and the corresponding labels Y for the image (0 if it was an image of

a q, 1 if it was an image of a p).

We will learn a model of the conditional probability P(Y |X) using maximum likelihood,

where we use a logistic function for the conditional probability, P(Y |X) = 1

1+exp(−θT X)

; the

logistic model can be considered to be a simple conditional random eld. Given a set of M

training examples, D ={(x[m], y[m])}M

m=1, we want to nd the θ

∗

that maximizes the likelihood

of the observed data:

1

In this section, we have deliberately glossed over many details as these are out of the scope of the class. If

you are interested in a more complete treatment of the material in this section, you may nd it helpful to review

the lecture notes for CS229 (Machine Learning), or to take EE364a (Convex Optimization I).

CS228 Programming Assignment #7 2

θ

∗ = argmax

θ

L(θ : D) = argmax

θ

QM

m=1 P(y[m]|x[m]; θ)

Equivalently, we can minimize the negative log likelihood:

θ

∗ = argmin

θ

− l(θ : D)

= argmin

θ

PM

m=1 − log P(y[m]|x[m]; θ)

We will use stochastic gradient descent to nd the parameter values θ

∗

. However, we will rst

revisit the regular (batch) gradient descent algorithm before describing the stochastic variant.

The key intuition behind the gradient descent algorithm is that by taking steps in the direction

of the negative gradient of −l(θ : D), which is the direction of steepest descent, we will eventually

converge to a minimizer of −l(θ : D). This intuition is formalized in the update rule for the

parameter vector θ that is used on each iteration of the algorithm

θ := θ − α∇θ[−l(θ : D)]

where α is a parameter which determines the size of the steps we will take in parameter space

(also known as the learning rate).

Given this update rule, the gradient descent algorithm takes an initial assignment to the

parameters θ0, and repeatedly uses the update rule described above to compute new values for

θ, until we have converged (for instance, when the gradient is suciently small). Notice that

this algorithm requires summing up the gradients over every single training example, which is

why it is known as the batch gradient descent algorithm.

In the case where computing the gradient over a single example is computationally expensive

(for instance, in CRF parameter learning), the batch algorithm can take very long to converge,

as it requires the gradients over all training examples to be computed before the algorithm can

perform an update. In such situations, an algorithm that approximates the gradient over the

entire training set by the gradient over a single (or a few) training examples, can be a far more

ecient alternative; such an algorithm is known as a stochastic gradient descent algorithm.

One disadvantage of using such a coarse approximation of the gradient, however, is that while

the algorithm gets θ close to the minimum more quickly, the parameter vector may now oscillate

about the true minimum. We therefore make one further tweak to the algorithm to help mitigate

this behavior, by slowly decrease the learning rate α to zero as the algorithm progresses. The

pseudocode of the stochastic gradient descent algorithm that we will use in this assignment is as

follows:

for k = 1 to max_iterations 2

:

Pick an arbitrary training example (x[m], y[m]), then update

θ := θ − αk∇θ[− log P(y[m]|x[m]; θ)]

where αk =

0.1

1+√

k

.

You should now implement the stochastic gradient descent algorithm:

• StochasticGradientDescent.m (10 Points) This function runs stochastic gradient descent

for the specied number of iterations and returns the value of θ upon termination. It takes

in three inputs:

1. a function handle for a function that computes the gradient of the negative loglikelihood of the data, at a particular θ and for a particular iteration number k. Note:

this function will return the gradient for an arbitrary training example, so you do not

have to explicitly pick the training example in your code for StochasticGradientDescent .

2We will use max_iterations=5000 in this part of the assignment.

CS228 Programming Assignment #7 3

2. an initial value for θ.

3. the number of iterations to run.

To test your implementation, we have provided you with LRTrainSGD.m. Sample data

can be found in Train1X.mat and Train1Y.mat which contain, respectively, the features

and ground truth labels for the training instances, in the right format for plugging into

LRTrainSGD.m. Go ahead and run LRTrainSGD.m on the training data with λ = 0 to get

the maximum likelihood estimates of the parameters. Then use LRPredict.m to get the

predictions for the training data, and LRAccuracy.m to calculate the classication accuracy

(where 1 = perfect accuracy) on the training data. You should obtain an accuracy of 0.96

on the training data.

2.2 Overtting & Regularization

Now try your model out on the provided test data in Test1X.mat and Test1Y.mat. Use

LRPredict.m and LRAccuracy.m to assess the performance of your model on the test set. Unfortunately, you will see that it’s somewhat less than the training set accuracy, which suggests

that the model has overt the training data. This problem of overtting can be addressed using

regularization. Concretely, we will add a new term to the negative log likelihood that penalizes

us for having large values for θ :

θ

∗ = argmin

θ

− l(θ; D) + λ

2

Pn

i=1 θ

2

i

This specic form of regularization is called L2-regularization, since we are penalizing the square

of the parameters (which is the square of their L2-norm). The code we have provided you in

LRTrainSGD.m already does this. But what value of λ should we pick?

One way to search for a good value of λ is by using a validation set. We can train our model

using the training set using various values of λ and then evaluate the models’ accuracies on the

validation set. Then, we can pick the value of λ that yields the best accuracy on the validation

set. Note that we do not use the test data to pick the value for λ, as we wish to use the test

data to obtain an estimate for the generalization performance of our model that is, how well

our model will perform on unseen data. If we were to use the test data to pick λ, our model will

be t to this data, and the model’s performance on this data will no longer be representative of

its performance on unseen data.

You will now implement a function that searches for a good value for λ:

• LRSearchLambdaSGD.m (10 Points) This function takes as inputs:

1. A matrix of features for the training data

2. A vector of ground truth labels for the training data

3. A matrix of features for the validation data

4. A vector of ground truth labels for the validation data

5. A vector of candidate values for the regularization parameter λ.

For each value of λ the function should t a logistic regression model to the training

data using LRTrainSGD.m and then evaluate the accuracy of the resulting model on

the validation data. The output should be a vector of the accuracy in the validation

set for each λ.

CS228 Programming Assignment #7 4

Test your function with Train1X.mat, Train1Y.mat, Validation1X.mat and Validation1Y.mat,

as well as the vector of candidate lambdas in Part1Lambdas.mat. Expected outputs from your

function can be found in ValidationAccuracy.mat. You should see that a little bit of regularization helps performance in the test data!

3 Learning CRF Parameters For OCR

3.1 Introduction

Now that you have the basic tools for parameter learning, gradient descent and regularization, we

can apply those tools to more complex models than logistic regression. For the remainder of this

assignment, you will implement the necessary functions to learn the parameters of a conditional

random eld for optical character recognition (OCR).

In the CRF model, there are two kinds of variables: the hidden variables that we want

to model and those that are always observed. In the case of OCR, we want to model the

character assignments (such as ‘a’ or ‘c’), and we always observe the character images, which are

arrays of pixel values. Typically, the unobserved variables are denoted by Y and the observed

variables are denoted by X. The CRF seeks to model P(Y |X), the conditional distribution over

character assignments given the observed images. We will be working with the OCR model from

programming assignment 3 that includes only the singleton and pairwise factors. The structure

of the model is shown below:

Figure 1: CRF with singleton and pairwise factors.

3.2 The Log-Linear Representation

Up until now, the graphical models you have constructed have focused on the table factor

representation. In this assignment, you will instead build the model around log-linear features.

A feature is just a function fi(Di) : Val(Di) → R, where Di

is the set of variables in the scope of

the ith feature; in this assignment, all features are binary indicator features (taking on a value

of either 0 or 1). Each feature has an associated weight θi

. Given the features {fi}

k

i=1 and the

weights {θi}

k

i=1, the distribution is dened as:

P(Y|x : θ) = 1

Zx(θ)

exp (X

k

i=1

θifi(Di)

)

.

The term Zx(θ) is the partition function:

Zx(θ) ≡

X

Y

exp (X

k

i=1

θifi(Di)

)

.

CS228 Programming Assignment #7 5

Note that computing Zx(θ) requires summing over an exponential number of assignments to

the variables, so we won’t be able to use a simple brute-force technique to compute P(Y|x : θ)

even though the feature values can be computed eciently.

In our CRF, we have three types of features:

• f

C

i,c(Yi), which operates on single characters / hidden states (an indicator for Yi = c).

• f

I

i,j,c,d(Yi

, xij ), which operates on a single character / hidden state and an image pixel

associated with that state (an indicator for Yi = c, xij = d). These are collectively used

to encode the individual probability that Yi = c given xi

.

• f

P

i,c,d(Yi

, Yi+1) which operates on a pair of adjacent characters / hidden states (an indicator

for Yi = c, Yi+1 = d).

3.3 Learning CRF Parameters: Theory

You will learn the parameters of the CRF in the same way you did for logistic regression in

Section 2. You need to write a function that takes a single data instance and computes the

cost of that instance and the gradient of the parameters with respect to the cost. Given such

a function, you can then use stochastic gradient descent to optimize the parameter values.

Recall that our goal is to maximize the log-likelihood of the parameters given the data. Thus

the cost we minimize is simply the negative log-likelihood, together with a L2-regularization

penalty on the parameter values to prevent overtting. The function we seek to minimize is:

nll(x, Y, θ) ≡ log(Zx(θ)) −

X

k

i=1

θifi(Y, x) + λ

2

X

k

i=1

θ

2

i

.

As you saw in lecture, the partial derivatives for this function have an elegant form:

∂

∂θi

nll(x, Y, θ) = Eθ[fi

] − ED[fi

] + λθi

.

In the derivative, there are two expectations: Eθ[fi

], the expectation of feature values with

respect to the model parameters, and ED[fi

], the expectation of the feature values with respect

to the given data instance D ≡ (X, y). Using the denition of expectation, we have:

Eθ[fi

] = X

Y0

P(Y0

|x)fi(Y0

, x),

ED[fi

] = fi(Y, x).

Note that we drop the θ from P(Y0

|x) for convenience.

In the rst equation, we sum over all possible assignments to the Y variables in the scope

of feature fi

. Since each feature has a small number of Y variables in its scope, this sum is

tractable. Unfortunately, computing the conditional probability P(Y0

|x) for each assignment

requires performing inference for the data instance x.

There is one more detail to take care of: shared parameters across multiple features. As

mentioned in Week 2’s lecture on Shared Features in Log-Linear Models, parameter sharing is a

form of templating used to reduce the total number of parameters we need to learn. Let

f

(i)

be the set of features that share parameter θi

. Then we can expand the equations above as:

nll(x, Y, θ) ≡ log(Zx(θ)) −

X

k

i=1

θi

X

fj∈{f

(i)}

fj (Y, x)

+

λ

2

X

k

i=1

θ

2

i

,

CS228 Programming Assignment #7 6

∂

∂θi

nll(x, Y, θ) = X

fj∈{f

(i)}

Eθ[fj ] −

X

fj∈{f

(i)}

ED[fj ] + λθi

.

Note that in this case, θi

is not necessarily related to fi any longer. Instead, θi

is associated

with a set of features

f

(i)

. k is the total number of parameters θi

, and not necessarily the

total number of features fj .

The parameters that we use in the CRF are:

• θ

C

c

, shared by

f

C

i,c(Yi)

i

.

• θ

I

c,d, shared by n

f

I

i,j,c,d(Yi

, xij )

o

i,j

.

• θ

P

c,d, shared by n

f

P

i,c,d(Yi

, Yi+1)

o

i

.

Essentially, this parameter tying scheme ties parameters across dierent locations together; that

is, a character at the start of the word shares the same parameters as a character at the end of

the word.

Given this discussion, we can now state your mission: for a given data instance (x, Y) and

a parameter setting θ, you must compute the cost function (regularized negative log-likelihood)

and the gradient of parameters with respect to that cost. Doing so involves six terms (ignoring

the issue of shared parameters in these):

• The log partition function: log(Zx(θ))

• The weighted feature counts: θifi(Y, x)

• The regularization cost: λ

2

Pk

i=1 θ

2

i

• The model expected feature counts: P

Y0 P(Y0

|x)fi(Y0

, x),

• The data feature counts: fi(Y, x)

• The regularization gradient term: λθi

Take note that you will have to incorporate parameter sharing into these terms.

Unlike previous assignments, how you structure your code is almost entirely up to you. We

will test only two functions:

1. CliqueTreeCalibrate.m: You should modify this function (from PA 4) to also compute

log(Zx(θ)).

2. InstanceNegLogLikelihood.m: You should ll out this function to compute the regularized negative log-likelihood and its gradient for a single data instance (x, Y).

But we are getting ahead of ourselves…

3.4 Learning CRF Parameters: Implementation

At a high level, the entirety of the assignment is to implement InstanceNegLogLikelihood.m.

Make sure to read the comments in the le: it contains a helpful discussion of the data structures

you are using. The features above might look very complicated, but in InstanceNegLogLikelihood

we’ve provided you with a function that generates all of the required features and stores them

in featureSet.

The steps you need to take in this function are:

CS228 Programming Assignment #7 7

1. Convert the featureSet into a clique tree with its factors lled in. This tree should be a

chain, so the rst clique has scope {Y1, Y2}, the second has scope {Y2, Y3}, and so on.

2. Modify CliqueTreeCalibrate.m and then call it to calibrate the tree and simultaneously

compute log(Zx(θ)).

3. Use the calibrated clique tree in conjunction with featureSet to compute the weighted

feature counts, the data feature counts, and the model expected feature counts.

4. Use these counts to compute the regularized negative log-likelihood and gradient.

We recommend that you do not try to compute all of these within InstanceNegLogLikelihood

itself. For reference, our solution denes six helper functions. We ask that you dene helper

functions within InstanceNegLogLikelihood.m so that we get these functions when you submit

your code. If there is an issue with this requirement, please contact the TAs.

In terms of testing your code, there are three tests for this part:

• LogZ (10 points): This test checks that you have correctly modied CliqueTreeCalibrate.m

(originally from PA4) to compute the variable logZ. Be sure to read the comment at the

top: we have modied the code for you to also keep track of message that are not normalized

(in the variable unnormalizedMessages). You will need this to compute logZ.

• CRFNegLogLikelihood (30 points): This test checks that you correctly compute nll, the

rst return value of InstanceNegLogLikelihood.m.

• CRFGradient (40 points): This test checks that you correctly compute grad, the second

return value of InstanceNegLogLikelihood.m.

For debugging, you can use the variables from Part2Sample.mat. It contains the following

values:

• sampleX, sampleY, sampleTheta, sampleModelParams: Input to InstanceNegLogLikelihood.m.

• sampleUncalibratedTree: The resulting clique tree you should construct from the featureSet.

This is also the input to CliqueTreeCalibrate to test the computation of logZ.

• sampleCalibratedTree: The same clique tree after calibration.

• sampleLogZ: The log partition function for this data instance (computed at the same time

as clique tree calibration).

• sampleFeatureCounts: The data feature counts for the input.

• sampleWeightedFeatureCounts: The weighted feature counts for the input.

• sampleModelFeatureCounts: The model expected feature counts for this input.

• sampleRegularizationCost: The regularization cost for the setting of theta.

• sampleRegularizationGradient: The gradient term introduced by regularization.

• sampleNLL: The output regularized negative log-likelihood.

• sampleGrad: The output gradient.

You can check for yourself that the equations for nll and grad hold. That is:

CS228 Programming Assignment #7 8

• sampleNLL == sampleLogZ – sum(sampleWeightedFeatureCounts)

+ sampleRegularizationCost

and

• sampleGrad == sampleModelFeatureCounts – sampleFeatureCounts

+ sampleRegularizationGradient.

3.5 Training the Full Model and Computational Concerns

You might have noticed that we are not asking you to train a full model with stochastic gradient

descent as part of the assignment. This is not because we feel it is unimportant: the entire point

of implementing InstanceNegLogLikelihood is to use it with stochastic gradient descent and a

large dataset to train a set of parameters that can make good predictions on new data.

Instead, we have left it out due to computational concerns. Our implementation is reasonably optimized and takes about two seconds per call to InstanceNegLogLikelihood. It is for

this reason that stochastic gradient descent is a good choice of optimization algorithm. Other

methods require that we compute the average gradient over the entire dataset before making a

single parameter update. With 100 examples and two seconds per example, we would require 3

minutes for every parameter update (and batch gradient descent will typically require hundreds

of updates to converge).

With stochastic gradient descent, one can get reasonable results after only a few passes

through the data. Even so, training takes on the order of ten to twenty minutes. In the grand

scheme of things, this is not that much, but we do not want the bulk of implementation eort

to be spent on some of the optimization tricks we employed.

As a result, training and testing the full CRF model is an optional component of the assignment that you may complete for extra credit. In Part2FullDataset.mat, we provide trainData

and testData. Each is a struct array with X and y elds. In trainData, each X/y pair is the input

the InstanceNegLogLikelihood: X gives the observations and y gives the labels. In testData,

X is still the observations, but y is a dummy array of all zeros, so the data cannot be used for

training.

To train the full model, you should initialized the theta vector to zeros(1,2366) (the

features have a total of 2366 parameters); the model parameters are the same as the sample data,

so you can use those. You can then use stochastic gradient descent from Section 2 to update θ

based on the gradient for each instance that you compute using InstanceNegLogLikelihood.

To complete the extra credit, you should rst train the parameters theta on the trainData.

Given those, you need to write code to compute predicted character values on both the trainData

and the testData.

To submit the extra credit, you need to do three things:

1. Call the function SaveECPredictions(yourPredictions). yourPredictions should be

an 80 × 3 matrix of integers between 1 and 26, where yourPredictions(i,j) is the predicted value (1-26) of the j’th character in the i’th word, i.e., it is what testData(i).y(j)

should be. The matrix is 80×3 because there are 80 test examples, each with 3 characters.

2. Copy and paste the contents of any new les you have written into ExtraCredit.m. The

submit script needs to know the lenames in advance when we collect your code, but we

want to allow you to structure your code any way you like. By copying those new les into

this le, we will collect this le and have your code.

CS228 Programming Assignment #7 9

3. Run submit.m as per normal. Note that each time you change your model, you will need

to rerun SaveECPredictions. For computational reasons, our submit script will not invoke

this function automatically.

To prevent overtting on the test set, we will not give immediate feedback on the extra credit

submissions. The grades for these will be released after the hard deadline.

Your predictions should be produced by the code you write in ExtraCredit.m. In particular,

you should not attempt to label the test set manually, nor use 3rd-party software to solve this,

though you may feel free to use built-in Matlab/Octave commands.

If you get the CRF working (and producing predictions), you will get 10 points of extra

credit. Beyond this, you will get 0.5 point of extra credit per percentage accuracy point beyond

50%. For example, if you predict 75% of the test examples correctly, you will receive 10 + 12.5

points of extra credit. Without any extensions, and with a value of λ = 0.003, we obtained train

and test accuracies of 73% and 62%, respectively, after 5 passes through the data with stochastic

gradient descent.

Here are some possible ideas that you could explore to improve test accuracy for this extra

credit portion:

• Use cross-validation on the trainData to pick a good value of λ. If you don’t do this, you

could try λ = 0.003; we’ve found that it works pretty reasonably.

• Create more training data (e.g., by distorting the provided samples, or even collecting new

handwriting samples).

• Swap out stochastic gradient descent for a more sophisticated algorithm, e.g., L-BFGS

with mini-batching. Matlab and Octave both come with built-in function optimizers (see

fminunc).

• For the truly ambitious, add in more features (e.g., triplet features)! This will require

understanding how featureSet is generated, so read that code carefully.

Some of these are beyond the scope of this class, and you will have to do your own reading up

to implement them. Good luck!

4 Conclusion

Congratulations on completing a very challenging assignment! This assignment ties together

much of the course material, from representation (CRFs and template models) to inference, and

nally learning. We hope that you have managed to get a feel for the interplay between these

various aspects of graphical models. In particular, we hope that you have managed to see for

yourself why training an undirected model is a computationally dicult task, due to the fact that

we have to perform inference on every step of the optimization process; those of you who decided

to train the full model would be well aware of this fact. There are a variety of methods to help

improve the speed of learning a CRF, but many of these are beyond the scope of this class. If you

are interested, a detailed tutorial is available online at http://arxiv.org/pdf/1011.4088v1.