Description
1 Overview
In Programming Assignment 3, you learned how to construct a Markov network to perform the
task of optical character recognition (OCR). The values in each factor, however, were provided
for you. This assignment returns to the same OCR task, but this time you will write the code
to learn the factor values automatically from data.
This assignment is divided into two parts. In the rst part, we use a small (but real) parameter
learning task to introduce a handful of key concepts applicable to nearly any machine learning
problem:
• Stochastic gradient descent for parameter learning
• Training, validation, and test data
• Regularization
In the second part of the assignment, you will apply these techniques to the OCR task. In
particular, you will construct a Markov network similar to the one from PA3 and use stochastic
gradient descent to learn the parameters of the network from a set of training data.
Be sure to get a head start on this assignment! The rst part introduces a set of new concepts,
and the second is rather intricate. For the adventurous, we have included an extra credit section,
where you will be graded based on how well your OCR system performs. Good luck!
2 Machine Learning Primer
2.1 Stochastic Gradient Descent For Logistic Regression1
Consider the following task: given a dataset of images of either the letter p or q and the corresponding labels for the images, we wish to build a classier that will allow us to predict the
labels of new unseen images as accurately as possible. More concretely, we have a dataset in
which each data sample consists of 129 features (16 × 8 pixels + 1 bias term), X =X1, . . . , X129,
derived from the images, and the corresponding labels Y for the image (0 if it was an image of
a q, 1 if it was an image of a p).
We will learn a model of the conditional probability P(Y |X) using maximum likelihood,
where we use a logistic function for the conditional probability, P(Y |X) = 1
1+exp(−θT X)
; the
logistic model can be considered to be a simple conditional random eld. Given a set of M
training examples, D ={(x[m], y[m])}M
m=1, we want to nd the θ
∗
that maximizes the likelihood
of the observed data:
1
In this section, we have deliberately glossed over many details as these are out of the scope of the class. If
you are interested in a more complete treatment of the material in this section, you may nd it helpful to review
the lecture notes for CS229 (Machine Learning), or to take EE364a (Convex Optimization I).
CS228 Programming Assignment #7 2
θ
∗ = argmax
θ
L(θ : D) = argmax
θ
QM
m=1 P(y[m]|x[m]; θ)
Equivalently, we can minimize the negative log likelihood:
θ
∗ = argmin
θ
− l(θ : D)
= argmin
θ
PM
m=1 − log P(y[m]|x[m]; θ)
We will use stochastic gradient descent to nd the parameter values θ
∗
. However, we will rst
revisit the regular (batch) gradient descent algorithm before describing the stochastic variant.
The key intuition behind the gradient descent algorithm is that by taking steps in the direction
of the negative gradient of −l(θ : D), which is the direction of steepest descent, we will eventually
converge to a minimizer of −l(θ : D). This intuition is formalized in the update rule for the
parameter vector θ that is used on each iteration of the algorithm
θ := θ − α∇θ[−l(θ : D)]
where α is a parameter which determines the size of the steps we will take in parameter space
(also known as the learning rate).
Given this update rule, the gradient descent algorithm takes an initial assignment to the
parameters θ0, and repeatedly uses the update rule described above to compute new values for
θ, until we have converged (for instance, when the gradient is suciently small). Notice that
this algorithm requires summing up the gradients over every single training example, which is
why it is known as the batch gradient descent algorithm.
In the case where computing the gradient over a single example is computationally expensive
(for instance, in CRF parameter learning), the batch algorithm can take very long to converge,
as it requires the gradients over all training examples to be computed before the algorithm can
perform an update. In such situations, an algorithm that approximates the gradient over the
entire training set by the gradient over a single (or a few) training examples, can be a far more
ecient alternative; such an algorithm is known as a stochastic gradient descent algorithm.
One disadvantage of using such a coarse approximation of the gradient, however, is that while
the algorithm gets θ close to the minimum more quickly, the parameter vector may now oscillate
about the true minimum. We therefore make one further tweak to the algorithm to help mitigate
this behavior, by slowly decrease the learning rate α to zero as the algorithm progresses. The
pseudocode of the stochastic gradient descent algorithm that we will use in this assignment is as
follows:
for k = 1 to max_iterations 2
:
Pick an arbitrary training example (x[m], y[m]), then update
θ := θ − αk∇θ[− log P(y[m]|x[m]; θ)]
where αk =
0.1
1+√
k
.
You should now implement the stochastic gradient descent algorithm:
• StochasticGradientDescent.m (10 Points) This function runs stochastic gradient descent
for the specied number of iterations and returns the value of θ upon termination. It takes
in three inputs:
1. a function handle for a function that computes the gradient of the negative loglikelihood of the data, at a particular θ and for a particular iteration number k. Note:
this function will return the gradient for an arbitrary training example, so you do not
have to explicitly pick the training example in your code for StochasticGradientDescent .
2We will use max_iterations=5000 in this part of the assignment.
CS228 Programming Assignment #7 3
2. an initial value for θ.
3. the number of iterations to run.
To test your implementation, we have provided you with LRTrainSGD.m. Sample data
can be found in Train1X.mat and Train1Y.mat which contain, respectively, the features
and ground truth labels for the training instances, in the right format for plugging into
LRTrainSGD.m. Go ahead and run LRTrainSGD.m on the training data with λ = 0 to get
the maximum likelihood estimates of the parameters. Then use LRPredict.m to get the
predictions for the training data, and LRAccuracy.m to calculate the classication accuracy
(where 1 = perfect accuracy) on the training data. You should obtain an accuracy of 0.96
on the training data.
2.2 Overtting & Regularization
Now try your model out on the provided test data in Test1X.mat and Test1Y.mat. Use
LRPredict.m and LRAccuracy.m to assess the performance of your model on the test set. Unfortunately, you will see that it’s somewhat less than the training set accuracy, which suggests
that the model has overt the training data. This problem of overtting can be addressed using
regularization. Concretely, we will add a new term to the negative log likelihood that penalizes
us for having large values for θ :
θ
∗ = argmin
θ
− l(θ; D) + λ
2
Pn
i=1 θ
2
i
This specic form of regularization is called L2-regularization, since we are penalizing the square
of the parameters (which is the square of their L2-norm). The code we have provided you in
LRTrainSGD.m already does this. But what value of λ should we pick?
One way to search for a good value of λ is by using a validation set. We can train our model
using the training set using various values of λ and then evaluate the models’ accuracies on the
validation set. Then, we can pick the value of λ that yields the best accuracy on the validation
set. Note that we do not use the test data to pick the value for λ, as we wish to use the test
data to obtain an estimate for the generalization performance of our model that is, how well
our model will perform on unseen data. If we were to use the test data to pick λ, our model will
be t to this data, and the model’s performance on this data will no longer be representative of
its performance on unseen data.
You will now implement a function that searches for a good value for λ:
• LRSearchLambdaSGD.m (10 Points) This function takes as inputs:
1. A matrix of features for the training data
2. A vector of ground truth labels for the training data
3. A matrix of features for the validation data
4. A vector of ground truth labels for the validation data
5. A vector of candidate values for the regularization parameter λ.
For each value of λ the function should t a logistic regression model to the training
data using LRTrainSGD.m and then evaluate the accuracy of the resulting model on
the validation data. The output should be a vector of the accuracy in the validation
set for each λ.
CS228 Programming Assignment #7 4
Test your function with Train1X.mat, Train1Y.mat, Validation1X.mat and Validation1Y.mat,
as well as the vector of candidate lambdas in Part1Lambdas.mat. Expected outputs from your
function can be found in ValidationAccuracy.mat. You should see that a little bit of regularization helps performance in the test data!
3 Learning CRF Parameters For OCR
3.1 Introduction
Now that you have the basic tools for parameter learning, gradient descent and regularization, we
can apply those tools to more complex models than logistic regression. For the remainder of this
assignment, you will implement the necessary functions to learn the parameters of a conditional
random eld for optical character recognition (OCR).
In the CRF model, there are two kinds of variables: the hidden variables that we want
to model and those that are always observed. In the case of OCR, we want to model the
character assignments (such as ‘a’ or ‘c’), and we always observe the character images, which are
arrays of pixel values. Typically, the unobserved variables are denoted by Y and the observed
variables are denoted by X. The CRF seeks to model P(Y |X), the conditional distribution over
character assignments given the observed images. We will be working with the OCR model from
programming assignment 3 that includes only the singleton and pairwise factors. The structure
of the model is shown below:
Figure 1: CRF with singleton and pairwise factors.
3.2 The Log-Linear Representation
Up until now, the graphical models you have constructed have focused on the table factor
representation. In this assignment, you will instead build the model around log-linear features.
A feature is just a function fi(Di) : Val(Di) → R, where Di
is the set of variables in the scope of
the ith feature; in this assignment, all features are binary indicator features (taking on a value
of either 0 or 1). Each feature has an associated weight θi
. Given the features {fi}
k
i=1 and the
weights {θi}
k
i=1, the distribution is dened as:
P(Y|x : θ) = 1
Zx(θ)
exp (X
k
i=1
θifi(Di)
)
.
The term Zx(θ) is the partition function:
Zx(θ) ≡
X
Y
exp (X
k
i=1
θifi(Di)
)
.
CS228 Programming Assignment #7 5
Note that computing Zx(θ) requires summing over an exponential number of assignments to
the variables, so we won’t be able to use a simple brute-force technique to compute P(Y|x : θ)
even though the feature values can be computed eciently.
In our CRF, we have three types of features:
• f
C
i,c(Yi), which operates on single characters / hidden states (an indicator for Yi = c).
• f
I
i,j,c,d(Yi
, xij ), which operates on a single character / hidden state and an image pixel
associated with that state (an indicator for Yi = c, xij = d). These are collectively used
to encode the individual probability that Yi = c given xi
.
• f
P
i,c,d(Yi
, Yi+1) which operates on a pair of adjacent characters / hidden states (an indicator
for Yi = c, Yi+1 = d).
3.3 Learning CRF Parameters: Theory
You will learn the parameters of the CRF in the same way you did for logistic regression in
Section 2. You need to write a function that takes a single data instance and computes the
cost of that instance and the gradient of the parameters with respect to the cost. Given such
a function, you can then use stochastic gradient descent to optimize the parameter values.
Recall that our goal is to maximize the log-likelihood of the parameters given the data. Thus
the cost we minimize is simply the negative log-likelihood, together with a L2-regularization
penalty on the parameter values to prevent overtting. The function we seek to minimize is:
nll(x, Y, θ) ≡ log(Zx(θ)) −
X
k
i=1
θifi(Y, x) + λ
2
X
k
i=1
θ
2
i
.
As you saw in lecture, the partial derivatives for this function have an elegant form:
∂
∂θi
nll(x, Y, θ) = Eθ[fi
] − ED[fi
] + λθi
.
In the derivative, there are two expectations: Eθ[fi
], the expectation of feature values with
respect to the model parameters, and ED[fi
], the expectation of the feature values with respect
to the given data instance D ≡ (X, y). Using the denition of expectation, we have:
Eθ[fi
] = X
Y0
P(Y0
|x)fi(Y0
, x),
ED[fi
] = fi(Y, x).
Note that we drop the θ from P(Y0
|x) for convenience.
In the rst equation, we sum over all possible assignments to the Y variables in the scope
of feature fi
. Since each feature has a small number of Y variables in its scope, this sum is
tractable. Unfortunately, computing the conditional probability P(Y0
|x) for each assignment
requires performing inference for the data instance x.
There is one more detail to take care of: shared parameters across multiple features. As
mentioned in Week 2’s lecture on Shared Features in Log-Linear Models, parameter sharing is a
form of templating used to reduce the total number of parameters we need to learn. Let
f
(i)
be the set of features that share parameter θi
. Then we can expand the equations above as:
nll(x, Y, θ) ≡ log(Zx(θ)) −
X
k
i=1
θi
X
fj∈{f
(i)}
fj (Y, x)
+
λ
2
X
k
i=1
θ
2
i
,
CS228 Programming Assignment #7 6
∂
∂θi
nll(x, Y, θ) = X
fj∈{f
(i)}
Eθ[fj ] −
X
fj∈{f
(i)}
ED[fj ] + λθi
.
Note that in this case, θi
is not necessarily related to fi any longer. Instead, θi
is associated
with a set of features
f
(i)
. k is the total number of parameters θi
, and not necessarily the
total number of features fj .
The parameters that we use in the CRF are:
• θ
C
c
, shared by
f
C
i,c(Yi)
i
.
• θ
I
c,d, shared by n
f
I
i,j,c,d(Yi
, xij )
o
i,j
.
• θ
P
c,d, shared by n
f
P
i,c,d(Yi
, Yi+1)
o
i
.
Essentially, this parameter tying scheme ties parameters across dierent locations together; that
is, a character at the start of the word shares the same parameters as a character at the end of
the word.
Given this discussion, we can now state your mission: for a given data instance (x, Y) and
a parameter setting θ, you must compute the cost function (regularized negative log-likelihood)
and the gradient of parameters with respect to that cost. Doing so involves six terms (ignoring
the issue of shared parameters in these):
• The log partition function: log(Zx(θ))
• The weighted feature counts: θifi(Y, x)
• The regularization cost: λ
2
Pk
i=1 θ
2
i
• The model expected feature counts: P
Y0 P(Y0
|x)fi(Y0
, x),
• The data feature counts: fi(Y, x)
• The regularization gradient term: λθi
Take note that you will have to incorporate parameter sharing into these terms.
Unlike previous assignments, how you structure your code is almost entirely up to you. We
will test only two functions:
1. CliqueTreeCalibrate.m: You should modify this function (from PA 4) to also compute
log(Zx(θ)).
2. InstanceNegLogLikelihood.m: You should ll out this function to compute the regularized negative log-likelihood and its gradient for a single data instance (x, Y).
But we are getting ahead of ourselves…
3.4 Learning CRF Parameters: Implementation
At a high level, the entirety of the assignment is to implement InstanceNegLogLikelihood.m.
Make sure to read the comments in the le: it contains a helpful discussion of the data structures
you are using. The features above might look very complicated, but in InstanceNegLogLikelihood
we’ve provided you with a function that generates all of the required features and stores them
in featureSet.
The steps you need to take in this function are:
CS228 Programming Assignment #7 7
1. Convert the featureSet into a clique tree with its factors lled in. This tree should be a
chain, so the rst clique has scope {Y1, Y2}, the second has scope {Y2, Y3}, and so on.
2. Modify CliqueTreeCalibrate.m and then call it to calibrate the tree and simultaneously
compute log(Zx(θ)).
3. Use the calibrated clique tree in conjunction with featureSet to compute the weighted
feature counts, the data feature counts, and the model expected feature counts.
4. Use these counts to compute the regularized negative log-likelihood and gradient.
We recommend that you do not try to compute all of these within InstanceNegLogLikelihood
itself. For reference, our solution denes six helper functions. We ask that you dene helper
functions within InstanceNegLogLikelihood.m so that we get these functions when you submit
your code. If there is an issue with this requirement, please contact the TAs.
In terms of testing your code, there are three tests for this part:
• LogZ (10 points): This test checks that you have correctly modied CliqueTreeCalibrate.m
(originally from PA4) to compute the variable logZ. Be sure to read the comment at the
top: we have modied the code for you to also keep track of message that are not normalized
(in the variable unnormalizedMessages). You will need this to compute logZ.
• CRFNegLogLikelihood (30 points): This test checks that you correctly compute nll, the
rst return value of InstanceNegLogLikelihood.m.
• CRFGradient (40 points): This test checks that you correctly compute grad, the second
return value of InstanceNegLogLikelihood.m.
For debugging, you can use the variables from Part2Sample.mat. It contains the following
values:
• sampleX, sampleY, sampleTheta, sampleModelParams: Input to InstanceNegLogLikelihood.m.
• sampleUncalibratedTree: The resulting clique tree you should construct from the featureSet.
This is also the input to CliqueTreeCalibrate to test the computation of logZ.
• sampleCalibratedTree: The same clique tree after calibration.
• sampleLogZ: The log partition function for this data instance (computed at the same time
as clique tree calibration).
• sampleFeatureCounts: The data feature counts for the input.
• sampleWeightedFeatureCounts: The weighted feature counts for the input.
• sampleModelFeatureCounts: The model expected feature counts for this input.
• sampleRegularizationCost: The regularization cost for the setting of theta.
• sampleRegularizationGradient: The gradient term introduced by regularization.
• sampleNLL: The output regularized negative log-likelihood.
• sampleGrad: The output gradient.
You can check for yourself that the equations for nll and grad hold. That is:
CS228 Programming Assignment #7 8
• sampleNLL == sampleLogZ – sum(sampleWeightedFeatureCounts)
+ sampleRegularizationCost
and
• sampleGrad == sampleModelFeatureCounts – sampleFeatureCounts
+ sampleRegularizationGradient.
3.5 Training the Full Model and Computational Concerns
You might have noticed that we are not asking you to train a full model with stochastic gradient
descent as part of the assignment. This is not because we feel it is unimportant: the entire point
of implementing InstanceNegLogLikelihood is to use it with stochastic gradient descent and a
large dataset to train a set of parameters that can make good predictions on new data.
Instead, we have left it out due to computational concerns. Our implementation is reasonably optimized and takes about two seconds per call to InstanceNegLogLikelihood. It is for
this reason that stochastic gradient descent is a good choice of optimization algorithm. Other
methods require that we compute the average gradient over the entire dataset before making a
single parameter update. With 100 examples and two seconds per example, we would require 3
minutes for every parameter update (and batch gradient descent will typically require hundreds
of updates to converge).
With stochastic gradient descent, one can get reasonable results after only a few passes
through the data. Even so, training takes on the order of ten to twenty minutes. In the grand
scheme of things, this is not that much, but we do not want the bulk of implementation eort
to be spent on some of the optimization tricks we employed.
As a result, training and testing the full CRF model is an optional component of the assignment that you may complete for extra credit. In Part2FullDataset.mat, we provide trainData
and testData. Each is a struct array with X and y elds. In trainData, each X/y pair is the input
the InstanceNegLogLikelihood: X gives the observations and y gives the labels. In testData,
X is still the observations, but y is a dummy array of all zeros, so the data cannot be used for
training.
To train the full model, you should initialized the theta vector to zeros(1,2366) (the
features have a total of 2366 parameters); the model parameters are the same as the sample data,
so you can use those. You can then use stochastic gradient descent from Section 2 to update θ
based on the gradient for each instance that you compute using InstanceNegLogLikelihood.
To complete the extra credit, you should rst train the parameters theta on the trainData.
Given those, you need to write code to compute predicted character values on both the trainData
and the testData.
To submit the extra credit, you need to do three things:
1. Call the function SaveECPredictions(yourPredictions). yourPredictions should be
an 80 × 3 matrix of integers between 1 and 26, where yourPredictions(i,j) is the predicted value (1-26) of the j’th character in the i’th word, i.e., it is what testData(i).y(j)
should be. The matrix is 80×3 because there are 80 test examples, each with 3 characters.
2. Copy and paste the contents of any new les you have written into ExtraCredit.m. The
submit script needs to know the lenames in advance when we collect your code, but we
want to allow you to structure your code any way you like. By copying those new les into
this le, we will collect this le and have your code.
CS228 Programming Assignment #7 9
3. Run submit.m as per normal. Note that each time you change your model, you will need
to rerun SaveECPredictions. For computational reasons, our submit script will not invoke
this function automatically.
To prevent overtting on the test set, we will not give immediate feedback on the extra credit
submissions. The grades for these will be released after the hard deadline.
Your predictions should be produced by the code you write in ExtraCredit.m. In particular,
you should not attempt to label the test set manually, nor use 3rd-party software to solve this,
though you may feel free to use built-in Matlab/Octave commands.
If you get the CRF working (and producing predictions), you will get 10 points of extra
credit. Beyond this, you will get 0.5 point of extra credit per percentage accuracy point beyond
50%. For example, if you predict 75% of the test examples correctly, you will receive 10 + 12.5
points of extra credit. Without any extensions, and with a value of λ = 0.003, we obtained train
and test accuracies of 73% and 62%, respectively, after 5 passes through the data with stochastic
gradient descent.
Here are some possible ideas that you could explore to improve test accuracy for this extra
credit portion:
• Use cross-validation on the trainData to pick a good value of λ. If you don’t do this, you
could try λ = 0.003; we’ve found that it works pretty reasonably.
• Create more training data (e.g., by distorting the provided samples, or even collecting new
handwriting samples).
• Swap out stochastic gradient descent for a more sophisticated algorithm, e.g., L-BFGS
with mini-batching. Matlab and Octave both come with built-in function optimizers (see
fminunc).
• For the truly ambitious, add in more features (e.g., triplet features)! This will require
understanding how featureSet is generated, so read that code carefully.
Some of these are beyond the scope of this class, and you will have to do your own reading up
to implement them. Good luck!
4 Conclusion
Congratulations on completing a very challenging assignment! This assignment ties together
much of the course material, from representation (CRFs and template models) to inference, and
nally learning. We hope that you have managed to get a feel for the interplay between these
various aspects of graphical models. In particular, we hope that you have managed to see for
yourself why training an undirected model is a computationally dicult task, due to the fact that
we have to perform inference on every step of the optimization process; those of you who decided
to train the full model would be well aware of this fact. There are a variety of methods to help
improve the speed of learning a CRF, but many of these are beyond the scope of this class. If you
are interested, a detailed tutorial is available online at http://arxiv.org/pdf/1011.4088v1.