Description
1 Introduction
In PA8, you learned tree-structured Bayesian networks to represent body poses. In this assignment, you will use the pose network you learned in PA8 to perform action
recognition on real-world Kinect™ data. The Kinect™ data was collected by performing actions
and extracting the human pose data associated with each action. We have collected data for the
following three actions: "clap", "high kick", and "low kick". Your goal in this assignment is to
build a model that is able to identify the action performed.
The approach that we will take in this assignment is to model an action as a sequence of poses.
However, unlike in PA8, we do not have the class labels for these poses. More specifically, in PA8
we knew whether a pose belonged to a human or an alien, but in the Kinect™ data, we do not
know in advance what poses comprise an action. In fact, we do not even know what the possible
poses that comprise the various actions are. As such, we will have to build and learn models
that incorporate a latent variable for pose classes, which makes this a more challenging task than
the classification problem in PA8. You will make extensive use of the Expectation-Maximization
(EM) algorithm to learn these models with latent variables.
2 Data Representation
We have processed the action data into a sequence of constituent poses. Each action has a
different number of poses in the sequence depending on the length of the action. The constituent
poses are in the same format used in PA8: each pose has 10 body parts, each described by
the three components (y, x, α). Refer to Section 2 and Figure 1 from PA8
for the representation and notation of each of the parts.
More formally, you are given a set of actions $a_1, \ldots, a_n$. Each action $a_i$ is a sequence of
poses $p^i_1, \ldots, p^i_m$, where the number of poses $m$ is the length of the action. A pose $p^i_j$ can be
described by its body parts $\{O_k\}_{k=1}^{10}$, with $O_k = (y_k, x_k, \alpha_k)$. A pose is stored as a 10 x 3 matrix.
We will also use two of the data structures you saw in PA8, so take a moment to re-familiarize
yourself with them:
• P — This structure holds the model parameters, and is the same as the one used in PA8.
Slight changes are made to this structure for the section on HMMs, which we describe
when they come up.
• G — This is the same graph structure used in PA8 for defining the tree-structured dependencies between parts of a pose. Remember to handle all possible parameterizations of G
in your code.
3 Bayesian Clustering of the Poses
You will be experimenting with two latent variable models. The first is a Bayesian clustering
model, which you will use to cluster poses from different actions: we will combine the constituent
poses from all the action data and cluster them into groups of similar poses. Ideally, we will
automatically discover meaningful pose clusters where the majority of poses in each cluster
are part of a single action.
The structure of the model is exactly as shown in Figure 1. This is similar to the model from
PA8, with the key difference that the class variables C are now hidden; these variables
were observed in PA8 and given as part of the training data. As such, we no longer have a
completely labeled dataset to learn the model from, and we will use the EM algorithm to learn
the model from the partially observed data.
In the clustering model, each cluster, or class, is associated with its own set of conditional
linear Gaussian (CLG) parameters for the tree-structured human pose network. If we have K
classes, then we will have K sets of CLG parameters. Our goal in the EM procedure is to learn
the CLG parameters for each class without knowledge of the true class labels for each pose. This
is similar to what you did in PA8, except that now we are not given the class label for each of
the poses. We describe the EM algorithm for learning the model in the following section.
Figure 1: Graphical model for Bayesian clustering of poses. The class variable is hidden here,
and we've only shown the first 6 body parts for clarity.
3.1 The EM algorithm for Bayesian Clustering
The general concept of EM is to generate a set of soft assignments for your hidden variables
(E-step) and use these assignments to update your estimates of the model parameters (M-step).
Each of these steps is done to maximize the model's likelihood, taking either the parameters or the assignments to the unknown variables as given, and we iterate back and forth between these steps until we have reached a
local maximum and can no longer improve the likelihood. We'll take a look at the details of these
steps below:
E-step: In the E-step our goal is to do soft assignment of poses to classes. Thus, for each
pose, we infer the conditional probabilities for each of the K class labels using the current model
parameters. This can be done by first computing the joint probability of the class assignment
and pose (body parts), which decomposes as follows:
\[ P(C = k, O_1, \cdots, O_{10}) = P(C = k) \prod_{i=1}^{10} P(O_i \mid C = k, O_{pa(i)}) \tag{1} \]
where pa(i) is the index of part i's parent. Each of the probabilities in the product can be
computed as described in PA8, and vary depending on the tree structure used. Once we have
computed the joint probability of the class assignment and pose, we can then condition on the
pose to obtain the conditional class probability (which are our expected sufficient statistics):
\[ P(C = k \mid O_1, \cdots, O_{10}) \tag{2} \]
Note that this is the conditional class probability for a single pose, and you will need to compute
this probability for all poses in the dataset.
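To make these two computations concrete, here is a minimal sketch of the E-step loop over poses, written against the data structures in Section 3.2. The helper logEmissionProb below is hypothetical; it stands in for your own PA8-style evaluation of $\sum_i \log P(O_i \mid C = k, O_{pa(i)})$ using lognormpdf and the CLG parameters stored in P.

% Minimal E-step sketch (logEmissionProb is a hypothetical helper standing in
% for your PA8-style CLG evaluation; adapt it to your own code).
N = size(poseData, 1);
K = length(P.c);
logJoint = zeros(N, K);
for n = 1:N
  pose = squeeze(poseData(n, :, :));        % 10 x 3 matrix for pose n
  for k = 1:K
    % log P(C = k, O_1, ..., O_10), i.e. Equation 1 in log space
    logJoint(n, k) = log(P.c(k)) + logEmissionProb(P, G, pose, k);
  end
end
% Condition on the pose (Equation 2) by normalizing each row in log space.
logNorm = logsumexp(logJoint);              % assumed to reduce along each row
ClassProb = exp(logJoint - repmat(logNorm, 1, K));
loglikelihood = sum(logNorm);               % log likelihood of the data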
M-step: In the M-step, we seek to find the maximum likelihood CLG parameters given our soft
assignments of poses to classes (conditional class probabilities) from the E-step. Thus, for each
class k, we fit the CLG parameters with each pose weighted by its conditional probability as
given in Equation 2. The weight of a particular pose for class k is equivalent to the current
estimated probability that the pose belongs to the class. This step is identical to the parameter
fitting step in PA8, with the only difference being that each pose now makes a weighted contribution
to the estimation of the CLG parameters for all classes (as opposed to only for the given class
in PA8).
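The provided FitGaussianParameters.m and FitLinearGaussianParameters.m (see Section 3.2) handle the weighted fitting for you. Purely as an illustration of what a weighted fit computes, here is a sketch of the weighted maximum-likelihood estimates for a single univariate Gaussian, where X is an N x 1 vector of observations (say, the y-coordinate of one body part across all poses) and W is the corresponding column of ClassProb for class k; the provided functions may additionally apply smoothing.

W = W / sum(W);                        % normalize the soft-assignment weights
mu = sum(W .* X);                      % weighted mean
sigma = sqrt(sum(W .* (X - mu).^2));   % weighted (ML) standard deviation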
3.2 Data Structures and Code
Now let’s get to the implementation. We have provided you with the general structure for the
EM algorithm and the expected input and output parameters. It’s up to you to implement the
rest!
Note: if you need to normalize probabilities, do this in the log probability space before converting
to probabilities. This is required to avoid numerical issues.
• EM_cluster.m (35 points) — This is the function where we run the overall EM procedure
for clustering. This function takes as input the dataset of poses, a known graph structure
G, a set of initial probabilities, and the maximum number of iterations to be run. Descriptions of the input and output parameters are given as comments in the code. Follow the
comments and structure given in the code to complete this function. Note that we start
with the M-step first.
Note on decreasing likelihood: You may notice that the log likelihood can decrease sometimes. This is because we are only printing out the log likelihood of the data. Since we smooth
the variances of our Gaussians to avoid numerical issues, the log likelihood of this smoothing
prior can affect things. If we calculated the log likelihood of the data together with the prior,
that number should always increase. The same holds in the following sections as well.
You may find the following functions that we’ve provided useful in your implementation:
• FitGaussianParameters.m — This is similar to the function you wrote in PA8, while
allowing for weighted training examples. The weights control the influence of each example
on the parameters learned.
• FitLinearGaussianParameters.m — Similar to the function you wrote in PA8, with the
same addition of weights as in FitGaussianParameters.m.
• lognormpdf.m — Computes the natural logarithm of the normal probability density function.
• logsumexp.m — Computes the log(sum(exp(X))) of the input X. This function’s implementation avoids numerical issues when we must add probabilities that are currently in
log space.
Relevant Data Structures:
• poseData — N x 10 x 3 matrix that contains the pose data, where N is the number of
poses, and each pose is associated with a 10 x 3 matrix that describes its parts, similar to
PA8. The poseData is created by randomly sampling poses from all of the actions.
• ClassProb — N x K matrix that contains the conditional class probabilities of the N
examples to the K classes. ClassProb(i, j) is the conditional probability that example i
belongs to class j.
Sample Input/Output: To test your code, a separate file PA9SampleCases.mat contains the input and correct output for each of the sample cases that we test for in the submit script (similar to PA8). For argument j of the function call in part i, you should use
exampleINPUT.t#ia#j (replacing #i with i and #j with j). For output, look at exampleOUTPUT.t#io#j
for argument j of the output to part i.
3.3 Evaluating the Clusters
With the implementation complete for the clustering algorithm, we can proceed to see how well
the clustering algorithm works in practice. We have given you a pre-extracted pose dataset,
a graph structure, and a set of initial probabilities in the file PA9Data.mat. You can also
experiment on your own with different graph structures and initial probabilities.
Try running the following command, which clusters the poses into 3 clusters:
[P loglikelihood ClassProb] = EM_cluster(poseData1, G, InitialClassProb1, 20)
With the posterior class probabilities ClassProb, we can obtain the cluster that each pose
associates most strongly with, and visualize the poses in each of these clusters. You can visualize
a set of poses using the function VisualizeDataset.m. These two steps can be done with the
following commands to visualize poses in cluster i:
[maxprob, assignments] = max(ClassProb, [], 2)
VisualizeDataset(poseData1(assignments == i, :, :))
We can see from the visualized poses that the clusters are not perfect, as the Kinect™ data is
not as well-behaved as the synthetic data used in PA8. In addition, we can look at the ground
truth action labels in the labels1 vector, which gives the action class that each pose was taken
from. This can be done for poses in cluster i as follows:
labels1(assignments == i)’
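To summarize the composition of all clusters at once, you could use a short loop along these lines (a sketch, assuming the assignments and labels1 variables above and the three action classes numbered 1, 2, 3 as in Section 4):

numClusters = size(ClassProb, 2);
for i = 1:numClusters
  members = labels1(assignments == i);
  counts = [sum(members == 1), sum(members == 2), sum(members == 3)];
  purity = max(counts) / max(numel(members), 1);   % fraction from the most common action
  fprintf('Cluster %d: %d poses, counts = [%d %d %d], purity = %.2f\n', ...
          i, numel(members), counts(1), counts(2), counts(3), purity);
end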
We can see that the clusters typically do not consist of poses from a single action. This makes intuitive sense, as actions may share similar poses. For example, the "high kick" and "low kick"
actions may both contain poses with the leg raised to waist level. We've also
provided a set of initial probabilities for clustering into 6 clusters using the following command:
[P loglikelihood ClassProb] = EM_cluster(poseData1, G, InitialClassProb2, 20)
What happens when we have twice the number of clusters? Are the clusters more concentrated
with a single action class? Do the visualizations look better?
4 A Bayesian Classifier for Action Recognition
In the action recognition problem, we are given a set of $N$ training examples $D = \{(a_i, c_i)\}_{i=1}^{N}$
consisting of actions $a_i$ and their associated classes $c_i$; there are 3 different types of actions in
the training set, so $c_i \in \{1, 2, 3\}$, where 1 is "clap", 2 is "high kick" and 3 is "low kick". We
wish to build a model to classify unseen actions. To do so, we will use the general framework of
a Bayesian classifier: we model the joint distribution over an action A and its class C as
\[ P(A, C) = P(A \mid C)\, P(C) \tag{3} \]
The intuition behind this model is that each action has a set of distinct characteristics that we
can capture using the class conditional action model P(A|C).
We can classify an action $a$ by picking the class assignment $c^*$ with the highest posterior
probability $P(c^* \mid a)$ given the action:
\[ c^* = \arg\max_c P(c \mid a) = \arg\max_c \frac{P(a \mid c)\, P(c)}{P(a)} = \arg\max_c P(a \mid c) \tag{4} \]
The last equality follows because we have a uniform class distribution in the training set, and
the denominator P(a) is the same for each of the classes. Thus, given an unseen action, we can
classify it by simply computing P(a|c) over each class, and picking the class whose model yields
the highest likelihood.
We have yet to specify a key component of our classifier: the class conditional action model
P(A|C). What model should we use? One possibility would be to fit a Bayesian clustering model
(a mixture of poses model) for each action to model the types of poses that appear in each action
class. However, we can guess that this will probably not work, as suggested by the clustering
results in the previous section where many of these actions shared poses that look similar. Thus,
it is likely that the class conditional action models will be similar and the classifier will not
perform well.
A key observation about actions that will help us build a better action model is that though
the poses comprising an action may look similar, or even be the same, the sequence in which
these poses occur defines the action. For example, in the “low kick” action, we would expect a
sequence of poses in which the foot is lifted, kicked in a direction, and then returned back to
the original position. Thus, we should try to leverage the temporal nature of action poses in our
model. The mixture of poses model does not account for this temporal nature of actions, so we
will turn to a different model that allows us to leverage this information.
4.1 HMM action models
Using a Hidden Markov Model (HMM) for the action model will allow us to capture the sequential
nature of the data. Given an action $a_i$ consisting of a sequence of $m$ poses $p^i_1, \ldots, p^i_m$, we
can construct an HMM of length $m$ with hidden state variables $S_1, \ldots, S_m$, one for each pose, that
correspond to the unknown pose classes. The HMM action model defines a joint distribution
over the hidden state variables and the poses comprising an action of length $m$:
\[ P(S_1, \ldots, S_m, P_1, \ldots, P_m) = P(S_1)\, P(P_1 \mid S_1) \prod_{i=2}^{m} P(S_i \mid S_{i-1})\, P(P_i \mid S_i) \]
Since the HMM is a template model, it is parameterized by 3 CPDs: the prior distribution over
the initial states $P(S_1)$, the transition model $P(S' \mid S)$, and the emission model $P(P \mid S)$. The first
two CPDs $P(S_1)$ and $P(S' \mid S)$ are table CPDs, while the emission model for pose $P_j$ with parts
$\{O_k\}_{k=1}^{10}$ is
\[ P(P_j \mid S) = \prod_{i=1}^{10} P(O_i \mid S, O_{pa(i)}) \]
where $pa(i)$ denotes the parent of node $i$. This emission model is exactly the pose model that you
worked with in Section 2.4 of PA8, with the skeletal structure for humans that you learned in that assignment.
This equation is very similar to Equation 1, with the only difference here being that we aren't
accounting for prior class probabilities anymore. Figure 2 shows an example HMM for an action
consisting of a sequence of 3 poses, where we have explicitly shown the first 6 body parts that
comprise a pose.
Figure 2: Graphical model for the HMM. In our HMM, each state variable represents the class
of the underlying pose. The emission probabilities for each pose are computed using the learned
parameters.
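As a sanity check on this factorization, the following sketch evaluates the log joint probability of one action in log space, assuming a fully observed state sequence s (a 1 x m vector of values in 1..K) and hypothetical precomputed tables: logPrior(k) = log P(S_1 = k), logTrans(a, b) = log P(S_i = b | S_{i-1} = a), and logEmit(j, k) = log P(p_j | S_j = k).

m = length(s);
logp = logPrior(s(1)) + logEmit(1, s(1));
for j = 2:m
  logp = logp + logTrans(s(j-1), s(j)) + logEmit(j, s(j));   % transition + emission terms
end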
4.2 Learning the HMM action model using EM
Since the state variables are hidden, we will use the EM algorithm to learn the parameters of the
HMM. The general EM framework (iterating between estimating assignments and parameters)
does not change, but the specifics of these steps can be tricky so we will now describe each of
the steps in detail.
E-step: In the E-step, we would like to compute the expected sufficient statistics needed to
estimate our parameters. The two sets of expected sufficient statistics needed are the marginals
over each individual state variable $S_i$, as well as the pairwise marginals over consecutive state
variables $S_i$ and $S_{i-1}$.
Though computing these marginals may seem daunting, the E-step of the algorithm can be implemented using clique tree inference, which you wrote in PA4 and used in PA7. We include
optimized solution code for the functions from PA4 to help with inference in your HMM model,
which you can call using ComputeExactMarginalsHMM.m. In particular, we have modified
it to work in log space to avoid numerical issues.
As described in Section 4.1, the HMM consists of 3 types of factors (CPDs) that you will
need to include in your clique tree for performing inference in the E-step. Note that for a single
action $a_i$ with a sequence of $m$ observed poses $p^i_1, \ldots, p^i_m$, you will be able to reduce the factors
$\phi(P_j, S_j) = P(P_j \mid S_j)$ by the observed poses $p^i_j$ to obtain singleton factors over $S_j$ alone,
$\phi'(S_j) = \phi(p^i_j, S_j)$. With these factors, we can run inference on the HMM for our action $a_i$.
After running inference, we can extract the expected sufficient statistics $M_j[s_j] = P(S_j = s_j \mid a_i)$
and $M[s', s] = \sum_{j=2}^{m} P(S_j = s', S_{j-1} = s \mid a_i)$ from the calibrated clique tree for estimating
the initial state prior and the transition CPD, as explained below. Note that these statistics are
described for a single action, and you will need to aggregate these statistics across the entire
dataset of actions. We also extract the conditional class probabilities $P(S_j = s_j \mid a_i)$ for each
state to estimate the emission CPD, as explained below. Also, as in PA7, we can compute the
likelihood of an action by summing out over any one of the calibrated cliques.
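As one hedged illustration of that last point, suppose you have built the list of reduced, log-space factors for a single action in a variable logFactors; then the call and the log-likelihood extraction might look like the sketch below. This assumes the PA4-style clique tree representation (a struct with a cliqueList array whose entries have a val field in log space) and that logsumexp reduces along each row; check the provided code for the exact conventions.

[M, PCalibrated] = ComputeExactMarginalsHMM(logFactors);
% The calibrated cliques are unnormalized and in log space, so summing out
% any single clique gives the log likelihood of this action's observed poses.
actionLoglik = logsumexp(PCalibrated.cliqueList(1).val);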
M-step:
Using the aggregated expected sufficient statistics, we can re-estimate the parameters in the
model:
• Initial state prior: The prior $P(S_1)$ can be re-estimated as:
\[ P(S_1 = s_1) = \frac{M_1[s_1]}{\sum_{k=1}^{K} M_1[k]} \]
• Transition CPD: We can re-estimate these using the expected sufficient statistics as follows (a code sketch of both table-CPD updates appears after this list):
\[ P(S' = s' \mid S = s) = \frac{M[s', s]}{\sum_{k=1}^{K} M[k, s]} \]
• Emission CPD: The parameters for the emission probabilities are the CLG parameters
used in pose clustering, and can be fit in the same way as the previous section, where we
treat each pose as having a weight equal to its conditional class probability, which reduces
to the M-step from pose clustering. Essentially, in this step you are doing pose clustering
over all the poses in all the actions.
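The two table-CPD updates above can be written compactly against the data structures of Section 6. The sketch below assumes that reshaping a row of PairProb into a K x K matrix puts the current state s on the rows and the next state s' on the columns, matching P.transMatrix(s, s'); the starter-code comments fix the actual convention, and in practice you will also want to add a small amount of smoothing so that no probability is exactly zero.

K = size(ClassProb, 2);

% Initial state prior: aggregate the soft assignments of the first pose of
% every action, then normalize.
firstRows = arrayfun(@(a) a.marg_ind(1), actionData);
P.c = sum(ClassProb(firstRows, :), 1);
P.c = P.c / sum(P.c);

% Transition CPD: aggregate the pairwise statistics over all actions,
% reshape to K x K, and normalize each row.
M = reshape(sum(PairProb, 1), K, K);
P.transMatrix = M ./ repmat(sum(M, 2), 1, K);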
4.3 Data Structures and Code
In practice, there are many numerical issues that arise when implementing the HMM. Thus, for
many of these functions, we would like to keep our probabilities in log space whenever possible,
and to always normalize in log space to avoid numerical issues. Before you start writing this
function, be sure to go over the details of the useful data structures and functions in Sections 5
and 6.
• EM_HMM.m (35 points) — This is the function where you will implement the EM procedure
to learn the parameters of the action HMM. Many parts of this function should be the
same as portions of EM_cluster.m. This function takes as input the dataset of actions, the
dataset of poses, a known graph structure G, a set of initial probabilities, and the number
of iterations to be run. Descriptions for the input and output parameters are given as
comments in the code. Follow the comments and structure given in the code to complete
this function. Note that we start with the M-step first.
4.4 Recognizing Actions
Now that you've implemented the EM algorithm to learn an HMM, we can move on to our
end goal of action recognition. As discussed in Section 4, we will be using a Bayesian classifier,
which means that we will train a separate HMM for each action type, then classify actions based
on which HMM gives the highest likelihood for the action.
• RecognizeActions.m (10 points) — In this function, you should train an HMM for each
action class using the training set datasetTrain. Then, classify each of the instances in
the test set datasetTest and compute the classification accuracy. Details on the datasetTrain and datasetTest data structures can be found in Section 6.
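For the final classification step, suppose you have trained one HMM per class and filled a matrix loglik in which loglik(i, c) is the log likelihood of test action i under the class-c model (computed with the same inference machinery as in your E-step); a minimal sketch of the remaining work is:

[~, predicted_labels] = max(loglik, [], 2);              % pick the most likely class
accuracy = mean(predicted_labels == datasetTest.labels);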
At this point you have successfully implemented an action recognition system. Let’s see how
well this performs in practice. Run the command:
load PA9Data;
[accuracy, predicted_labels] = RecognizeActions(datasetTrain1, datasetTest1, G, 10)
If everything is implemented correctly, you should obtain an accuracy of 82%, which is
excellent performance on this recognition task; in comparison, random guessing would yield an accuracy
of about 33%. You can use the VisualizeDataset.m function to visualize the actions. Try visualizing
some of the actions used to train the HMMs as well as the unknown actions that were classified
using the HMMs. Also, we have set the cardinality of the hidden state variables to be 3 for faster
training.
4.4.1 Effect of EM Initialization on Recognition Accuracy
In the lectures, we learned that the initialization of the EM algorithm can significantly affect
the quality of the model that it returns. You will now see this effect for yourself. Execute the
following command, which uses the same datasets for training and testing, but uses a different
set of initializations:
[accuracy, predicted_labels] = RecognizeActions(datasetTrain2, datasetTest2, G, 10)
You should see that the accuracy drops to 73%, which is 9% less than that obtained with
the previous initialization! Both sets of initial model parameters were randomly sampled from
a uniform distribution, but the difference in accuracy is significant. This occurs because of
the multiple local maxima present in the likelihood function: the latter initialization caused the
algorithm to converge to a different local maximum. At this particular local maximum, the learned
models find it harder to distinguish between the different actions. This example illustrates the
importance of initialization in learning HMMs, and more generally speaking, latent variable
models. Typically, we will want to construct an initial set of model parameters using knowledge
about the latent variables in our model, rather than picking these parameters randomly.
4.5 Extending the Model
Your final task is to refine and extend the code you’ve written for recognizing actions. We
have provided you with the variables datasetTrain3 and datasetTest3 in the PA9Data.mat
file. Your task is to train models on the datasetTrain3 data to recognize the actions in
datasetTest3. We’ve given you a random set of initial probabilities in datasetTrain3 that
don’t perform very well. Note that we do not give you the labels in datasetTest3, which is
similar to the extra credit scenario in PA7.
• RecognizeUnknownActions.m (20 points) — In this function, you should train a model
for each action class using the training set datasetTrain3. Then, classify each of the instances in the test set datasetTest3 and save the classifier predictions by calling SavePredictions.m.
This function is left empty for you to be creative in your approach (we give some suggestions
below). When you are done, write a short description of your method in YourMethod.txt,
which is submitted along with your predictions in the submit script. Your score for this
part will be determined by the percentage of unknown action instances you successfully
recognize. If you obtain an accuracy of x%, you will receive 0.2x points for this part.
To help you get started, here are some suggestions. Try to be creative in your approach, leveraging the ideas you have learned over the entire class. Good luck!
• Better initializations — Can you devise a better method for initializing the model’s
parameters that will help EM to find a good local maximum? For example, you can try
initializing the probabilities by clustering the poses together using a simple method like
K-means or Hierarchical Agglomerative Clustering (HAC). To evaluate your initializations,
you may want to split your dataset into a training set and a validation set, so you can
check your performance on the validation set.
• Hidden state variables — In the sample data we had you run, we fixed the hidden
state variables Si to have cardinality of 3. However, it’s possible that there are more than
3 underlying pose classes. Is there an optimal cardinality for the hidden state variables?
Again, a validation set might be useful to evaluate your performance.
• Extending the HMM — Is the HMM the best model for our action data? You might try
implementing a more complicated model, such as a duration HMM. Another possibility is
to place restrictions on the states and their transitions. For example, we might be certain
that the first state is always the same pose, and that the transitions typically occur in a
fixed order. How can we encode this in the initial state probabilities and transition matrix? (One possible encoding is sketched after this list.)
• Generating more data — The amount of data you are given to learn the HMMs is not
large. Given that we are learning a large number of parameters, more data could be very
helpful. Can you think of ways of warping the existing instances you have to obtain more
training data?
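For instance, one way (a sketch only, not a required approach) to encode the restriction suggested in the "Extending the HMM" bullet is a "left-to-right" initialization, where the chain is encouraged to start in state 1 and transitions mostly stay in place or advance to the next state:

K = 3;                                     % cardinality of the hidden states
initProb = [0.90, 0.05, 0.05];             % strongly prefer starting in state 1
transMatrix = zeros(K, K);
for s = 1:K
  transMatrix(s, s) = 0.80;                % mostly stay in the current state
  if s < K
    transMatrix(s, s + 1) = 0.15;          % otherwise advance to the next state
  end
  rest = 1 - sum(transMatrix(s, :));
  transMatrix(s, :) = transMatrix(s, :) + rest / K;   % keep every entry nonzero
end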
5 Useful Functions for HMM
List of functions we have written for you that you may find useful:
• ComputeExactMarginalsHMM.m — Similar to the ComputeExactMarginalsBP.m function you wrote in PA4. This function takes as input a set of factors in log space for an
HMM, and runs clique tree inference in log space to avoid numerical issues. The function
returns marginals (normalized) for each variable, as well as the calibrated clique tree (unnormalized). Note that the messages are unnormalized, as this may be helpful in computing
the log likelihood.
• SavePrediction.m — Call this function to save your predictions for the unknown actions.
This will prepare the predictions in a mat file so that they can be submitted in the submit
script.
• VisualizeDataset.m — Helps to visualize a dataset of N poses of size N x 10 x 3.
6 Useful Data Structures for HMM
• P — This structure holds the model parameters. There are 2 changes made to this structure
for HMMs that make it different from PA8 and the pose clustering section. First, there
is now a field P.transMatrix, which stores the K x K transition matrix, where K is the
number of possible classes. Second, the field P.c now contains the initial state prior
probabilities, instead of the prior class probability for all states.
• actionData — This is an array of structs that contains the actions. Each element of the
array is an instance of an action, with the fields action, marg_ind, and pair_ind defined
as follows:
– action — a string with the name of this action.
– marg_ind — a set of indices that index the first dimension of the associated poseData and ClassProb matrices, picking out the sequence of poses that make up
this action. Thus, if marg_ind = [10, 11, ..., 19] then poseData(10, :, :), poseData(11, :, :), ..., poseData(19, :, :) give the coordinates of this action's 19-10+1 = 10 component poses. Additionally, rows 10 through 19 of ClassProb give the corresponding
class assignment probabilities of each of these individual poses.
– pair_ind — a set of indices giving the rows in the PairProb data structure that
correspond to the pairwise transition probabilities for each pair of consecutive states
$S_i$ and $S_{i-1}$ in the action. Thus, if pair_ind = [1, 2, ..., 9] then the pairwise transition
probability between the first and second poses in this action will be in row 1 of
PairProb, the transition between the 2nd and 3rd will be in row 2, and so forth. (A short sketch of indexing with marg_ind and pair_ind appears at the end of this section.)
• ClassProb — N x K matrix that contains the conditional class probabilities of the N
poses to the K classes. Note that N here is the total number of poses over all the actions.
ClassProb(i, j) is the conditional probability that pose i belongs to state j. Rows of this
matrix are indexed by the actionData structure. Using this matrix, you should be able
to estimate parameters in the M-step for the initial state prior and the emission CPDs.
• PairProb — V x K^2 matrix that contains the pairwise transition probabilities (expected
sufficient statistics for the transition CPD, $M[s', s]$) for every pair of consecutive states $S_i$
and $S_{i-1}$ in all actions. Thus, every pair of consecutive states connected by an edge has
an entry in this matrix. V is the number of these edges. Rows of this matrix are indexed
by the actionData structure.
• datasetTrain — This is an array of structs that contains actions used for training. Each
element of the array is a struct for a different action, and contains the fields actionData,
poseData, InitialClassProb, and InitialPairProb, which are described above. The first
element of the array is “clap”, the second is “high kick”, and the third is “low kick”.
• datasetTest — This is the struct that contains actions used for testing. There are 3 fields:
actionData, poseData, and labels. The actionData and poseData are described above, and
the labels field is an N x 1 vector with the labels for each of the N action instances in
the datasetTest (with “clap” = 1, “high kick” = 2, “low kick” = 3). Note that for the
unknown test data, the labels field is not provided.
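As a small illustration of how these structures fit together (a sketch, using the field names above), the following pulls out everything belonging to one action instance:

a = actionData(1);                               % one action instance
actionPoses = poseData(a.marg_ind, :, :);        % its sequence of poses
actionClassProb = ClassProb(a.marg_ind, :);      % one row of soft assignments per pose
actionPairProb = PairProb(a.pair_ind, :);        % one row per consecutive state pair
numPoses = length(a.marg_ind);                   % equals length(a.pair_ind) + 1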
7 Conclusion
Congratulations! You’ve completed the final programming assignment for CS228! In this programming assignment, you’ve put together many of the ideas you’ve learned in class, ranging
from the basic factor data structures to advanced concepts like the Expectation-Maximization
algorithm, to create a system that is able to recognize human actions in real-world data. We
hope that you’ve enjoyed the assignment, and good luck for the final!