## Description

1 Introduction

In PA8, you learned tree-structured Bayesian networks to represent body poses. In this assignment, you will be using the network you learned for human poses from PA8 to perform action

recognition on real-world KinectTMdata. The KinectTMdata was collected by performing actions

and extracting the human pose data associated with each action. We have collected data for the

following three actions: “clap”, “high kick”, and “low kick”. Your goal in this assignment is to

build a model that is able to identify the action performed.

The approach that we will take in this assignment is to model an action as a sequence of poses.

However, unlike in PA8, we do not have the class labels for these poses. More specifically, in PA8

we knew whether a pose belonged to a human or an alien, but in the KinectTMdata, we do not

know in advance what poses comprise an action. In fact, we do not even know what the possible

poses that comprise the various actions are. As such, we will have to build and learn models

that incorporate a latent variable for pose classes, which makes this a more challenging task than

the classification problem in PA8. You will make extensive use of the Expectation-Maximization

(EM) algorithm to learn these models with latent variables.

2 Data Representation

We have processed the action data into a sequence of constituent poses. Each action has a

di↵erent number of poses in the sequence depending on the length of the action. The constituent

poses are in the same format used in PA8: that is, each pose has the 10 body parts each with

the three (y, x, ↵) components representing the pose. Refer to Section 2 and Figure 1 from PA8

for the representation and notation of each of the parts.

More formally, you are given a set of actions a1,…,an. Each action ai is a sequence of

poses pi

1,…,pi

m, where the number of poses m is the length of the action. A pose pi

j can be

described by its body parts {Ok}10

k=1, with Ok = (yk, xk, ↵k). A pose is stored as a 10 x 3 matrix.

We will also use two of the data structures you saw in PA8, so take a moment to re-familiarize

yourself with them:

• P — This structure holds the model parameters, and is the same as the one used in PA8.

Slight changes are made to this structure for the section on HMM’s, which we describe

when they come up.

• G — This is the same graph structure used in PA8 for defining the tree structure dependencies between parts of a pose. Remember to handle all possible parameterizations of G

in your code.

CS228 Programming Assignment #9 2

3 Bayesian Clustering of the Poses

You will be experimenting with two latent variable models. The first is a Bayesian clustering

model, which you will use to cluster poses from di↵erent actions: we will combine the constituent

poses from all the action data and cluster them into groups of similar poses. Ideally, we will

automatically discover meaningful pose clusters where the majority of poses in the each cluster

are part of a single action.

The structure of the model is exactly as shown in Figure 1. This is similar to the model from

PA8, with a key di↵erence that the class variables C are now hidden variables; these variables

were observed in PA8 and given as part of the training data. As such, we no longer have a

completely labeled dataset to learn the model from, and we will use the EM algorithm to learn

the model from the partially labeled data.

In the clustering model, each cluster, or class, is associated with its own set of conditional

linear Gaussian (CLG) parameters for the tree-structured human pose network. If we have K

classes, then we will have K sets of CLG parameters. Our goal in the EM procedure is to learn

the CLG parameters for each class without knowledge of the true class labels for each pose. This

is similar to what you did in PA8, except that now we are not given the class label for each of

the poses. We describe the EM algorithm for learning the model in the following section.

C

O1

O2 O3

O4

O5

O6

Figure 1: Graphical model for Bayesian clustering of poses. The class variable is hidden here,

and we’ve only shown the first 6 body parts for clarity.

3.1 The EM algorithm for Bayesian Clustering

The general concept of EM is to generate a set of soft assignments for your hidden variables

(E-step) and use these assignments to update your estimates of the model parameters (M-step).

Each of these steps is done to maximize the model’s likelihood taking either parameters or unknown data as given, and we iterate back and forth between these steps until we have reached a

local maxima and can no longer improve the likelihood. We’ll take a look at the details of these

steps below:

E-step: In the E-step our goal is to do soft assignment of poses to classes. Thus, for each

pose, we infer the conditional probabilities for each of the K class labels using the current model

CS228 Programming Assignment #9 3

parameters. This can be done by first computing the joint probability of the class assignment

and pose (body parts), which decomposes as follows:

P(C = k, O1, ··· , O10) = P(C = k)

Y

10

i=1

P(Oi|C = k, Opa(i)) (1)

where pa(i) is the index of part i’s parent. Each of the probabilites in the product can be

computed as described in PA8, and vary depending on the tree structure used. Once we have

computed the joint probability of the class assignment and pose, we can then condition on the

pose to obtain the conditional class probability (which are our expected sucient statistics):

P(C = k|O1, ··· , O10) (2)

Note that this is the conditional class probability for a single pose, and you will need to compute

this probability for all poses in the dataset.

M-step: In M-step, we seek to find the maximum likelihood CLG parameters given our soft

assignments of poses to classes (conditional class probabilities) from the E-step. Thus, for each

class k, we fit the CLG parameters with each pose weighted by its conditional probability as

given in Equation 2. The weight of a particular pose for class k is equivalent to the current

estimated probability that the pose belongs to the class. This step is identical to the parameter

fitting step in PA8, with the only di↵erence that each pose now makes a weighted contribution

to the estimation of the CLG parameters for all classes (as opposed to only for the given class

in PA8).

3.2 Data Structures and Code

Now let’s get to the implementation. We have provided you with the general structure for the

EM algorithm and the expected input and output parameters. It’s up to you to implement the

rest!

Note: if you need to normalize probabilities, do this in the log probability space before converting

to probabilities. This is required to avoid numerical issues.

• EM cluster.m (35 points) — This is the function where we run the overall EM procedure

for clustering. This function takes as input the dataset of poses, a known graph structure

G, a set of initial probabilities, and the maximum number of iterations to be run. Descriptions of the input and output parameters are given as comments in the code. Follow the

comments and structure given in the code to complete this function. Note that we start

with the M-step first.

Note on decreasing likelihood: You may notice that the log likelihood can decrease sometimes. This is because we are only printing out the log likelihood of the data. Since we smooth

the variances of our Gaussians to avoid numerical issues, the log likelihood of this smoothing

prior can a↵ect things. If we calculated the log likelihood of the data together with the prior,

that number should always increase. The same holds in the following sections as well.

You may find the following functions that we’ve provided useful in your implementation:

• FitGaussianParameters.m — This is similar to the function you wrote in PA8, while

allowing for weighted training examples. The weights control the influence of each example

on the parameters learned.

CS228 Programming Assignment #9 4

• FitLinearGaussianParameters.m — Similar to the function you wrote in PA8, with the

same addition of weights as in FitGaussianParameters.m.

• lognormpdf.m — Computes the natural logarithm of the normal probability density function.

• logsumexp.m — Computes the log(sum(exp(X))) of the input X. This function’s implementation avoids numerical issues when we must add probabilities that are currently in

log space.

Relevant Data Structures:

• poseData — N x 10 x 3 matrix that contains the pose data, where N is the number of

poses, and each pose is associated with a 10 x 3 matrix that describes its parts, similar to

PA8. The poseData is created by randomly sampling poses from all of the actions.

• ClassProb — N x K matrix that contains the conditional class probabilities of the N

examples to the K classes. ClassProb(i, j) is the conditional probability that example i

belongs to class j.

Sample Input/Output: To test your code, a seperate file PA9SampleCases.mat contains the input and correct output for each of the sample cases that we test for in the submit script (Similar to PA8). For argument j of the function call in part i, you should use

exampleINPUT.t#ia#j (replacing the #j with j). For output, look at exampleOUTPUT.t#io#j

for argument j of the output to part i.

3.3 Evaluating the Clusters

With the implementation complete for the clustering algorithm, we can proceed to see how well

the clustering algorithm works in practice. We have given you a pre-extracted pose dataset,

a graph structure, and a set of initial probabilities in the file PA9Data.mat. You can also

experiment on your own with di↵erent graph structures and initial probabilities.

Try running the following command, which clusters the poses into 3 clusters:

[P loglikelihood ClassProb] = EM_cluster(poseData1, G, InitialClassProb1, 20)

With the posterior class probabilities ClassProb, we can obtain the cluster that each pose

associates most strongly with, and visualize the poses in each of these clusters. You can visualize

a set of poses using the function VisualizeDataset.m. These two steps can be done with the

following commands to visualize poses in cluster i:

[maxprob, assignments] = max(ClassProb, [], 2)

VisualizeDataset(poseData1(assignments == i, :, :))

We can see from the visualized poses that the clusters are not perfect, as the KinectTMdata is

not as well-behaved as the synthetic data used in PA8. In addition, we can look at the ground

truth action labels in the labels1 vector, which gives the action class that each pose was taken

from. This can be done for poses in cluster i as follows:

labels1(assignments == i)’

CS228 Programming Assignment #9 5

We can see that the clusters typically do not consist of poses from a single action. This makes intuitive sense, as actions may share similar poses. For example, in the “high kick” and “low kick”

actions, it’s possible that there are poses that both have the leg raised to the waist. We’ve also

provided a set of initial probabilities for clustering into 6 clusters using the following command:

[P loglikelihood ClassProb] = EM_cluster(poseData1, G, InitialClassProb2, 20)

What happens when we have twice the number of clusters? Are the clusters more concentrated

with a single action class? Do the visualizations look better?

4 A Bayesian Classifier for Action Recognition

In the action recognition problem, we are given a set of N training examples D = {(ai, ci)}

N

i=1

consisting of actions ai and their associated classes ci; there are 3 di↵erent types of actions in

the training set, so ci 2 {1, 2, 3}, where 1 is “clap”, 2 is “high kick” and 3 is “low kick”. We

wish to build a model to classify unseen actions. To do so, we will use the general framework of

a Bayesian classifier: we model the joint distribution over an action A and its class C as

P(A, C) = P(A|C)P(C) (3)

The intuition behind this model is that each action has a set of distinct characteristics that we

can capture using the class conditional action model P(A|C).

We can classify an action a by picking the class assignment c⇤ with the highest posterior

probability P(c⇤|a) given the action:

c⇤ = arg max c

P(c|a) = arg max c

P(a|c)P(c)

P(a) = arg max c

P(a|c) (4)

The last equality follows because we have a uniform class distribution in the training set, and

the denominator P(a) is the same for each of the classes. Thus, given an unseen action, we can

classify it by simply computing P(a|c) over each class, and picking the class whose model yields

the highest likelihood.

We have yet to specify a key component of our classifier: the class conditional action model

P(A|C). What model should we use? One possibility would be to fit a Bayesian clustering model

(a mixture of poses model) for each action to model the types of poses that appear in each action

class. However, we can guess that this will probably not work, as suggested by the clustering

results in the previous section where many of these actions shared poses that look similar. Thus,

it is likely that the class conditional action models will be similar and the classifier will not

perform well.

A key observation about actions that will help us build a better action model is that though

the poses comprising an action may look similar, or even be the same, the sequence in which

these poses occur defines the action. For example, in the “low kick” action, we would expect a

sequence of poses in which the foot is lifted, kicked in a direction, and then returned back to

the original position. Thus, we should try to leverage the temporal nature of action poses in our

model. The mixture of poses model does not account for this temporal nature of actions, so we

will turn to a di↵erent model that allows us to leverage this information.

4.1 HMM action models

Using a Hidden Markov Model (HMM) for the action model will allow us to capture the sequential

nature of the data. Given an action ai consisting of a sequence of m poses pi

1,…,pi

m, we

CS228 Programming Assignment #9 6

can construct a HMM of length m with hidden state variables S1,…,Sm for each pose that

correspond to the unknown pose classes. The HMM action model defines a joint distribution

over the hidden state variables and the poses comprising an action of length m:

P(S1,…,Sm, P1,…,Pm) = P(S1)P(P1|S1)

Ym

i=2

P(Si|Si1)P(Pi|Si)

Since the HMM is a template model, it is parameterized by 3 CPDs – the prior distribution over

the initial states P(S1), the transition model P(S0

|S), and the emission model P(P|S). The first

two CPDs P(S1) and P(S0

|S) are table CPDs, while the emission model for pose Pj with parts

{Ok}10

k=1 is

P(Pj |S) = Y

10

i=1

P(Oi|S, Opa(i))

where pa(i) denotes the parent of node i, is actually the pose model that you worked with in

Section 2.4 of PA8, with the skeletal structure for humans that you learned in the assignment.

This equation is very similar to Equation 1, with the only di↵erence here being that we aren’t

accounting for prior class probabilities anymore. Figure 2 shows an example HMM for an action

consisting of a sequence of 3 poses, where we have explicitly shown the first 6 body parts that

comprise a pose.

S1

O1

O2 O3

O4

O5

O6

S2

O1

O2 O3

O4

O5

O6

S3

O1

O2 O3

O4

O5

O6

Figure 2: Graphical model for the HMM. In our HMM, each state variable represents the class

of the underlying pose. The emission probabilities for each pose are computed using the learned

parameters.

4.2 Learning the HMM action model using EM

Since the state variables are hidden, we will use the EM algorithm to learn the parameters of the

HMM. The general EM framework (iterating between estimating assignments and parameters)

does not change, but the specifics of these steps can be tricky so we will now describe each of

the steps in detail.

E-step: In the E-step, we would like to compute the expected sucient statistics needed to

estimate our parameters. The two sets of expected sucient statistics needed are the marginals

over each individual state variable Si, as well as the pairwise marginals over consecutive state

variables Si and Si1.

Though computing these marginals may seem daunting, the E-step of the algorithm can be implemented using clique tree inference, which you wrote in PA4 and used in PA7. We include

CS228 Programming Assignment #9 7

optimized solution code for the functions from PA4 to help with inference in your HMM model,

which you can call using ComputeExactMarginalsHMM.m. In particular, we have modified

it to work in log space to avoid numerical issues.

As described in Section 4.1, the HMM consists of 3 types of factors (CPDs) that you will

need to include in your clique-tree for performing inference in the E-step. Note that for a single

action ai with a sequence of m observed poses pi

1,…,pi

m, you will be able to reduce the factors (Pj , Sj ) = P(Pj |Sj ) by the observed poses pi

j to obtain singleton factors over Sj alone,

0

(Sj ) = (pi

j , Sj ). With these factors, we can run inference on the HMM for our action ai.

After running inference, we can extract the expected sucient statistics Mj [sj ] = P(Sj =

sj |ai) and M[s0

, s] = Pm

j=2 P(Sj = s0

, Sj1 = s|ai) from the calibrated clique tree for estimating

the initial state prior and the transition CPD, as explained below. Note that these statistics are

described for a single action, and you will need to aggregate these statistics across the entire

dataset of actions. We also extract the conditional class probabilities P(Sj = sj |ai) for each

state to estimate the emission CPD, as explained below. Also, as in PA7, we can compute the

likelihood of an action by marginalizing out the probabilities of any of the cliques.

M-step:

Using the aggregated expected sucient statistics, we can re-estimate the parameters in the

model:

• Initial state prior: The prior P(S1) can be re-estimated as:

P(S1 = s1) = M1[s1]

PK

k=1 M1[k]

• Transition CPD: We can re-estimate these using the expected sucient statistics as:

P(S0 = s0

|S = s) = M[s0

, s]

PK

k=1 M[k, s]

• Emission CPD: The parameters for the emission probabilities are the CLG parameters

used in pose clustering, and can be fit in the same way as the previous section, where we

treat each pose as having a weight equal to its conditional class probability, which reduces

to the M-step from pose clustering. Essentially, in this step you are doing pose clustering

over all the poses in all the actions.

4.3 Data Structures and Code

In practice, there are many numerical issues that arise when implementing the HMM. Thus, for

many of these functions, we would like to keep our probabilities in log space whenever possible,

and to always normalize in log space to avoid numerical issues. Before you start writing this

function, be sure to go over the details of the useful data structures and functions in Sections 5

and 6.

• EM HMM.m (35 points) — This is the function where you will implement the EM procedure

to learn the parameters of the action HMM. Many parts of this function should be the

same as portions of EM cluster.m. This function takes as input the dataset of actions, the

dataset of poses, a known graph structure G, a set of initial probabilities, and the number

of iterations to be run. Descriptions for the input and output parameters are given as

comments in the code. Follow the comments and structure given in the code to complete

this function. Note that we start with the M-step first.

CS228 Programming Assignment #9 8

4.4 Recognizing Actions

Now that you’ve implemented the EM algorithm to learn an HMM, we can now move on to our

end goal of action recognition. As discussed in Section 4, we will be using a Bayesian classifier,

which means that we will train a separate HMM for each action type, then classify actions based

on which HMM gives the highest likelihood for the action.

• RecognizeActions.m (10 points) — In this function, you should train an HMM for each

action class using the training set datasetTrain. Then, classify each of the instances in

the test set datasetTest and compute the classification accuracy. Details on the datasetTrain and datasetTest data structures can be found in Section 6.

At this point you have successfully implemented an action recognition system. Let’s see how

well this performs in practice. Run the command:

load PA9Data;

[accuracy, predicted_labels] = RecognizeActions(datasetTrain1, datasetTest1, G, 10)

If everything is implemented correctly, you should obtain an accuracy of 82%, which is

excellent accuracy on the recognition task; in comparison, random guessing will yield an accuracy

of 33%. You can use the VisualizeDataset.m function to visualize the actions. Try visualizing

some of the actions used to train the HMMs as well as the unknown actions that were classified

using the HMMs. Also, we have set the cardinality of the hidden state variables to be 3 for faster

training.

4.4.1 E↵ect of EM Initialization on Recognition Accuracy

In the lectures, we learnt that the initialization of the EM algorithm can significantly a↵ect

the quality of the model that it returns. You will now see this e↵ect for yourself. Execute the

following command, which uses the same datasets for training and testing, but uses a di↵erent

set of initialializations:

[accuracy, predicted_labels] = RecognizeActions(datasetTrain2, datasetTest2, G, 10)

You should see that the accuracy drops to 73%, which is 9% less than that obtained with

the previous initialization! Both of the initial model parameters were randomly sampled from

a uniform distribution, but the di↵erence in accuracy is significant. This occurs because of

the multiple local maxima present in the likelihood function: the latter initialization caused the

algorithm to converge to a di↵erent local maximum. At this particular local maxima, the learned

models find it harder to distinguish between the di↵erent actions. This example illustrates the

importance of initialization in learning HMMs, and more generally speaking, latent variable

models. Typically, we will want to construct an initial set of model parameters using knowledge

about the latent variables in our model, rather than picking these parameters randomly.

4.5 Extending the Model

Your final task is to refine and extend the code you’ve written for recognizing actions. We

have provided you with the variables datasetTrain3 and datasetTest3 in the PA9Data.mat

file. Your task is to train models on the datasetTrain3 data to recognize the actions in

datasetTest3. We’ve given you a random set of initial probabilities in datasetTrain3 that

don’t perform very well. Note that we do not give you the labels in datasetTest3, which is

similar to the extra credit scenario in PA7.

CS228 Programming Assignment #9 9

• RecognizeUnknownActions.m (20 points) — In this function, you should train a model

for each action class using the training set datasetTrain3. Then, classify each of the instances in the test set datasetTest3 and save the classifier predictions by calling SavePredictions.m.

This function is left empty for you to be creative in your approach (we give some suggestions

below). When you are done, write a short description of your method in YourMethod.txt,

which is submitted along with your predictions in the submit script. Your score for this

part will be determined by the percentage of unknown action instances you successfully

recognize. If you obtain an accuracy of x%, you will receive 0.2x points for this part.

To help you get started, here are some suggestions. Try to be creative in your approach, leveraging the ideas you have learned over the entire class. Good luck!

• Better initializations — Can you devise a better method for initializing the model’s

parameters that will help EM to find a good local maxima? For example, you can try

initializing the probabilities by clustering the poses together using a simple method like

K-means or Hierarchical Agglomerative Clustering (HAC). To evaluate your initalizations,

you may want to split your dataset into a training set and a validation set, so you can

check your performance on the validation set.

• Hidden state variables — In the sample data we had you run, we fixed the hidden

state variables Si to have cardinality of 3. However, it’s possible that there are more than

3 underlying pose classes. Is there an optimal cardinality for the hidden state variables?

Again, a validation set might be useful to evaluate your performance.

• Extending the HMM — Is the HMM the best model for our action data? You might try

implementing a more complicated model, such as a duration HMM. Another possibility is

to place restrictions on the states and their transitions. For example, we might be certain

that the first state is always the same pose, and that the transitions typically occur in a

fixed order. How can we encode this in the initial state probabilities and transition matrix?

• Generating more data — The amount of data you are given to learn the HMMs is not

large. Given that we are learning a large number of parameters, more data could be very

helpful. Can you think of ways of warping the existing instances you have to obtain more

training data?

5 Useful Functions for HMM

List of functions we have written for you that you may find useful:

• ComputeExactMarginalsHMM.m — Similar to the ComputeExactMarginalsBP.m function you wrote in PA4. This function takes as input a set of factors in log space for an

HMM, and runs clique tree inference in log space to avoid numerical issues. The function

returns marginals (normalized) for each variable, as well as the calibrated clique tree (unnormalized). Note that the messages are unnormalized, as this may be helpful in computing

the log likelihood.

• SavePrediction.m — Call this function to save your predictions for the unknown actions.

This will prepare the predictions in a mat file so that they can be submitted in the submit

script.

• VisualizeDataset.m — Helps to visualize a dataset of N poses of size N x 10 x 3.

CS228 Programming Assignment #9 10

6 Useful Data Structures for HMM

• P — This structure holds the model parameters. There are 2 changes made to this structure

for HMM’s that make this di↵erent from PA8 and the pose clustering section. First, there

is now a field P.transMatrix, which stores the K x K transition matrix, where K is the

number of possible classes. Second, the field P.c now contains the initial state prior

probabilities, instead of the prior class probability for all states.

• actionData — This is an array of structs that contains the actions. Each element of the

array is an instance of an action, with the fields action, marg ind, and pair ind defined

as follows:

– action — a string with the name of this action.

– marg ind — a set of indices that index the first dimensions in the associated poseData and ClassProb matrices that correspond to the sequence of poses that make up

this action. Thus, if marg ind= [10, 11,…, 19] then poseData(10, :, :), poseData(11,

:, :), …, poseData(19, :, 🙂 give the coordinates of this action’s 19-10+1=10 component poses. Additionally, rows 10 through 19 of ClassProb give the corresponding

class assignment probabilities of each of these individual poses.

– pair ind — a set of indices giving the rows in the PairProb data structure that

correspond to the pairwise transition probabilities for each pair of consecutive states

Si and Si1 in the action. Thus, if pair ind=[1, 2,… 9] then the pairwise transition

probability between the first and second poses in this action will be in row 1 of

PairProb, the transition between the 2nd and 3rd will be in row 2, and so forth.

• ClassProb — N x K matrix that contains the conditional class probabilities of the N

poses to the K classes. Note that N here is the total number of poses over all the actions.

ClassProb(i, j) is the conditional probability that pose i belongs to state j. Rows of this

matrix are indexed by the actionData structure. Using this matrix, you should be able

to estimate parameters in the M-step for the initial state prior and the emission CPDs.

• PairProb — V x K2 matrix that contains the pairwise transition probabilities (expected

sucient statistics for the transition CPD, M[s0

, s]) for every pair of consecutive states Si

and Si1 in all actions. Thus, every pair of consecutive states connected by an edge has

an entry in this matrix. V is the number of these edges. Rows of this matrix are indexed

by the actionData structure.

• datasetTrain — This is an array of structs that contains actions used for training. Each

element of the array is a struct for a di↵erent action, and contains the fields actionData,

poseData, InitialClassProb, and InitialPairProb, which are described above. The first

element of the array is “clap”, the second is “high kick”, and the third is “low kick”.

• datasetTest — This is the struct that contains actions used for testing. There are 3 fields:

actionData, poseData, and labels. The actionData and poseData are described above, and

the labels field is an N x 1 vector with the labels for each of the N action instances in

the datasetTest (with “clap” = 1, “high kick” = 2, “low kick” = 3). Note that for the

unknown test data, the labels field is not provided.

CS228 Programming Assignment #9 11

7 Conclusion

Congratulations! You’ve completed the final programming assignment for CS228! In this programming assignment, you’ve put together many of the ideas you’ve learned in class, ranging

from the basic factor data structures to advanced concepts like the Expectation-Maximization

algorithm, to create a system that is able to recognize human actions in real-world data. We

hope that you’ve enjoyed the assignment, and good luck for the final!