## Description

1 Undirected Graphical Models

1.1 Conditional UGM

Consider modeling the dependencies between sets of binary variables xj and yj with the following UGM, which is a variation on a stacked RBM (the figure of the graph is not reproduced here):

Computing univariate marginals in this model will be NP-hard in general, but the graph structure allows

efficient block updates by conditioning on suitable subsets of the variables (this could be useful for designing

approximate inference methods). For each of the conditioning scenarios below, draw the conditional UGM

and comment on how expensive it would be to compute univariate marginals in the conditional UGM.

1. Conditioning on all the z and h values.

2. Conditioning on all the x and h values.

3. Conditioning on all the z and y values.

4. Conditioning on all the x and z values.
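For intuition about why such conditioning can make inference cheap: if the conditional UGM turns out to be a chain (or tree), univariate marginals can be computed exactly by message passing in time linear in the number of nodes. The sketch below is a generic 3-node binary chain in Python with made-up potentials, not the assignment's graph (whose figure is not shown here), just to illustrate the cost argument:

```python
import numpy as np

# A generic binary chain x1 - x2 - x3 with pairwise potentials;
# the potential values are arbitrary illustrative numbers.
psi12 = np.array([[2.0, 1.0], [1.0, 3.0]])
psi23 = np.array([[1.0, 2.0], [2.0, 1.0]])

# Message passing: marginal of the middle node x2 costs O(n * k^2).
forward = psi12.sum(axis=0)      # sum over x1
backward = psi23.sum(axis=1)     # sum over x3
unnorm = forward * backward
p_x2 = unnorm / unnorm.sum()

# Brute-force check over all 2^3 configurations.
brute = np.zeros(2)
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            brute[x2] += psi12[x1, x2] * psi23[x2, x3]
brute /= brute.sum()
print(p_x2, brute)  # the two should agree
```

If conditioning instead leaves a loopy conditional graph, this linear-time argument no longer applies, which is exactly what the question asks you to comment on.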


1.2 Fitting a UGM to PINs

The function example UGM loads a dataset X containing samples of PIN numbers, based on the probabilities from the article at this URL: http://www.datagenetics.com/blog/september32012.¹

This function shows how to use the UGM software to fit a UGM model to the dataset, where all node/edge

parameters are tied and the graph is empty. It then performs decoding/inference/sampling in the fitted

model. Unfortunately, this is not a very good model of the data for several reasons:

1. The decoding is 1 1 1 1, whereas in the data the most likely value by far is 1 2 3 4. Similarly, the sampler doesn’t tend to generate 1 2 3 4 even though this happens in more than 1/10 of samples.

2. The marginal probability of the first number being 1 is 22.06%, which is actually too low (it should be 38.54%). In addition, the marginal probabilities of the remaining numbers being 1 are also 22.06%, and these numbers are too high.

3. Conditioned on the first three numbers being 1 2 3, the probability that the last number is 4 is less

than 10% in the model, whereas in the data it’s more than 90%.

In this question you’ll explore better models of this data and different aspects of UGMs.

1. Why does w have a length of 9?

2. Write an equation for the model being used by the code.

3. What are potential sources of the problems above?

4. Modify the demo to use a tied value of 0 and re-run the demo. Why does the model now have 36

parameters? Comment on whether this fixes each of the above 3 issues.

5. Modify the demo to use chain-structured dependency (keeping the tied value at 0). Comment on

whether this fixes each of the above 3 issues.

6. Modify the demo to use a completely-connected graph (keeping the tied value at 0). Comment on

whether this fixes each of the above 3 issues.

7. UGM only supports pairwise graphs; what would the effect of higher-order potentials be? What would the disadvantages of higher-order potentials be?

If you want to further explore UGMs, there are quite a few demos on the UGM webpage that you can go through, covering all sorts of things like approximate inference and CRFs.
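Before changing the model, it can help to verify claims like the three above directly from the samples. A minimal Python sketch (the assignment itself works in Matlab with the UGM package; the array below is a made-up stand-in for the real X, with one 4-digit PIN per row):

```python
import numpy as np

# Made-up stand-in for the PIN dataset (NOT the real data).
X = np.array([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 1, 1, 1],
    [2, 2, 3, 4],
])

# Empirical marginal probability that each of the four digits equals 1.
marginals = (X == 1).mean(axis=0)

# Empirical conditional probability that the last digit is 4,
# given that the first three digits are 1 2 3.
mask = (X[:, 0] == 1) & (X[:, 1] == 2) & (X[:, 2] == 3)
p_last_is_4 = (X[mask, 3] == 4).mean()

print(marginals, p_last_is_4)
```

Comparing numbers like these against the fitted model's decoding, marginals, and conditionals is one way to judge whether each modification below actually fixes the three issues.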

2 Bayesian Inference

2.1 Conjugate Priors

Consider a y ∈ {1, 2, 3} following a multinoulli distribution with parameters θ = {θ1, θ2, θ3},

y|θ ∼ Mult(θ1, θ2, θ3).

We’ll assume that θ follows a Dirichlet distribution (the conjugate prior to the multinoulli) with parameters

α = {α1, α2, α3},

θ ∼ D(α1, α2, α3).

¹ I got the probabilities from the reverse-engineered heatmap here: http://jemore.free.fr/wordpress/?p=73.

Thus we have

$$p(y \mid \theta, \alpha) = p(y \mid \theta) = \theta_1^{I(y=1)} \theta_2^{I(y=2)} \theta_3^{I(y=3)}, \qquad p(\theta \mid \alpha) = \frac{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)} \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \theta_3^{\alpha_3 - 1}.$$

Compute the following quantities:

1. The posterior distribution,

p(θ|y, α).

2. The marginal likelihood of y given the hyper-parameters α,

$$p(y \mid \alpha) = \int p(y, \theta \mid \alpha) \, d\theta.$$

3. The posterior mean estimate for θ,

$$\mathbb{E}_{\theta \mid y, \alpha}[\theta_i] = \int \theta_i \, p(\theta \mid y, \alpha) \, d\theta,$$

which (after some manipulation) should not involve any Γ functions.

4. The posterior predictive distribution for a new independent observation ˆy given y,

$$p(\hat{y} \mid y, \alpha) = \int p(\hat{y}, \theta \mid y, \alpha) \, d\theta.$$

Hint: You can use $D(\alpha) = \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)}{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}$ to represent the normalizing constant of the prior and $D(\alpha^+)$ to give the normalizing constant of the posterior. You will also need to use that $\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)$. For some calculations you may find it a bit cleaner to parameterize the posterior in terms of $\beta_j = I(y = j) + \alpha_j$, and convert back once you have the final result.
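Though the derivations above should be done by hand, conjugacy is easy to sanity-check numerically. The sketch below (Python, with made-up values for α and y; not part of the assignment) compares the closed-form Dirichlet posterior mean against a likelihood-weighted Monte Carlo estimate over prior samples:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 1.0])   # hypothetical hyper-parameters
# Suppose we observe y = 1, so counts = [1, 0, 0].

# Conjugacy says the posterior is Dirichlet(alpha + counts),
# whose mean is beta / sum(beta) with beta_j = I(y = j) + alpha_j.
beta = alpha + np.array([1.0, 0.0, 0.0])
exact_mean = beta / beta.sum()

# Monte Carlo check: sample theta from the prior and weight each
# sample by the likelihood p(y = 1 | theta) = theta_1.
theta = rng.dirichlet(alpha, size=200_000)
w = theta[:, 0]
mc_mean = (theta * w[:, None]).sum(axis=0) / w.sum()

print(exact_mean, mc_mean)  # the two should be close
```

The agreement between the two estimates is a useful check on the algebra, but the question still asks for the closed-form derivation.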

2.2 Empirical Bayes

Consider the model

$$y_i \sim \mathcal{N}(w^T \phi(x_i), \sigma^2), \qquad w_j \sim \mathcal{N}(0, \lambda).$$

By using properties of Gaussians the marginal likelihood has the form

$$p(y_i \mid x_i, \sigma, \lambda) = (2\pi)^{-d/2} \, |C|^{-1/2} \exp\!\left(-\frac{y^T C^{-1} y}{2}\right),$$

which gives a negative log-marginal likelihood of

$$-\log p(y_i \mid x_i, \sigma, \lambda) \propto \log |C| + y^T C^{-1} y + \text{const.}$$

where

$$C = \frac{1}{\sigma^2} I + \frac{1}{\lambda} \Phi(X) \Phi(X)^T.$$

As discussed in class, the marginal likelihood can be used to optimize hyper-parameters like σ, λ, and even

the basis φ.

The demo example basis loads a dataset and fits a degree-2 polynomial to it. Normally we would use a test

set to choose the degree of the polynomial but here we’ll use the marginal likelihood of the training set.


Write a function, leastSquaresEmpiricalBayes, that uses the marginal likelihood to choose the degree of the

polynomial as well as the parameters λ and σ (you can assume that all λj are equal, and you can restrict

your search for λ and σ to powers of 10). Hand in your code and report the marginally most likely values of

the degree, σ, and λ. You can use the logdet function to compute the log-determinant.

Hint: computing $C^{-1}y$ by explicitly forming $C^{-1}$ may give you numerical issues that lead to nonsensical solutions. You can avoid these by using $y^T C^{-1} y = y^T v$ where $v$ is a solution to $Cv = y$ (Matlab will still give a warning due to ill-conditioning, but it won’t return nonsensical results).
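The assignment code is Matlab, but the numerical point of the hint carries over to any language: solve $Cv = y$ rather than forming $C^{-1}$. Below is a rough Python sketch of the whole search, with made-up toy data standing in for the demo's dataset and $C$ taken in the form given above; the grid, degree range, and names are illustrative assumptions, not the required implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (NOT the demo's dataset): a noisy quadratic.
n = 40
x = np.linspace(-1, 1, n)
y = 2 * x**2 - x + 0.1 * rng.standard_normal(n)

def poly_basis(x, degree):
    # Columns [1, x, x^2, ..., x^degree].
    return np.vander(x, degree + 1, increasing=True)

def neg_log_marginal(y, Phi, sigma, lam):
    # C as given in the handout: (1/sigma^2) I + (1/lambda) Phi Phi^T.
    C = np.eye(len(y)) / sigma**2 + Phi @ Phi.T / lam
    _, logdet = np.linalg.slogdet(C)   # stable log-determinant
    v = np.linalg.solve(C, y)          # avoids forming C^{-1}
    return logdet + y @ v

# Grid search over the degree and powers of 10 for sigma and lambda.
best = None
for degree in range(5):
    Phi = poly_basis(x, degree)
    for sigma in [10.0**k for k in range(-2, 3)]:
        for lam in [10.0**k for k in range(-2, 3)]:
            nll = neg_log_marginal(y, Phi, sigma, lam)
            if best is None or nll < best[0]:
                best = (nll, degree, sigma, lam)
print(best[1:])  # marginally most likely degree, sigma, lambda
```

In Matlab the `np.linalg.solve(C, y)` step corresponds to `C \ y`, and `slogdet` plays the role of the provided logdet function.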

3 Literature Survey

Reading academic papers is a skill that takes practice. When you first start out reading papers, you may

find that you need to re-read things several times before you understand them, or that details will still be

very fuzzy even after you’ve put a great amount of effort into trying to understand a paper. Don’t panic; this is normal.

Even if you are used to reading papers from your particular sub-area, it can be challenging to read papers

about a completely different topic. Usually, people in different areas use different language/notation and

focus on very different issues. Nevertheless, many of the most-successful people in academia and industry

are those that are able to understand/adapt ideas from different areas. (There are a ton of smart people in the world working on all sorts of amazing things; it’s good to know how to communicate with as many of them as possible.)

A common technique when trying to understand a new topic (or reading scientific papers for the first time)

is to read and write notes on 10 papers on the topic. When you read the first paper, you’ll often find that

it’s hard to follow. This can make reading take a long time and might still leave you feeling that many things

don’t make sense; keep reading and trying to take notes. When you get to the second paper, it might still

be very hard to follow. But when you start getting to the 8th or 9th paper, things often start making more sense. You’ll start to form an impression of what the influential works in the area are, you’ll start getting used to the language and jargon, you’ll start to understand the main issues that people who work on the topic care about, and you’ll probably notice some important references that weren’t on your initial list

of 10 papers. Ideally, you’ll also start to notice how the topic has changed over time and you may get ideas

of future work that you could do on the topic.

To help you make progress on your project or to give you an excuse to learn about a new topic, for this

part you should write a literature survey of at least 10 academic papers on a particular topic. While

your personal notes on the papers may be longer, the survey should be at most 4 pages of text (excluding

references/tables/figures) in a format similar to the one for this document. Some logical components of a

literature survey might be:

• A description of the overall topic, and the key themes/trends across the papers.

• A short high-level description of what was explored in each paper. For example, describe the problem

being addressed, the key components of the proposed solution, and how it was evaluated. In addition,

it is important to comment on the why questions: why is this problem important and why would this

particular solution method make progress on it? It’s also useful to comment on the strengths and

weaknesses of the various works, and it’s particularly nice if you can show how some works address the

weaknesses of prior works (or introduce new weaknesses).

• One or more logical “groupings” of the papers. This could be in terms of the variant of the topic that

they address, in terms of the solution techniques used, or in chronological terms.

Some advice on choosing the topic:


• The most logical/easy topic for your literature survey is a topic related to your course project, given

that your final report will need a (shorter) literature survey included.

• If you are an undergrad, or a masters student without a research project yet, you may alternately want

to choose a general area (like variance-reduced stochastic gradient, non-Gaussian graphical models,

recurrent neural networks, matrix factorization, neural style transfer, Bayesian optimization, etc.) as

your topic.

• If you are a masters student that already has a thesis project, it could make sense to do a survey on a

topic where ML intersects with your thesis (or where ML could intersect your thesis).

• If you are a PhD student, you could use this as an excuse to learn about a completely different topic than

what you normally work on. This can be invaluable to your future research, because during your PhD

it’s often hard to allocate time to learn completely new topics.

Bonus: Final Questions

For this question I want you to prepare up to 10 questions that could be added to the final exam. These

questions won’t directly be graded, but for each of your questions that are ultimately used on the exam

you’ll get an extra 1% added to your exam mark. This has the added advantage that you may know what

some of the final questions are going to be. However, please do not collude by sharing the questions you

submit with other groups: I would treat this as academic dishonesty and it could lead to a mark of 0 on the

final.

In terms of topics, keep in mind that we are only testing on the topics covered in the assignments. This

means that questions can be on topics like cross-validation, coordinate optimization, and Viterbi decoding.

But questions should not be on things like ICA, neural networks, or treewidth.

In terms of difficulty, I am aiming to have a few “easier” questions that test general knowledge, and a couple

questions that require more thinking. I am not looking for trick/language questions that require a particular

interpretation of the phrasing of the question or questions that require reading large amounts of text.

In terms of questions, we are looking for questions that can be graded quickly since we have limited TA

hours remaining. Given this constraint, some possible question formats are:

• True/false.

• Multiple choice.

• Select all that apply.

• Matching of items to concepts or definitions.

• Short calculations.
