Description
1 Undirected Graphical Models
1.1 Conditional UGM
Consider modeling the dependencies between sets of binary variables x_j and y_j with the following UGM, which is a variation on a stacked RBM:
[Figure: a stacked-RBM-style UGM over the x, y, z, and h variables.]
Computing univariate marginals in this model will be NP-hard in general, but the graph structure allows
efficient block updates by conditioning on suitable subsets of the variables (this could be useful for designing
approximate inference methods). For each of the conditioning scenarios below, draw the conditional UGM and comment on how expensive it would be to compute univariate marginals in it (a sketch for checking your answers numerically in UGM follows the list).
1. Conditioning on all the z and h values.
2. Conditioning on all the x and h values.
3. Conditioning on all the z and y values.
4. Conditioning on all the x and z values.
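If you later want to check your answers numerically, the UGM package expresses conditioning through a "clamped" vector whose nonzero entries give the observed states. A minimal sketch of scenario 1, assuming nodePot, edgePot, and edgeStruct have already been built for the model above, and where zIdx/hIdx and zObs/hObs are hypothetical index and value vectors for the observed variables:

    % Clamp the z and h variables to their observed states; zeros mean "unobserved".
    nNodes = size(nodePot, 1);
    clamped = zeros(nNodes, 1);
    clamped(zIdx) = zObs;
    clamped(hIdx) = hObs;
    % Inference in the implied conditional UGM; an exact method is feasible
    % when the conditional graph is disconnected or tree-structured.
    [nodeBel, edgeBel, logZ] = UGM_Infer_Conditional(nodePot, edgePot, ...
        edgeStruct, clamped, @UGM_Infer_Exact);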
1.2 Fitting a UGM to PINs
The function example_UGM loads a dataset X containing samples of PIN numbers, based on the probabilities from the article at this URL: http://www.datagenetics.com/blog/september32012. (I got the probabilities from the reverse-engineered heatmap here: http://jemore.free.fr/wordpress/?p=73.)
This function shows how to use the UGM software to fit a UGM model to the dataset, where all node/edge
parameters are tied and the graph is empty. It then performs decoding/inference/sampling in the fitted
model. Unfortunately, this is not a very good model of the data for several reasons:
1. The decoding is 1 1 1 1, whereas in the data the most likely value by far is 1 2 3 4. Similarly, the sampler doesn’t tend to generate 1 2 3 4, even though this value occurs in more than 1/10 of the samples.
2. The marginal probability of the first number being 1 is 22.06%, which is actually too low (it should be 38.54%). In addition, the marginal probabilities of the remaining numbers being 1 are also 22.06%, and these numbers are too high. (A quick way to check these empirical values against the data appears after this list.)
3. Conditioned on the first three numbers being 1 2 3, the probability that the last number is 4 is less
than 10% in the model, whereas in the data it’s more than 90%.
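For reference, these empirical quantities can be read directly off the data. A quick sketch, assuming X stores one PIN per row and that the digit values appear directly as states (a hypothetical coding; adjust the comparisons if the demo codes digits 0-9 as states 1-10):

    % Empirical marginal: how often the first digit is a 1 (about 38.54%).
    fprintf('p(first digit = 1) = %.2f%%\n', 100 * mean(X(:,1) == 1));
    % Empirical conditional: p(fourth digit = 4 | first three are 1 2 3).
    rows = X(:,1) == 1 & X(:,2) == 2 & X(:,3) == 3;
    fprintf('p(4 | 1 2 3) = %.2f%%\n', 100 * mean(X(rows,4) == 4));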
In this question you’ll explore better models of this data and different aspects of UGMs.
1. Why does w have a length of 9?
2. Write an equation for the model being used by the code.
3. What are potential sources of the problems above?
4. Modify the demo to use a tied value of 0 and re-run the demo. Why does the model now have 36
parameters? Comment on whether this fixes each of the above 3 issues.
5. Modify the demo to use a chain-structured dependency (keeping the tied value at 0); a sketch of the required adjacency matrix appears after this list. Comment on whether this fixes each of the above 3 issues.
6. Modify the demo to use a completely-connected graph (keeping the tied value at 0). Comment on
whether this fixes each of the above 3 issues.
7. UGM only supports pairwise graphs; what would the effect of higher-order potentials be? What would the disadvantages of higher-order potentials be?
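For question 5, the only structural change to the demo is the adjacency matrix handed to UGM. A minimal sketch, assuming the demo's nStates variable is in scope:

    % Chain-structured graph over the 4 PIN digits: edges 1-2, 2-3, 3-4.
    nNodes = 4;
    adj = zeros(nNodes);
    for j = 1:nNodes-1
        adj(j, j+1) = 1;
        adj(j+1, j) = 1;
    end
    edgeStruct = UGM_makeEdgeStruct(adj, nStates);
    % For question 6, use a completely-connected graph instead:
    % adj = ones(nNodes) - eye(nNodes);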
If you want to further explore UGMs, there are quite a few demos on the UGM webpage that you can go
through which cover all sorts of things like approximate inference and CRFs.
2 Bayesian Inference
2.1 Conjugate Priors
Consider a variable y ∈ {1, 2, 3} following a multinoulli distribution with parameters θ = {θ_1, θ_2, θ_3},

y | θ ∼ Mult(θ_1, θ_2, θ_3).

We’ll assume that θ follows a Dirichlet distribution (the conjugate prior to the multinoulli) with parameters α = {α_1, α_2, α_3},

θ ∼ D(α_1, α_2, α_3).
Thus we have

p(y | θ, α) = p(y | θ) = θ_1^{I(y=1)} θ_2^{I(y=2)} θ_3^{I(y=3)},

p(θ | α) = [Γ(α_1 + α_2 + α_3) / (Γ(α_1) Γ(α_2) Γ(α_3))] θ_1^{α_1 − 1} θ_2^{α_2 − 1} θ_3^{α_3 − 1}.
Compute the following quantities:
1. The posterior distribution,
p(θ|y, α).
2. The marginal likelihood of y given the hyper-parameters α,

p(y | α) = ∫ p(y, θ | α) dθ.
3. The posterior mean estimate for θ,
E_{θ|y,α}[θ_i] = ∫ θ_i p(θ | y, α) dθ,
which (after some manipulation) should not involve any Γ functions.
4. The posterior predictive distribution for a new independent observation ŷ given y,

p(ŷ | y, α) = ∫ p(ŷ, θ | y, α) dθ.
Hint: You can use D(α) = Γ(α_1) Γ(α_2) Γ(α_3) / Γ(α_1 + α_2 + α_3) to represent the normalizing constant of the prior, and D(α⁺) to give the normalizing constant of the posterior. You will also need to use that Γ(α + 1) = αΓ(α). For some calculations you may find it a bit cleaner to parameterize the posterior in terms of β_j = I(y = j) + α_j, and convert back once you have the final result.
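For example, with this parameterization the first step of question 1 is a one-line pattern match (shown here only in unnormalized form; identifying the normalizing constant is left to you):

p(θ | y, α) ∝ p(y | θ) p(θ | α) ∝ θ_1^{I(y=1) + α_1 − 1} θ_2^{I(y=2) + α_2 − 1} θ_3^{I(y=3) + α_3 − 1} = θ_1^{β_1 − 1} θ_2^{β_2 − 1} θ_3^{β_3 − 1},

which has the functional form of a Dirichlet distribution with parameters β.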
2.2 Empirical Bayes
Consider the model
y_i ∼ N(wᵀφ(x_i), σ²), w_j ∼ N(0, λ).
By using properties of Gaussians the marginal likelihood has the form
p(y_i | x_i, σ, λ) = (2π)^{−d/2} |C|^{−1/2} exp(−yᵀC⁻¹y / 2),
which gives a negative log-marginal likelihood of
− log p(y_i | x_i, σ, λ) ∝ log |C| + yᵀC⁻¹y + const.
where
C = (1/σ²) I + (1/λ) Φ(X)Φ(X)ᵀ.
As discussed in class, the marginal likelihood can be used to optimize hyper-parameters like σ, λ, and even
the basis φ.
The demo example_basis loads a dataset and fits a degree-2 polynomial to it. Normally we would use a test set to choose the degree of the polynomial, but here we’ll use the marginal likelihood of the training set.
Write a function, leastSquaresEmpiricalBayes, that uses the marginal likelihood to choose the degree of the polynomial as well as the parameters λ and σ (you can assume that all λ_j are equal, and you can restrict your search for λ and σ to powers of 10). Hand in your code and report the marginally most likely values of the degree, σ, and λ. You can use the logdet function to compute the log-determinant.
Hint: computing C⁻¹y by explicitly forming C⁻¹ may give you numerical issues that lead to nonsensical solutions. You can avoid these by using yᵀC⁻¹y = yᵀv, where v is a solution to Cv = y (Matlab will still give a warning due to ill-conditioning, but it won’t return nonsensical results).
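A minimal sketch of the search loop, assuming the demo provides the training data X and y, that polyBasis is a hypothetical helper returning the polynomial basis expansion Φ(X) for a given degree, and that the grid ranges below are placeholders you should widen as needed:

    bestNLL = inf;
    for degree = 0:10
        Phi = polyBasis(X, degree);  % hypothetical: [ones(n,1) X X.^2 ... X.^degree]
        for sigma = 10.^(-2:2)
            for lambda = 10.^(-2:2)
                % C as defined above: (1/sigma^2) I + (1/lambda) Phi Phi'
                C = (1/sigma^2) * eye(size(Phi,1)) + (1/lambda) * (Phi * Phi');
                v = C \ y;  % solve Cv = y rather than forming inv(C), as the hint suggests
                NLL = logdet(C) + y' * v;  % negative log-marginal likelihood up to constants
                if NLL < bestNLL
                    bestNLL = NLL;
                    bestDegree = degree; bestSigma = sigma; bestLambda = lambda;
                end
            end
        end
    end
    fprintf('degree = %d, sigma = %g, lambda = %g\n', bestDegree, bestSigma, bestLambda);

The nested loops are crude but fine at this scale; the main point is that the degree, σ, and λ are all compared on the same marginal-likelihood scale.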
3 Literature Survey
Reading academic papers is a skill that takes practice. When you first start out reading papers, you may
find that you need to re-read things several times before you understand them, or that details will still be
very fuzzy even after you’ve put a great amount of effort into trying to understand a paper. Don’t panic,
this is normal.
Even if you are used to reading papers from your particular sub-area, it can be challenging to read papers
about a completely different topic. Usually, people in different areas use different language/notation and
focus on very different issues. Nevertheless, many of the most successful people in academia and industry are those who are able to understand and adapt ideas from different areas. (There are a ton of smart people in the world working on all sorts of amazing things; it’s good to know how to communicate with as many of them as possible.)
A common technique when trying to understand a new topic (or reading scientific papers for the first time)
is to read and write notes on 10 papers on the topic. When you read the first paper, you’ll often find that
it’s hard to follow. This can make reading take a long time and might still leave you feeling that many things
don’t make sense; keep reading and trying to take notes. When you get to the second paper, it might still
be very hard to follow. But when you start getting to the 8th or 9th paper, things often start making more
sense. You’ll start to form an impression of what the influential works in the area are, you’ll start getting used to the language and jargon, you’ll start to understand the main issues that people who work on the topic care about, and you’ll probably notice some important references that weren’t on your initial list
of 10 papers. Ideally, you’ll also start to notice how the topic has changed over time and you may get ideas
of future work that you could do on the topic.
To help you make progress on your project or to give you an excuse to learn about a new topic, for this
part you should write a literature survey of at least 10 academic papers on a particular topic. While
your personal notes on the papers may be longer, the survey should be at most 4 pages of text (excluding
references/tables/figures) in a format similar to the one for this document. Some logical components of a
literature survey might be:
• A description of the overall topic, and the key themes/trends across the papers.
• A short high-level description of what was explored in each paper. For example, describe the problem
being addressed, the key components of the proposed solution, and how it was evaluated. In addition,
it is important to comment on the why questions: why is this problem important and why would this
particular solution method make progress on it? It’s also useful to comment on the strengths and
weaknesses of the various works, and it’s particularly nice if you can show how some works address the
weaknesses of prior works (or introduce new weaknesses).
• One or more logical “groupings” of the papers. This could be in terms of the variant of the topic that
they address, in terms of the solution techniques used, or in chronological terms.
Some advice on choosing the topic:
• The most logical/easy topic for your literature survey is a topic related to your course project, given
that your final report will need a (shorter) literature survey included.
• If you are an undergrad, or a masters student without a research project yet, you may alternatively want to choose a general area (like variance-reduced stochastic gradient, non-Gaussian graphical models,
recurrent neural networks, matrix factorization, neural style transfer, Bayesian optimization, etc.) as
your topic.
• If you are a masters student that already has a thesis project, it could make sense to do a survey on a
topic where ML intersects with your thesis (or where ML could intersect your thesis).
• If you are a PhD student, you could use this as an excuse to learn about a completely different topic than
what you normally work on. This can be invaluable to your future research, because during your PhD
it’s often hard to allocate time to learn completely new topics.
Bonus: Final Questions
For this question I want you to prepare up to 10 questions that could be added to the final exam. These
questions won’t directly be graded, but for each of your questions that are ultimately used on the exam
you’ll get an extra 1% added to your exam mark. This has the added advantage that you may know what
some of the final questions are going to be. However, please do not collude by sharing the questions you
submit with other groups: I would treat this as academic dishonesty and it could lead to a mark of 0 on the
final.
In terms of topics, keep in mind that we are only testing on the topics covered in the assignments. This
means that questions can be on topics like cross-validation, coordinate optimization, and Viterbi decoding.
But questions should not be on things like ICA, neural networks, or treewidth.
In terms of difficulty, I am aiming to have a few “easier” questions that test general knowledge, and a couple
questions that require more thinking. I am not looking for trick/language questions that require a particular
interpretation of the phrasing of the question or questions that require reading large amounts of text.
Finally, we are looking for questions that can be graded quickly, since we have limited TA hours remaining. Given this constraint, some possible question formats are:
• True/false.
• Multiple choice.
• Select all that apply.
• Matching of items to concepts or definitions.
• Short calculations.