Description
1.1 Mean-Field Approximation for Multivariate Gaussians
In this question, we’ll explore how accurate a Mean-Field approximation can be for an underlying multivariate Gaussian distribution.
Assume we have observed data $X = \{x^{(i)}\}_{i=1}^{n}$ that was drawn from a 2-dimensional Gaussian distribution $p(x; \mu, \Lambda^{-1})$.
$$p(x; \mu, \Lambda) = \mathcal{N}\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}; \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}^{-1}\right) \tag{1.1}$$
Note here that we're using the precision matrix $\Lambda = \Sigma^{-1}$. An additional property of the precision matrix is that it is symmetric, so $\Lambda_{12} = \Lambda_{21}$. This will make your lives easier for the math to come.
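As a reminder (a standard identity for symmetric $2 \times 2$ matrices, not something you need to prove), the covariance implied by the precision matrix is
$$\Sigma = \Lambda^{-1} = \frac{1}{\Lambda_{11}\Lambda_{22} - \Lambda_{12}^2}\begin{pmatrix}\Lambda_{22} & -\Lambda_{12} \\ -\Lambda_{12} & \Lambda_{11}\end{pmatrix},$$
which may be useful when plotting the true distribution.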
We will approximate this 2-dimensional Gaussian with a mean-field approximation, $q(x) = q(x_1)q(x_2)$, the product of two 1-dimensional distributions $q(x_1)$ and $q(x_2)$. For now, we won't assume any form for these distributions.
1. (1 point) Short Answer: Write down the equation for log p(X). For now, you can leave all of the
parameters in terms of vectors and matrices, not their subcomponents.
2. (2 points) Short Answer: Group together everything that involves $X_1$ and remove anything involving $X_2$. We claim that there exists some distribution $q^*(X) = q^*(X_1)q^*(X_2)$ that minimizes the KL divergence, $q^* = \operatorname{argmin}_q \mathrm{KL}(q \,\|\, p)$. Furthermore, this distribution will have a component $q^*(X_1)$ that is proportional to the quantity you find below.
It can be shown that this implies that $q(X_1)$ (and therefore $q(X_2)$) is a Gaussian distribution:
$$q(x_1) = \mathcal{N}(x_1; m_1, \Lambda_{11}^{-1}), \quad \text{where } m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\left(\mathbb{E}[x_2] - \mu_2\right).$$
Using these facts, we’d like to explore how well our approximation can model the underlying distribution.
3. Suppose the parameters of the true distribution are $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and $\Lambda = \begin{pmatrix} 1 & 0 \\ 0 & 1/4 \end{pmatrix}$.
(a) (1 point) Numerical Answer: What is the value of the mean of the Gaussian for $q^*(X_1)$?
(b) (1 point) Numerical Answer: What is the value of the variance of the Gaussian for $q^*(X_1)$?
(c) (1 point) Numerical Answer: What is the value of the mean of the Gaussian for $q^*(X_2)$?
(d) (1 point) Numerical Answer: What is the value of the variance of the Gaussian for $q^*(X_2)$?
(e) (2 points) Plot: Provide a computer-generated contour plot to show the result of our approximation $q^*(X)$ and the true underlying Gaussian $p(X; \mu, \Lambda)$ for the parameters given above (a plotting sketch is provided after question 5).
4. Suppose the parameters of the true distribution are $\mu = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ and $\Lambda = \begin{pmatrix} 1 & -3 \\ 0 & 1 \end{pmatrix}$.
(a) (1 point) Numerical Answer: What is the value of the mean of the Gaussian for $q^*(X_1)$?
(b) (1 point) Numerical Answer: What is the value of the variance of the Gaussian for $q^*(X_1)$?
(c) (1 point) Numerical Answer: What is the value of the mean of the Gaussian for $q^*(X_2)$?
(d) (1 point) Numerical Answer: What is the value of the variance of the Gaussian for $q^*(X_2)$?
(e) (2 points) Plot: Provide a computer-generated contour plot to show the result of our approximation $q^*(X)$ and the true underlying Gaussian $p(X; \mu, \Lambda)$ for the parameters given above (see the plotting sketch after question 5).
5. (1 point) Describe in words how the plots you generated provide insight into the behavior of minimizing $\mathrm{KL}(q \,\|\, p)$ with regard to the low-probability and high-probability regions of the true vs. approximate distributions.
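For the contour plots requested in parts 3(e) and 4(e), a minimal plotting sketch is given below. It assumes numpy, scipy, and matplotlib are available; the function name contour_compare and its argument names are our own, and q_means / q_vars must be filled in with the mean-field means and variances you computed in the preceding parts.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

def contour_compare(mu, Lam, q_means, q_vars):
    """Overlay contours of the true Gaussian p(x; mu, Lam^{-1}) and the
    mean-field approximation q(x1)q(x2), which is a diagonal Gaussian."""
    Sigma = np.linalg.inv(Lam)                       # true covariance
    xs = np.linspace(mu[0] - 4, mu[0] + 4, 200)
    ys = np.linspace(mu[1] - 4, mu[1] + 4, 200)
    X, Y = np.meshgrid(xs, ys)
    grid = np.dstack([X, Y])                         # shape (200, 200, 2)
    p = multivariate_normal(mu, Sigma).pdf(grid)
    q = multivariate_normal(q_means, np.diag(q_vars)).pdf(grid)
    plt.contour(X, Y, p, colors="C0")                # true distribution
    plt.contour(X, Y, q, colors="C1")                # mean-field approximation
    plt.xlabel("$x_1$")
    plt.ylabel("$x_2$")
    plt.show()
```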
1.2 Variational Inference for Gaussian Mixture Models
Now that we have seen how the mean-field approximation works for a multivariate Gaussian, let’s look at the
case of Gaussian Mixture Models. Suppose we have a Bayesian mixture of unit-variance univariate Gaussian
distributions. This mixture consists of 2 components, each corresponding to a Gaussian distribution, with means $\mu = \{\mu_1, \mu_2\}$. The mean parameters are drawn independently from a Gaussian prior distribution $\mathcal{N}(0, \sigma^2)$. The prior variance $\sigma^2$ is a hyperparameter. Generating an observation $x_i$ from this model is done
according to the following generative story:
1. Choose a cluster assignment $c_i$ for the observation. The cluster assignment is chosen from the distribution $\text{Categorical}(\tfrac{1}{2}, \tfrac{1}{2})$ and indicates which latent cluster $x_i$ comes from. Encode $c_i$ as a one-hot vector, where $[1, 0]$ indicates that $x_i$ is assigned to the first cluster and vice versa.
2. Generate $x_i$ from the corresponding Gaussian distribution $\mathcal{N}(c_i^T \mu, 1)$.
The complete hierarchical model is as follows:
µk ∼ N (0, σ2
), k ∈ {1, 2}
ci ∼ Categorical(
1
2
,
1
2
), i ∈ [1, n]
xi
|ci
, µ ∼ N (c
T
i µ, 1), i ∈ [1, n]
where n is the number of observations generated from the model.
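As a quick illustration (not required for the assignment; the function name sample_gmm and the use of numpy's default random generator are our own choices), the generative story above can be simulated as follows:

```python
import numpy as np

def sample_gmm(n, sigma2, seed=0):
    """Sample n observations from the Bayesian mixture described above."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, np.sqrt(sigma2), size=2)   # mu_k ~ N(0, sigma^2)
    c = rng.integers(0, 2, size=n)                  # c_i ~ Categorical(1/2, 1/2)
    # c is stored as an index; with a one-hot encoding, c_i^T mu is just mu[c].
    x = rng.normal(mu[c], 1.0)                      # x_i | c_i, mu ~ N(c_i^T mu, 1)
    return mu, c, x
```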
1. (1 point) What are the observed and latent variables for this model?
2. (1 point) Write down the joint probability of observed and latent variables under this model.
3. (3 points) Let’s calculate the ELBO (evidence lower-bound) for this model. Recall that the ELBO is
given by the following equation:
$$\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(x, z)] - \mathbb{E}_q[\log q(z)]$$
To calculate $q(z)$, we will now use the mean-field assumption. Under this assumption, each latent variable is governed by its own variational factor, resulting in the following probability distribution:
$$q(\mu, c) = \left(\prod_{k=1}^{2} q(\mu_k; m_k, v_k^2)\right)\left(\prod_{i=1}^{n} q(c_i; a_i)\right)$$
Here $q(\mu_k; m_k, v_k^2)$ is the Gaussian distribution for the $k$-th mixture component, with mean $m_k$ and variance $v_k^2$. $q(c_i; a_i)$ is the categorical distribution for the $i$-th observation, with assignment probabilities $a_i$ ($a_i$ is a 2-dimensional vector). Given this assumption, write down the ELBO as a function of the variational parameters $m$, $v^2$, $a$.
4. Now that we have the ELBO formulation, let’s try to compute coordinate updates for our latent variables.
Remember that the optimal variational density of a latent variable $z_j$ is proportional to the exponentiated expected log of its complete conditional given all other latent variables in the model and the observed data. In other words:
$$q_j(z_j) \propto \exp\left(\mathbb{E}_{-j}[\log p(z_j \mid z_{-j}, x)]\right)$$
Equivalently, you can also say that the variational density is proportional to the exponentiated expected log of the joint, $\mathbb{E}_{-j}[\log p(z_j, z_{-j}, x)]$. This is a valid coordinate update since the expectations on the right-hand side of the equation do not involve $z_j$, due to the mean-field assumption.
(a) (4 points) Show that the variational update for $a_{i1}$ satisfies
$$a_{i1} \propto \exp\left(\mathbb{E}[\mu_1; m_1, s_1^2]\, x_i - \frac{\mathbb{E}[\mu_1^2; m_1, s_1^2]}{2}\right).$$
(Hint: We can write the optimal variational density for the cluster assignment variables as $q(c_i; a_{i1}) \propto \exp\left(\log p(c_i) + \mathbb{E}_{\mu}[\log p(x_i \mid c_i, \mu); m, v^2]\right)$. Feel free to drop additive constants along the way.)
(b) (6 points) Show that the variational updates for the $k$-th mixture component are
$$m_k = \frac{\sum_i a_{ik} x_i}{1/\sigma^2 + \sum_i a_{ik}} \quad\text{and}\quad v_k^2 = \frac{1}{1/\sigma^2 + \sum_i a_{ik}}.$$
(Hint: We can write the optimal variational density for the $k$-th mixture component as $q(\mu_k) \propto \exp\left(\log p(\mu_k) + \mathbb{E}_{c_i}[\log p(x_i \mid c_i, \mu); a_i, m_{-k}, v_{-k}^2]\right)$. Feel free to drop additive constants along the way.)
1.3 Running CAVI: Toy Example
Let’s now see this in action!
Recall that the CAVI update algorithm for a Gaussian Mixture Model is as follows:
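As a rough sketch of that procedure (assuming the coordinate updates derived in Section 1.2; the function below is illustrative, not a reference implementation), one pass of CAVI might look like:

```python
import numpy as np

def cavi_gmm(x, m, v2, a, sigma2, epochs):
    """One possible CAVI loop for the two-component Bayesian GMM above.

    x: (n,) observations; m, v2: (K,) variational means and variances;
    a: (n, K) assignment probabilities; sigma2: prior variance.
    """
    for _ in range(epochs):
        # Assignment update: a_ik proportional to exp(E[mu_k] x_i - E[mu_k^2] / 2),
        # where E[mu_k] = m_k and E[mu_k^2] = m_k^2 + v_k^2.
        logits = np.outer(x, m) - 0.5 * (m ** 2 + v2)
        a = np.exp(logits - logits.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)
        # Component updates: m_k and v_k^2 from the formulas in question 4(b).
        denom = 1.0 / sigma2 + a.sum(axis=0)
        m = (a * x[:, None]).sum(axis=0) / denom
        v2 = 1.0 / denom
    return m, v2, a
```

For the toy example below, the inputs would be x = np.array([0.1, -0.3, 1.2, 0.8, -0.5]), m = np.array([0.5, 0.5]), v2 = np.array([1.0, 1.0]), a = np.tile([0.3, 0.7], (5, 1)), sigma2 = 0.01, and epochs = 5.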
Note that our notation differs slightly, with $\phi$ corresponding to $a$ and $s^2$ corresponding to $v^2$. We also have $K = 2$. Assume initial parameters $m = [0.5, 0.5]$, $v^2 = [1, 1]$, and $a_i = [0.3, 0.7]$ for all $i \in \{1, \dots, n\}$, and a sample $x = [0.1, -0.3, 1.2, 0.8, -0.5]$. Also assume prior variance $\sigma^2 = 0.01$.
Write a python script implementing the above procedure and run it for 5 epochs. You should submit your
code to autolab as a .tar file named cavi.tar containing a single file cavi.py. You can create that file by
running:
tar -cvf cavi.tar cavi.py
from the directory containing your code.
After the fifth epoch, report:
1. (2 points) The variational parameters $m$.
   $m = $
2. (2 points) The variational parameters $v^2$.
   $v^2 = $
3. (2 points) The variational parameters $a$.
   $a_1 = $, $a_2 = $, $a_3 = $, $a_4 = $, $a_5 = $
Hints:
1. Note that the expectation update for $a$ does not depend on $\mu$. (Why?)
2. The expectation of the square of a Gaussian random variable is $\mathbb{E}[X^2] = \mathrm{Var}[X] + (\mathbb{E}[X])^2$.
1.4 Variational Inference vs. Monte Carlo Methods
Let’s end with a brief comparison between variational methods and MCMC methods. We have seen that
both classes of methods can be used for learning in scenarios involving latent variables, but both have their
own sets of advantages and disadvantages. For each of the following statements, specify whether they apply
more suitably to VI or MCMC methods:
1. (1 point) Transforms inference into an optimization problem.
Variational Inference
MCMC
2. (1 point) Is easier to integrate with back-propagation.
Variational Inference
MCMC
3. (1 point) Involves more stochasticity.
Variational Inference
MCMC
4. (1 point) Non-parametric.
Variational Inference
MCMC
5. (1 point) Has higher variance under limited computational resources.
Variational Inference
MCMC
1.5 Wrap-up Questions
1. (1 point) Multiple Choice: Did you correctly submit your code to Autolab?
Yes
No
2. (1 point) Numerical answer: How many hours did you spend on this assignment?
1.6 Collaboration Policy
After you have completed all other components of this assignment, report your answers to the collaboration
policy questions detailed in the Academic Integrity Policies for this course.
1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full
details including names of people who helped you and the exact nature of help you received.
2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details
including names of people you helped and the exact nature of help you offered.
3. Did you find or come across code that implements any part of this assignment? If so, include full
details including the source of the code and how you used it in the assignment.