Description
The form of the Expectation-Maximization (EM) algorithm we studied in class is the most
common one; it is sometimes called “soft EM” because the E-step computes soft assignments of
the hidden variables.
There is a variant called “hard EM” in which the E-step is replaced by a hard
assignment of the hidden variables.
Specifically, suppose the observed data point is denoted by X ∈ 𝒳, the hidden variable is
denoted by Y ∈ [k] = {1, 2, . . . , k}, and their joint distribution is obtained from a parameterized
model with parameters denoted by θ. Let Pr_θ[·] denote probabilities in the distribution with
parameters θ.
The (soft) EM algorithm as we studied it in class repeats the following two steps until convergence, starting from an arbitrary initialization of the parameters θ:
• E-step: For each training example x^(i) and for all j ∈ [k], compute

  q_j^(i) = Pr_θ[Y = j | X = x^(i)]

• M-step: Update θ as follows:

  θ ← arg max_{θ′} ∑_{i=1}^{n} ∑_{j=1}^{k} q_j^(i) · ln Pr_{θ′}[X = x^(i), Y = j]
In hard EM, the E-step above is replaced by a hard assignment of the hidden variable to the
most likely value for the given data point under the current distribution (i.e. with parameters θ).
We will call this the E’-step. The M-step stays the same.
The hard EM algorithm repeats the following two steps until convergence, starting from an
arbitrary initialization of the parameters θ:
• E’-step: For every training example x^(i), compute

  j^(i) = arg max_{j ∈ [k]} Pr_θ[Y = j | X = x^(i)],

breaking ties arbitrarily, and set

  q_j^(i) = 1 if j = j^(i), and q_j^(i) = 0 otherwise.
• M-step: The same as the M-step in the soft EM algorithm.
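The only change relative to soft EM is how q^(i) is formed. Here is a minimal sketch of the E’-step under the same illustrative assumptions as the soft E-step sketch above (an n × k array joint_prob; the name hard_e_step is not prescribed). It uses the fact that the arg max over j of the joint equals the arg max of the posterior, since the normalizer Pr_θ[X = x^(i)] does not depend on j.

```python
import numpy as np

def hard_e_step(joint_prob):
    """E'-step: hard-assign each example to its most likely hidden value.

    joint_prob: array of shape (n, k), as in the soft E-step sketch.
    """
    n, k = joint_prob.shape
    j_star = joint_prob.argmax(axis=1)     # ties broken arbitrarily (argmax takes the first)
    q = np.zeros((n, k))
    q[np.arange(n), j_star] = 1.0          # one-hot q^(i): 1 at j^(i), 0 elsewhere
    return q
```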
Now consider implementing the hard EM algorithm for the following specific scenario. We have
a Gaussian mixture model where all the k Gaussians have the same covariance matrix, I, the
identity matrix. Thus, in particular, the parameters of the model are
θ = (π_1, µ_1, π_2, µ_2, . . . , π_k, µ_k),
where the π_j are the mixing weights and the µ_j are the centers of the Gaussians. Samples are generated
from this mixture model exactly as in class (viz., as described on slide 4 of the lecture). The setting
here is a bit simpler in that the covariance matrices are already known (they are all I), so they do not need to be estimated.
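To make the generative process concrete, here is a small sketch of how samples could be drawn from such a mixture: first draw the hidden label Y = j with probability π_j, then draw X ~ N(µ_j, I). The function name sample_mixture and its arguments are illustrative only; this sketch is not part of the required solution.

```python
import numpy as np

def sample_mixture(pi, mu, n, rng=None):
    """Draw n samples from a mixture of k Gaussians with identity covariance.

    pi: length-k array of mixing weights (non-negative, summing to 1).
    mu: array of shape (k, d) whose rows are the Gaussian centers.
    Returns the samples x (shape (n, d)) and the hidden labels y (shape (n,)).
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu)
    k, d = mu.shape
    y = rng.choice(k, size=n, p=pi)                   # pick component j with probability pi[j]
    x = mu[y] + rng.standard_normal((n, d))           # add N(0, I) noise around the chosen center
    return x, y
```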
Part 1 of assignment (weightage: 4%). For the mixture of Gaussians setting described
above, work out the precise implementation of the E’- and M-steps.
Part 2 of assignment (weightage: 1%). The hard EM algorithm you derived above should
be almost (but not exactly) the same as another algorithm we have already studied in class. Which
algorithm? What is the difference between the two algorithms?