# ECSE 343 Assignment 2: Advanced Model Fitting

\$30.00

## Description

The learning objectives of this assignment are to explore and implement:

- a maximum likelihood estimator (MLE),
- a maximum a posteriori (MAP) estimator, and
- the expectation-maximization (EM) algorithm.

In doing so, you'll revisit the polynomial fitting problem before moving on to a more sophisticated mixture model classification problem.
## 1 Polynomial regression with MLE and MAP [40%]

We begin by exploring a maximum likelihood estimator (MLE) and maximum a posteriori (MAP) estimator on the familiar polynomial regression problem.

*Figure: Comparing degree-7 polynomial fits to noisy data and across estimators.*
### 1.1 Polynomial regression

The overdetermined degree-$m$ polynomial regression problem — with an explicit corruption/noise model on the data — seeks an interpolant across $n$ data samples $(x_i, y_i)$ that satisfies:

$$y_i = \sum_{j=0}^{m} \theta_j x_i^j + \epsilon_i \qquad (i = 0, \ldots, n-1)$$

where:

- $x_i$ is the independent coordinate of sample $i$, with $\mathbf{x} = \{x_0, \ldots, x_{n-1}\}$,
- $y_i$ is the dependent coordinate of sample $i$, with $\mathbf{y} = \{y_0, \ldots, y_{n-1}\}$,
- $\epsilon_i \sim \mathcal{N}(\mu, \sigma^2)$ is Gaussian noise (of mean $\mu$ and variance $\sigma^2$) corrupting the outputs $y_i$, and
- $\boldsymbol{\theta} = \{\theta_0, \theta_1, \ldots, \theta_m\}$ are the $p = m + 1$ polynomial coefficients we are solving for.

Note that one common way to rewrite this model is by "folding in" the deterministic component into the mean of the Gaussian, as:

$$y_i \sim \mathcal{N}\left(\mu + \sum_{j=0}^{m}\theta_j x_i^j,\ \sigma^2\right).$$

### 1.2 Maximum likelihood estimation (MLE) [20%]

You will implement a MLE for the polynomial parameters $\boldsymbol{\theta}$ that maximize the data's likelihood function:

$$\boldsymbol{\theta}_{\mathrm{MLE}} := \operatorname*{argmax}_{\boldsymbol{\theta}}\ p(\mathbf{y}\,|\,\mathbf{x}, \boldsymbol{\theta})\,,$$

where — assuming the samples $(x_i, y_i)$ are drawn i.i.d. — the likelihood function can be expressed using the normal distribution's density, as

$$p(\mathbf{y}\,|\,\mathbf{x}, \boldsymbol{\theta}) = \prod_{i=0}^{n-1} p(y_i\,|\,x_i, \boldsymbol{\theta}) = \prod_{i=0}^{n-1} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\Big[y_i - \mu - \sum_{j=0}^{m}\theta_j x_i^j\Big]^2\right).$$

Taking the log of the likelihood before taking the argmax — which is a valid transformation under argmax, given log's monotonicity — yields:

$$\log\big(p(\mathbf{y}\,|\,\mathbf{x},\boldsymbol{\theta})\big) = \log\Big(\prod_{i=0}^{n-1} p(y_i\,|\,x_i,\boldsymbol{\theta})\Big) = -n\log\big(\sqrt{2\pi\sigma^2}\big) - \frac{1}{2\sigma^2}\sum_{i=0}^{n-1}\Big[y_i - \mu - \sum_{j=0}^{m}\theta_j x_i^j\Big]^2,$$

which we optimize for by setting the derivatives w.r.t. the parameters to zero,

$$\frac{\partial}{\partial\theta_k}\log\big(p(\mathbf{y}\,|\,\mathbf{x},\boldsymbol{\theta})\big) = \frac{1}{\sigma^2}\sum_{i=0}^{n-1} x_i^k\Big[y_i - \mu - \sum_{j=0}^{m}\theta_j x_i^j\Big] = 0\,,$$

before combining these partial derivatives to form a linear system $\mathbf{A}\,\boldsymbol{\theta}_{\mathrm{MLE}} = \mathbf{b}$, with

$$A_{i,j} = \sum_{s=0}^{n-1} x_s^{i+j} \quad\text{and}\quad b_i = \sum_{s=0}^{n-1} x_s^{i}\,(y_s - \mu)\,; \qquad i,j \in [0, m].$$

Solving this system yields the MLE for $\boldsymbol{\theta}_{\mathrm{MLE}}$.

**Deliverable 1 [20%]**
Complete the implementation of the function PolRegMLE that takes inputs $p$ (equal to $m + 1$), $\mu$, $\mathbf{x}$, and $\mathbf{y}$ and outputs the system matrix $\mathbf{A}$ and RHS $\mathbf{b}$ for the MLE. Notice how the system does not depend on $\sigma$, as discussed in class.

Unlike the first part of A1, deliverables 1, 2 and 3 below won't solve the linear system, but only form it. You can explore numpy's many linear system solving routines (e.g., numpy.linalg.solve) in order to compute and assess the solutions of your various estimator systems.

### 1.3 Maximum a posteriori (MAP) [20%]

Maximum a posteriori (MAP) estimators consider an additional prior $p(\boldsymbol{\theta})$ over the parameters and solve for parameters $\boldsymbol{\theta}_{\mathrm{MAP}}$ that maximize the (log) posterior distribution of the parameters, given the data:

$$\boldsymbol{\theta}_{\mathrm{MAP}} := \operatorname*{argmax}_{\boldsymbol{\theta}}\ p(\boldsymbol{\theta}\,|\,\mathbf{y},\mathbf{x}) = \operatorname*{argmax}_{\boldsymbol{\theta}}\ p(\mathbf{y}\,|\,\mathbf{x},\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,.$$

In the polynomial regression setting, we will assume that the $\theta_k$ are drawn i.i.d. from a normal distribution $\mathcal{N}(\mu_{\boldsymbol{\theta}}, \sigma_{\boldsymbol{\theta}}^2)$, so their joint prior probability density is given by:

$$p(\boldsymbol{\theta}) = \prod_{k=0}^{m} p(\theta_k) = \prod_{k=0}^{m} \frac{1}{\sqrt{2\pi\sigma_{\boldsymbol{\theta}}^2}} \exp\left(-\frac{1}{2\sigma_{\boldsymbol{\theta}}^2}\big(\theta_k - \mu_{\boldsymbol{\theta}}\big)^2\right).$$

The derivative with respect to each parameter $\theta_k$ of $l(\boldsymbol{\theta}\,|\,\mathbf{x},\mathbf{y}) \equiv \log\big(p(\mathbf{y}\,|\,\mathbf{x},\boldsymbol{\theta})\,p(\boldsymbol{\theta})\big)$ is

$$\frac{\partial}{\partial\theta_k}\, l(\boldsymbol{\theta}\,|\,\mathbf{x},\mathbf{y}) = \frac{1}{\sigma^2}\sum_{i=0}^{n-1} x_i^k\Big[y_i - \mu - \sum_{j=0}^{m}\theta_j x_i^j\Big] - \frac{1}{\sigma_{\boldsymbol{\theta}}^2}\big(\theta_k - \mu_{\boldsymbol{\theta}}\big)\,,$$

and we combine these partials together to form a linear system $\mathbf{A}\,\boldsymbol{\theta}_{\mathrm{MAP}} = \mathbf{b}$ that we can solve for the MAP estimate of the parameters, where

$$A_{i,j} = \begin{cases} \displaystyle\sum_{s=0}^{n-1} x_s^{i+j} + \frac{\sigma^2}{\sigma_{\boldsymbol{\theta}}^2} & \text{if } i = j \\[1.5ex] \displaystyle\sum_{s=0}^{n-1} x_s^{i+j} & \text{otherwise} \end{cases} \quad\text{and}\quad b_i = \sum_{s=0}^{n-1} x_s^{i}\,(y_s - \mu) + \frac{\sigma^2}{\sigma_{\boldsymbol{\theta}}^2}\,\mu_{\boldsymbol{\theta}}\,; \qquad i,j \in [0, m].$$
**Deliverable 2 [15%]**
Complete the implementation of the function PolRegMAP with inputs $p = m + 1$, $\mu$, $\sigma$, $\mu_{\boldsymbol{\theta}}$, $\sigma_{\boldsymbol{\theta}}$, $\mathbf{x}$, and $\mathbf{y}$ that outputs the MAP linear system matrix $\mathbf{A}$ and RHS $\mathbf{b}$.
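To make this construction concrete, here is a minimal numpy sketch of how the two systems might be formed. The function names and signatures mirror Deliverables 1 and 2, but treat this as an illustrative sketch under those naming assumptions, not as the provided base code:

```python
import numpy as np

def PolRegMLE(p, mu, x, y):
    # Illustrative sketch: form the MLE system A theta = b, where
    # A[i,j] = sum_s x_s^(i+j) and b[i] = sum_s x_s^i * (y_s - mu).
    V = x[:, None] ** np.arange(p)[None, :]  # n x p matrix of monomials x_s^j
    A = V.T @ V                              # entries sum_s x_s^(i+j)
    b = V.T @ (y - mu)                       # entries sum_s x_s^i (y_s - mu)
    return A, b

def PolRegMAP(p, mu, sigma, mu_theta, sigma_theta, x, y):
    # Illustrative sketch: the MAP system is the MLE system plus the
    # Gaussian-prior terms sigma^2/sigma_theta^2 on the diagonal and RHS.
    A, b = PolRegMLE(p, mu, x, y)
    ratio = sigma**2 / sigma_theta**2
    return A + ratio * np.eye(p), b + ratio * mu_theta
```

With a very broad prior (large $\sigma_{\boldsymbol{\theta}}$), the MAP system reduces to the MLE one, a convenient sanity check once you feed either system to numpy.linalg.solve.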
As discussed in class, the linear least squares (LLS) solution to the polynomial regression problem is equivalent to the MLE when the noise is assumed to be normally distributed with zero mean. As such, we should expect that MLE and LLS yield the same parameters — with the exception of the constant term $\theta_0$ — when the noise is not of zero mean.

**Deliverable 3 [5%]**
Revisit LLS by completing the implementation of PolRegLLS with inputs $p = m + 1$, $n$, $\mathbf{x}$, and $\mathbf{y}$ that outputs the appropriate fully-constrained linear system matrix $\mathbf{A}$ and RHS $\mathbf{b}$ needed to construct the LLS system.

## 2 Mixture Models with Expectation-Maximization [60%]

We will now explore the expectation-maximization (EM) algorithm through the lens of a Gaussian mixture model (GMM) example.

*Figure: Fitting a GMM to a randomly generated, multi-modal 2D dataset using the EM algorithm.*

ⓘ Our evaluation of your GMM code will impose a (sufficiently lenient) time limit on the execution of your algorithm — be mindful to avoid loops and to leverage numpy vectorization whenever suitable.

Consider many $d$-dimensional data samples $\mathbf{x}_i$ drawn from a mixture model parameterized by $\boldsymbol{\theta}$. Our example will use a mixture of multivariate normal distributions $\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu},\boldsymbol{\Sigma})$ (i.e., high-dimensional, anisotropic Gaussians), with $K$ mixture components, and so our $\boldsymbol{\theta} = \{\boldsymbol{\mu}_0,\ldots,\boldsymbol{\mu}_{K-1},\boldsymbol{\Sigma}_0,\ldots,\boldsymbol{\Sigma}_{K-1},\pi_0,\ldots,\pi_{K-1}\}$, where

- each of the $K$ $d$-dimensional Gaussians is parameterized by a $d$-dimensional mean vector $\boldsymbol{\mu}_i$ and a $d \times d$ covariance matrix $\boldsymbol{\Sigma}_i$ (see below), and
- the normalized mixture proportions $\pi_c$ weight the amount of each of the $K$ components in the mixture, with $\sum_{c=0}^{K-1}\pi_c = 1$.

We use the standard definition of a $d$-dimensional multivariate normal distribution, with density:

$$\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\Big[-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\intercal}\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\boldsymbol{\mu})\Big]\,.$$
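This density translates directly into numpy. The stand-alone helper below is an illustrative transcription of the formula above (the base code instead exposes scipy's multivariate_normal class, whose pdf method computes the same quantity, vectorized over samples):

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Evaluate N(x | mu, Sigma) at each row of X (an n x d array)."""
    d = mu.shape[0]
    diff = X - mu                                   # n x d deviations
    Sinv = np.linalg.inv(Sigma)
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu), one value per sample
    quad = np.einsum('nd,de,ne->n', diff, Sinv, diff)
    norm = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm
```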
The marginal distribution of $\mathbf{x}_i$ is then:

$$p(\mathbf{x}_i\,|\,\boldsymbol{\theta}) = \sum_{c=0}^{K-1}\pi_c\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)\,.$$

As such, the joint probability of $n$ i.i.d. data samples $\hat{\mathbf{x}} = \{\mathbf{x}_0,\ldots,\mathbf{x}_{n-1}\}$ is

$$p(\hat{\mathbf{x}}\,|\,\boldsymbol{\theta}) = \prod_{i=0}^{n-1}\sum_{c=0}^{K-1}\pi_c\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)\,.$$

And so, given the samples $\hat{\mathbf{x}}$ and knowing only the number of mixture components $K$, our goal is to estimate the mixture's parameters $\boldsymbol{\theta}$ that best match the data: the means $\boldsymbol{\mu}_c$ (i.e., centers), covariances $\boldsymbol{\Sigma}_c$ (i.e., anisotropic "shapes"), and weights $\pi_c$ of the mixture components.

### 2.1 From MLE to Expectation-Maximization

A natural first step is to derive an MLE for $\boldsymbol{\theta}$. The likelihood function is simply the joint probability $p(\hat{\mathbf{x}}\,|\,\boldsymbol{\theta})$ above, and so the log-likelihood is

$$\log p(\hat{\mathbf{x}}\,|\,\boldsymbol{\theta}) = \sum_{i=0}^{n-1}\log\Big(\sum_{c=0}^{K-1}\pi_c\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)\Big)\,.$$

We immediately run into difficulties when differentiating with respect to, say, $\boldsymbol{\mu}_j$:

$$\frac{\partial}{\partial\boldsymbol{\mu}_j}\log p(\hat{\mathbf{x}}\,|\,\boldsymbol{\theta}) = \sum_{i=0}^{n-1}\frac{\pi_j\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}{\sum_{c=0}^{K-1}\pi_c\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)}\;\boldsymbol{\Sigma}_j^{-1}\big(\mathbf{x}_i-\boldsymbol{\mu}_j\big) = 0\,,$$

as we are unable to isolate the $\boldsymbol{\mu}_j$ due to the non-linearity induced by the term

$$\pi_j\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)\,\Big/\,\sum_{c=0}^{K-1}\pi_c\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)\,.$$

We face similar problems when differentiating with respect to the other parameters in $\boldsymbol{\theta}$.

To sidestep this issue, we will introduce the concept of a latent label $z_i \in \{0,\ldots,K-1\}$ for each data sample $\mathbf{x}_i$: this latent label corresponds to the ground-truth association of each data sample to its corresponding mixture component, and so $\pi_c = p(z_i = c\,|\,\boldsymbol{\theta})$.

By the law of total probability, the marginal conditional probability distribution of $\mathbf{x}_i$ given $\boldsymbol{\theta}$ can now be rewritten as

$$p(\mathbf{x}_i\,|\,\boldsymbol{\theta}) = \sum_{c=0}^{K-1}\underbrace{p(z_i = c\,|\,\boldsymbol{\theta})}_{\pi_c}\;\underbrace{p(\mathbf{x}_i\,|\,z_i = c,\boldsymbol{\theta})}_{\mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)}\,.$$

Intuitively, knowing the latent variables $z_i$ should help us in computing the MLE. To do so, we first compute the posterior vector $\boldsymbol{\alpha}$ with elements $\alpha_{i,j} = p(z_i = j\,|\,\mathbf{x}_i,\boldsymbol{\theta})$ using Bayes' theorem:

$$\alpha_{i,j} := p(z_i = j\,|\,\mathbf{x}_i,\boldsymbol{\theta}) = \frac{p(z_i = j\,|\,\boldsymbol{\theta})\;p(\mathbf{x}_i\,|\,z_i = j,\boldsymbol{\theta})}{p(\mathbf{x}_i\,|\,\boldsymbol{\theta})} = \frac{\pi_j\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}{\sum_{c=0}^{K-1}\pi_c\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)}\,.$$

We can now rewrite the partial derivative of the log-likelihood with respect to $\boldsymbol{\mu}_j$ and set it to zero as:

$$\frac{\partial}{\partial\boldsymbol{\mu}_j}\log p(\hat{\mathbf{x}}\,|\,\boldsymbol{\theta}) = \sum_{i=0}^{n-1}\alpha_{i,j}\;\boldsymbol{\Sigma}_j^{-1}\big(\mathbf{x}_i-\boldsymbol{\mu}_j\big) = 0\,.$$

Even though $\alpha_{i,j}$ depends on $\boldsymbol{\mu}_j$, we can define an iterative process that assumes we have an approximate value of $\alpha_{i,j}$ and solves for $\boldsymbol{\mu}_j$ as:

$$\boldsymbol{\mu}_j = \frac{1}{\omega_j}\sum_{i=0}^{n-1}\alpha_{i,j}\,\mathbf{x}_i\,,$$

where $\omega_j = \sum_{i=0}^{n-1}\alpha_{i,j}$ can be interpreted as the effective number of data samples assigned to the mixture component $j$. We can similarly obtain expressions for the other parameters as:

$$\boldsymbol{\Sigma}_j = \frac{1}{\omega_j}\sum_{i=0}^{n-1}\alpha_{i,j}\,(\mathbf{x}_i-\boldsymbol{\mu}_j)^{\intercal}(\mathbf{x}_i-\boldsymbol{\mu}_j) \quad\text{and}\quad \pi_j = \frac{\omega_j}{n}\,.$$

After computing these (updated) values for the $\boldsymbol{\mu}_j$, $\boldsymbol{\Sigma}_j$ and $\pi_j$, we can recompute new estimates for the $\alpha_{i,j}$ from the expression we derived using Bayes' theorem. We repeat this process until "convergence".

### 2.2 Formalizing the EM Algorithm

To summarize:

- if we knew the parameters $\boldsymbol{\theta}$, we could compute the posterior probabilities $\boldsymbol{\alpha}$, and
- if we knew the posteriors $\boldsymbol{\alpha}$, we could compute the parameters $\boldsymbol{\theta}$.

The EM algorithm proceeds as follows:

0. Initialize the parameters $\boldsymbol{\theta}$ and evaluate the log-likelihood $\log p(\hat{\mathbf{x}}\,|\,\boldsymbol{\theta})$ with these parameters,
1. **E-step:** evaluate the posterior probabilities $\alpha_{i,j}$ using the current values of the parameters $\boldsymbol{\theta}$,
2. **M-step:** estimate new parameters $\boldsymbol{\theta}$ with the current values of $\alpha_{i,j}$,
3. Evaluate the log-likelihood $\log p(\hat{\mathbf{x}}\,|\,\boldsymbol{\theta})$ with the new parameter estimates — if the log-likelihood has changed by less than some (typically small) convergence threshold, terminate; otherwise, repeat steps 1 through 3.
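Steps 0 through 3 amount to a short driver loop. The sketch below is hypothetical: the method names echo the deliverables in Section 2.3, the gmm object stands in for the provided class, and, unlike the actual deliverable, logLikelihood here returns the value directly for brevity:

```python
def fit_em(gmm, max_iters=200, tol=1e-6):
    """Hypothetical EM driver; method names follow the deliverables."""
    gmm.normalization()                    # beta_i for the initial parameters
    log_lik = gmm.logLikelihood()          # step 0: initial log-likelihood
    for _ in range(max_iters):
        gmm.expectation()                  # step 1 (E-step): update alpha
        gmm.maximization()                 # step 2 (M-step): update pi, mu, Sigma
        gmm.normalization()                # refresh beta for the new parameters
        new_log_lik = gmm.logLikelihood()  # step 3: re-evaluate
        if abs(new_log_lik - log_lik) < tol:
            break                          # converged
        log_lik = new_log_lik
    return gmm
```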
EM is sensitive to the initial parameter values, so special care must be taken in the first step. Given "valid" initial values, however, the EM algorithm guarantees that the log-likelihood never decreases from one iteration to the next.
We provide a basic class structure to encapsulate GMM parameters and some utility methods.
### 2.3 EM Implementation

When completing your implementation, you should only require explicit loops over the mixture components. All other operations can be vectorized with numpy routines, such as sum(·, axis=…), dot(·, ·) and argmax(·, axis=…), and the component-wise array operators +, −, * and /. We similarly provide you with access to scipy's multivariate_normal class, whose (efficient and vectorized) pdf method implements the Gaussian density evaluation.

Carefully review the GMM class we provide in the base code.

#### Compute normalization of the mixture for all the samples

Compute the mixture normalization factors $\boldsymbol{\beta} = \{\beta_0,\ldots,\beta_{n-1}\}$, given the current parameters $\boldsymbol{\theta}$, for all the data samples, as:

$$\beta_i := \sum_{c=0}^{K-1}\pi_c\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)\,.$$

You will use the result of this function in the expectation step and the log-likelihood computation.

**Deliverable 4 [10%]**
Complete the implementation of normalization which takes as input the GMM instance and updates its property gmm.beta.

#### Expectation step

Using the values of $\boldsymbol{\beta}$, you have to compute the posterior probabilities $\boldsymbol{\alpha}$ with elements $\alpha_{i,j}$ for the mixture components, as:

$$\alpha_{i,j} = \frac{\pi_j\,\mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}{\beta_i}\,.$$

**Deliverable 5 [10%]**
Complete the implementation of expectation which takes as input the GMM instance and updates its property gmm.alpha.

#### Maximization step

Using the values of $\boldsymbol{\alpha}$, you have to update the following quantities:

$$z_i = \operatorname*{argmax}_{j}\,\alpha_{i,j}\,, \qquad \omega_j = \sum_{i=0}^{n-1}\alpha_{i,j}\,, \qquad \pi_j = \frac{\omega_j}{n}\,, \qquad \boldsymbol{\mu}_j = \frac{1}{\omega_j}\sum_{i=0}^{n-1}\alpha_{i,j}\,\mathbf{x}_i\,,$$

$$\boldsymbol{\Sigma}_j = \frac{1}{\omega_j}\sum_{i=0}^{n-1}\alpha_{i,j}\,(\mathbf{x}_i-\boldsymbol{\mu}_j)^{\intercal}(\mathbf{x}_i-\boldsymbol{\mu}_j) + \boldsymbol{\Sigma}_{\text{regularization}}\,.$$

During EM iterations with GMMs, covariance matrices may become singular. To circumvent this problem, consider regularizing the covariance matrices, when necessary; note, however, that using too large a regularization can lead to divergence (e.g., a possible reduction in the log-likelihood estimate between iterations).

**Deliverable 6 [20%]**
Complete the implementation of maximization which updates the $z_i$, $\omega_j$, $\pi_j$, $\boldsymbol{\mu}_j$, $\boldsymbol{\Sigma}_j$ in the input GMM instance; the associated member variables are gmm.Z, gmm.weight, gmm.pi, gmm.mu, gmm.cov. Afterwards, you can update the multivariate normal parameters of the mixture elements in gmm.gauss with $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$.

ⓘ Don't forget to regularize the covariance matrices.

#### Computing the log-likelihood

We can finally compute the log-likelihood estimate using the current $\boldsymbol{\beta}$:

$$\log P(\hat{\mathbf{x}}\,|\,\boldsymbol{\theta}) = \sum_{i=0}^{n-1}\log(\beta_i)\,.$$

**Deliverable 7 [10%]**
Complete the implementation of logLikelihood which computes the log-likelihood $\log P(\hat{\mathbf{x}}\,|\,\boldsymbol{\theta})$, returning nothing and updating the GMM instance variable gmm.log_likelihoods.
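Putting the four pieces together, here is a hypothetical, array-based sketch of the whole iteration. It operates on plain numpy arrays rather than the provided GMM class, uses scipy's multivariate_normal.pdf for the density, and keeps only the loop over mixture components explicit, as suggested above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def normalization(X, pi, mu, cov):
    # beta_i = sum_c pi_c N(x_i | mu_c, Sigma_c); also return the per-component
    # terms pi_c N(x_i | mu_c, Sigma_c) (n x K), reused by the E-step.
    K = pi.shape[0]
    comp = np.stack([pi[c] * multivariate_normal.pdf(X, mean=mu[c], cov=cov[c])
                     for c in range(K)], axis=1)
    return comp.sum(axis=1), comp

def expectation(comp, beta):
    # alpha_{i,j} = pi_j N(x_i | mu_j, Sigma_j) / beta_i
    return comp / beta[:, None]

def maximization(X, alpha, reg=1e-6):
    n, d = X.shape
    K = alpha.shape[1]
    omega = alpha.sum(axis=0)                # effective counts per component
    pi = omega / n
    mu = (alpha.T @ X) / omega[:, None]      # mu_j = (1/omega_j) sum_i alpha_{i,j} x_i
    cov = np.empty((K, d, d))
    for j in range(K):
        diff = X - mu[j]                     # weighted outer products form Sigma_j
        cov[j] = (alpha[:, j, None] * diff).T @ diff / omega[j] + reg * np.eye(d)
    z = alpha.argmax(axis=1)                 # hard labels, for classification
    return z, omega, pi, mu, cov

def log_likelihood(beta):
    return np.log(beta).sum()
```

Each update maps one-to-one onto the formulas above; the reg term plays the role of the diagonal $\boldsymbol{\Sigma}_{\text{regularization}}$.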