## Description

The learning objectives of this assignment are to explore and implement a:

- maximum likelihood estimator,
- maximum a posteriori estimator, and
- the expectation-maximization algorithm.

In doing so, you’ll revisit the polynomial fitting problem before moving on to a more sophisticated mixture model classification problem.

## 1 Polynomial regression with MLE and MAP [40%]

We begin by exploring a maximum likelihood estimator (MLE) and maximum a posteriori (MAP) estimator on the familiar polynomial regression problem.

*Figure: comparing degree-7 polynomial fits to noisy data and across estimators.*

### 1.1 Polynomial regression

The overdetermined degree-$m$ polynomial regression problem — with an explicit corruption/noise model on the data — seeks an interpolant across $n$ data samples $(x_i, y_i)$ that satisfies:

$$ y_i = \sum_{j=0}^{m} \theta_j x_i^j + \epsilon_i \qquad (i = 0, \ldots, n-1) , $$

where:

- $x_i$ is the independent coordinate of sample $i$, with $\mathbf{x} = \{x_0, \ldots, x_{n-1}\}$,
- $y_i$ is the dependent coordinate of sample $i$, with $\mathbf{y} = \{y_0, \ldots, y_{n-1}\}$,
- $\epsilon_i \sim \mathcal{N}(\mu, \sigma^2)$ is Gaussian noise corrupting the outputs $y_i$, and
- $\boldsymbol{\theta} = \{\theta_0, \theta_1, \ldots, \theta_m\}$ are the $p = m + 1$ polynomial coefficients we are solving for.

Note that one common way to rewrite this model is by “folding in” the deterministic component into the mean of the Gaussian, as:

$$ y_i \sim \mathcal{N}\!\left(\mu + \sum_{j=0}^{m} \theta_j x_i^j,\ \sigma^2\right) . $$

### 1.2 Maximum likelihood estimation (MLE) [20%]

You will implement an MLE for the polynomial parameters $\boldsymbol{\theta}$ that maximize the data’s likelihood function:

$$ \boldsymbol{\theta}_{\text{MLE}} := \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\ p(\mathbf{y}\,|\,\mathbf{x}, \boldsymbol{\theta}) , $$

where — assuming the samples $(x_i, y_i)$ are drawn i.i.d. — the likelihood function can be expressed using the normal distribution’s density, as

$$ p(\mathbf{y}|\mathbf{x},\boldsymbol{\theta}) = \prod_{i=0}^{n-1} p(y_i|x_i,\boldsymbol{\theta}) = \prod_{i=0}^{n-1} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left[ y_i - \mu - \sum_{j=0}^{m}\theta_j x_i^j \right]^2 \right) . $$

Taking the log of the likelihood before taking the argmax — which is a valid transformation under argmax, given log’s monotonicity — yields:

$$ \log p(\mathbf{y}|\mathbf{x},\boldsymbol{\theta}) = \log\left( \prod_{i=0}^{n-1} p(y_i|x_i,\boldsymbol{\theta}) \right) = -n \log\left( \sqrt{2\pi\sigma^2} \right) - \frac{1}{2\sigma^2} \sum_{i=0}^{n-1} \left[ y_i - \mu - \sum_{j=0}^{m}\theta_j x_i^j \right]^2 , $$

which we optimize for by setting the derivatives w.r.t. the parameters to zero,

$$ \frac{\partial}{\partial \theta_k} \log p(\mathbf{y}|\mathbf{x},\boldsymbol{\theta}) = \frac{1}{\sigma^2} \sum_{i=0}^{n-1} x_i^k \left[ y_i - \mu - \sum_{j=0}^{m}\theta_j x_i^j \right] = 0 , $$

before combining these partial derivatives to form a linear system $\mathbf{A}\,\boldsymbol{\theta}_{\text{MLE}} = \mathbf{b}$, with

$$ A_{i,j} = \sum_{s=0}^{n-1} x_s^{i+j} \quad\text{and}\quad b_i = \sum_{s=0}^{n-1} x_s^i\,(y_s - \mu) \, ; \qquad i,j \in [0, m] . $$

Solving this system yields the MLE for $\boldsymbol{\theta}$.
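As a concrete (non-deliverable) sketch of how such a system can be assembled, the entries above map directly onto a Vandermonde matrix product; `form_mle_system` and its argument order are illustrative assumptions, not the required `PolRegMLE` signature:

```python
import numpy as np

def form_mle_system(p, mu, x, y):
    """Form the MLE normal equations A @ theta = b for a degree-(p-1) polynomial.

    A[i, j] = sum_s x_s^(i+j),  b[i] = sum_s x_s^i * (y_s - mu).
    """
    powers = np.arange(p)
    X = x[:, None] ** powers[None, :]   # n x p Vandermonde matrix
    A = X.T @ X                         # A[i, j] = sum_s x_s^i * x_s^j
    b = X.T @ (y - mu)                  # b[i] = sum_s x_s^i * (y_s - mu)
    return A, b

# Noise-free quadratic data: the recovered coefficients should match exactly.
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 + 0.5 * x - 3.0 * x**2
A, b = form_mle_system(3, 0.0, x, y)
theta = np.linalg.solve(A, b)
print(np.round(theta, 6))   # ≈ [2, 0.5, -3]
```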

### 1.3 Maximum a posteriori (MAP) [20%]

Maximum a posteriori (MAP) estimators consider an additional prior over the parameters $\boldsymbol{\theta}$ and solve for the parameters $\boldsymbol{\theta}^*$ that maximize the (log) posterior distribution of the parameters, given the data:

$$ \boldsymbol{\theta}_{\text{MAP}} := \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\ p(\boldsymbol{\theta}|\mathbf{y},\mathbf{x}) = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\ p(\mathbf{y}|\mathbf{x},\boldsymbol{\theta})\, p(\boldsymbol{\theta}) . $$

In the polynomial regression setting, we will assume that the $\theta_k$ are drawn i.i.d. from a normal distribution $\mathcal{N}(\mu_{\boldsymbol{\theta}}, \sigma_{\boldsymbol{\theta}}^2)$, so their joint prior probability density is given by:

$$ p(\boldsymbol{\theta}) = \prod_{k=0}^{m} p(\theta_k) = \prod_{k=0}^{m} \frac{1}{\sqrt{2\pi\sigma_{\boldsymbol{\theta}}^2}} \exp\left( -\frac{1}{2\sigma_{\boldsymbol{\theta}}^2} (\theta_k - \mu_{\boldsymbol{\theta}})^2 \right) . $$

The derivative with respect to each parameter $\theta_k$ of $\log(p(\mathbf{y}|\mathbf{x},\boldsymbol{\theta})\,p(\boldsymbol{\theta})) \equiv l(\boldsymbol{\theta}|\mathbf{x},\mathbf{y})$ is

$$ \frac{\partial}{\partial \theta_k}\, l(\boldsymbol{\theta}|\mathbf{x},\mathbf{y}) = \frac{1}{\sigma^2} \sum_{i=0}^{n-1} x_i^k \left[ y_i - \mu - \sum_{j=0}^{m}\theta_j x_i^j \right] - \frac{1}{\sigma_{\boldsymbol{\theta}}^2} (\theta_k - \mu_{\boldsymbol{\theta}}) , $$

and we combine these partials together to form a linear system $\mathbf{A}\,\boldsymbol{\theta}_{\text{MAP}} = \mathbf{b}$ that we can solve for the MAP estimate of the parameters, where

$$ A_{i,j} = \begin{cases} \displaystyle\sum_{s=0}^{n-1} x_s^{i+j} + \frac{\sigma^2}{\sigma_{\boldsymbol{\theta}}^2} & \text{if } i = j \\[2ex] \displaystyle\sum_{s=0}^{n-1} x_s^{i+j} & \text{otherwise} \end{cases} \quad\text{and}\quad b_i = \sum_{s=0}^{n-1} x_s^i\,(y_s - \mu) + \frac{\sigma^2}{\sigma_{\boldsymbol{\theta}}^2}\,\mu_{\boldsymbol{\theta}} \, ; \qquad i,j \in [0, m] . $$


Deliverable 1 [20%]

☑ Complete the implementation of the function PolRegMLE that takes inputs $p$ (equal to $m+1$), $\mu$, $\mathbf{x}$, and $\mathbf{y}$, and outputs the system matrix $\mathbf{A}$ and RHS $\mathbf{b}$ for the MLE. Notice how the system does not depend on $\sigma$, as discussed in class.

ⓘ Unlike the first part of A1, deliverables 1, 2 and 3 below won’t solve the linear system, but only form it. You can explore numpy’s many linear-system-solving routines (e.g., numpy.linalg.solve) in order to compute and assess the solutions of your various estimator systems.
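For example, a solve-and-assess step along the lines the note suggests might look like the following; the 2×2 system here is just a stand-in for one of your estimator systems:

```python
import numpy as np

# A small symmetric positive-definite system standing in for an estimator system.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

theta = np.linalg.solve(A, b)               # direct dense solve
residual = np.linalg.norm(A @ theta - b)    # should be ~0 for a well-posed system
print(theta, residual)
```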


Deliverable 2 [15%]

☑ Complete the implementation of the function PolRegMAP with inputs $p = m+1$, $\mu$, $\sigma$, $\mu_{\boldsymbol{\theta}}$, $\sigma_{\boldsymbol{\theta}}$, $\mathbf{x}$, and $\mathbf{y}$, and that outputs the MAP linear system matrix $\mathbf{A}$ and RHS $\mathbf{b}$.

As discussed in class, the linear least squares solution to the polynomial regression problem is equivalent to the MLE when the noise is assumed to be normally distributed with zero mean. As such, we should expect MLE and LLS to yield the same parameters — with the exception of the constant term $\theta_0$ — when the noise is not of zero mean.
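For intuition, the MAP system from §1.3 is the MLE system plus a prior term on the diagonal and RHS; the sketch below assembles it with a hypothetical helper `form_map_system` (not the required `PolRegMAP` signature), and checks that a very broad prior recovers the MLE/LLS fit:

```python
import numpy as np

def form_map_system(p, mu, sigma, mu_theta, sigma_theta, x, y):
    """Sketch of the MAP normal equations: the MLE system plus a Gaussian-prior term."""
    powers = np.arange(p)
    X = x[:, None] ** powers[None, :]       # n x p Vandermonde matrix
    ratio = sigma**2 / sigma_theta**2       # prior strength relative to the noise
    A = X.T @ X + ratio * np.eye(p)         # prior adds ratio to the diagonal
    b = X.T @ (y - mu) + ratio * mu_theta   # prior pulls each theta_k toward mu_theta
    return A, b

# With a very broad prior (large sigma_theta), MAP approaches the MLE fit.
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 - 2.0 * x
A, b = form_map_system(2, 0.0, 0.1, 0.0, 1e6, x, y)
print(np.round(np.linalg.solve(A, b), 4))   # ≈ [1, -2]
```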

Deliverable 3 [5%]

☑ Revisit LLS by completing the implementation of PolRegLLS with inputs $p = m+1$, $n$, $\mathbf{x}$, and $\mathbf{y}$, and that outputs the appropriate fully-constrained linear system matrix $\mathbf{A}$ and RHS $\mathbf{b}$ needed to construct the LLS system.

## 2 Mixture Models with Expectation-Maximization [60%]

We will now explore the expectation-maximization (EM) algorithm through the lens of a Gaussian mixture model (GMM) example.

*Figure: fitting a GMM to a randomly generated, multi-modal 2D dataset using the EM algorithm.*

ⓘ Our evaluation of your GMM code will impose a (sufficiently lenient) time limit on the execution of your algorithm — be mindful to avoid loops and to leverage numpy vectorization whenever suitable.

Consider many $d$-dimensional data samples $\mathbf{x}_i$ drawn from a mixture model parameterized by $\boldsymbol{\theta}$. Our example will use a mixture of multivariate normal distributions $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})$ (i.e., high-dimensional, anisotropic Gaussians), with $K$ mixture components, and so our $\boldsymbol{\theta} = \{\boldsymbol{\mu}_0,\ldots,\boldsymbol{\mu}_{K-1}, \boldsymbol{\Sigma}_0,\ldots,\boldsymbol{\Sigma}_{K-1}, \pi_0,\ldots,\pi_{K-1}\}$, where

- each of the $K$ $d$-dimensional Gaussians is parameterized by a $d$-dimensional mean vector $\boldsymbol{\mu}_i$ and a $d \times d$ covariance matrix $\boldsymbol{\Sigma}_i$ (see below), and
- the normalized mixture proportions $\pi_c$ weight the amount of each of the $K$ components in the mixture, with $\sum_{c=0}^{K-1} \pi_c = 1$.

We use the standard definition of a $d$-dimensional multivariate normal distribution, with density:

$$ \mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\intercal \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right] . $$
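This density can be evaluated for many samples at once with scipy’s multivariate_normal (which the handout points you to below); here is a quick sanity check of the vectorized pdf against the closed form, on made-up numbers:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
X = np.random.default_rng(0).normal(size=(5, 2))   # five 2-D samples

# Vectorized pdf evaluation over all samples at once.
pdf = multivariate_normal(mean=mu, cov=Sigma).pdf(X)

# Closed-form density for comparison.
d = 2
Sinv = np.linalg.inv(Sigma)
diff = X - mu
quad = np.einsum('nd,de,ne->n', diff, Sinv, diff)   # (x-mu)^T Sigma^-1 (x-mu) per row
ref = np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5)
print(np.allclose(pdf, ref))   # → True
```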

The marginal distribution of $\mathbf{x}_i$ under the mixture is then:

$$ p(\mathbf{x}_i|\boldsymbol{\theta}) = \sum_{c=0}^{K-1} \pi_c\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c) . $$

As such, the joint probability of $n$ i.i.d. data samples $\hat{\mathbf{x}} = \{\mathbf{x}_0,\ldots,\mathbf{x}_{n-1}\}$ is

$$ p(\hat{\mathbf{x}}|\boldsymbol{\theta}) = \prod_{i=0}^{n-1} \sum_{c=0}^{K-1} \pi_c\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c) . $$

And so, given the samples $\hat{\mathbf{x}}$ and knowing only the number of mixture components $K$, our goal is to estimate the mixture’s parameters $\boldsymbol{\theta}$ that best match the data: the means $\boldsymbol{\mu}_c$ (i.e., centers), covariances $\boldsymbol{\Sigma}_c$ (i.e., anisotropic “shapes”), and weights $\pi_c$ of the mixture components.

### 2.1 From MLE to Expectation-Maximization

A natural first step is to derive an MLE for $\boldsymbol{\theta}$. The likelihood function is simply the joint probability $p(\hat{\mathbf{x}}|\boldsymbol{\theta})$ above, and so the log-likelihood is

$$ \log p(\hat{\mathbf{x}}|\boldsymbol{\theta}) = \sum_{i=0}^{n-1} \log\left( \sum_{c=0}^{K-1} \pi_c\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c) \right) . $$

We immediately run into difficulties when differentiating with respect to, say, $\boldsymbol{\mu}_j$:

$$ \frac{\partial}{\partial \boldsymbol{\mu}_j} \log p(\hat{\mathbf{x}}|\boldsymbol{\theta}) = \sum_{i=0}^{n-1} \left( \frac{\pi_j\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}{\sum_{c=0}^{K-1} \pi_c\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)} \right) \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_j) = 0 , $$

as we are unable to isolate the $\boldsymbol{\mu}_j$: the nonlinearity is induced by the $\pi_j\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j) \,/\, \sum_{c=0}^{K-1} \pi_c\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)$ term. We face similar problems when differentiating with respect to the other parameters in $\boldsymbol{\theta}$.

To sidestep this issue, we will introduce the concept of a latent label $z_i \in \{0,\ldots,K-1\}$ for each data sample $\mathbf{x}_i$: this latent label corresponds to the ground-truth association of each data sample to its corresponding mixture component, and so $\pi_c = p(z_i = c|\boldsymbol{\theta})$.

By the law of total probability, the marginal conditional probability distribution of $\mathbf{x}_i$ given $\boldsymbol{\theta}$ can now be rewritten as

$$ p(\mathbf{x}_i|\boldsymbol{\theta}) = \sum_{c=0}^{K-1} \underbrace{p(z_i = c|\boldsymbol{\theta})}_{\pi_c}\; \underbrace{p(\mathbf{x}_i|z_i = c, \boldsymbol{\theta})}_{\mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)} . $$

Intuitively, knowing the latent variables $z_i$ should help us in computing the MLE. To do so, we first compute the vector posterior $\boldsymbol{\alpha}$ with elements $\alpha_{i,j} = p(z_i = j|\mathbf{x}_i, \boldsymbol{\theta})$ using Bayes’ theorem:

$$ \alpha_{i,j} := p(z_i = j|\mathbf{x}_i, \boldsymbol{\theta}) = \frac{p(z_i = j|\boldsymbol{\theta})\, p(\mathbf{x}_i|z_i = j, \boldsymbol{\theta})}{p(\mathbf{x}_i|\boldsymbol{\theta})} = \frac{\pi_j\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{c=0}^{K-1} \pi_c\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)} . $$
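On a toy 1-D, two-component mixture, these responsibilities can be computed directly; the weights, means, and sample points below are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D mixture: two components with weights pi, means mu, std devs sd.
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 2.0])
sd = np.array([1.0, 1.0])
x = np.array([-2.0, 0.0, 2.0])   # three sample points

# alpha[i, j] = pi_j N(x_i | mu_j, sd_j^2) / sum_c pi_c N(x_i | mu_c, sd_c^2)
joint = pi[None, :] * norm.pdf(x[:, None], loc=mu[None, :], scale=sd[None, :])
alpha = joint / joint.sum(axis=1, keepdims=True)

print(alpha.sum(axis=1))   # each row sums to 1, as a posterior must
```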

We can now rewrite the partial derivative of the log-likelihood with respect to $\boldsymbol{\mu}_j$ and set it to zero as:

$$ \frac{\partial}{\partial \boldsymbol{\mu}_j} \log p(\hat{\mathbf{x}}|\boldsymbol{\theta}) = \sum_{i=0}^{n-1} \alpha_{i,j}\, \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_j) = 0 . $$

Even though $\alpha_{i,j}$ depends on $\boldsymbol{\mu}_j$, we can define an iterative process that assumes we have an approximate value of $\alpha_{i,j}$ and solves for $\boldsymbol{\mu}_j$ as:

$$ \boldsymbol{\mu}_j = \frac{1}{\omega_j} \sum_{i=0}^{n-1} \alpha_{i,j}\, \mathbf{x}_i , $$

where $\omega_j = \sum_{i=0}^{n-1} \alpha_{i,j}$ can be interpreted as the effective number of data samples assigned to the mixture component $j$. We can similarly obtain expressions for the other parameters as:

$$ \boldsymbol{\Sigma}_j = \frac{1}{\omega_j} \sum_{i=0}^{n-1} \alpha_{i,j}\, (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^\intercal \quad\text{and}\quad \pi_j = \frac{\omega_j}{n} . $$

After computing these (updated) values for the $\boldsymbol{\mu}_j$, $\boldsymbol{\Sigma}_j$ and $\pi_j$, we can recompute new estimates for $\alpha_{i,j}$ from the expression we derived using Bayes’ theorem. We repeat this process until “convergence”.

### 2.2 Formalizing the EM Algorithm

To summarize:

- if we knew the parameters $\boldsymbol{\theta}$, we could compute the posterior probabilities $\boldsymbol{\alpha}$, and
- if we knew the posteriors $\boldsymbol{\alpha}$, we could compute the parameters $\boldsymbol{\theta}$.

The EM algorithm proceeds as follows:

0. Initialize the parameters $\boldsymbol{\theta}$ and evaluate the log-likelihood $\log p(\hat{\mathbf{x}}|\boldsymbol{\theta})$ with these parameters,
1. E-step: evaluate the posterior probabilities $\alpha_{i,j}$ using the current values of the parameters $\boldsymbol{\theta}$,
2. M-step: estimate new parameters $\boldsymbol{\theta}$ with the current values of $\alpha_{i,j}$,
3. Evaluate the log-likelihood $\log p(\hat{\mathbf{x}}|\boldsymbol{\theta})$ with the new parameter estimates — if the log-likelihood has changed by less than some (typically small) convergence threshold, terminate; otherwise, repeat steps 1 through 3.
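Steps 0 through 3 can be sketched end-to-end on a toy 2-D dataset. Everything below — the data, the initialization scheme, and the regularization constant — is an illustrative assumption, not the handout’s GMM class:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Toy data: two well-separated 2-D blobs, 100 samples each.
X = np.vstack([rng.normal([-3.0, 0.0], 0.5, size=(100, 2)),
               rng.normal([+3.0, 0.0], 0.5, size=(100, 2))])
n, d, K = X.shape[0], X.shape[1], 2

# Step 0: initialize (one sample from each end of the array; illustrative, not robust).
mu = X[[0, n - 1]].copy()
cov = np.array([np.eye(d) for _ in range(K)])
pi = np.full(K, 1.0 / K)
prev_ll = -np.inf

for _ in range(100):
    # E-step: responsibilities alpha[i, j]; only the component loop is explicit.
    joint = np.column_stack([pi[j] * multivariate_normal(mu[j], cov[j]).pdf(X)
                             for j in range(K)])
    beta = joint.sum(axis=1)        # per-sample mixture normalization
    alpha = joint / beta[:, None]

    # M-step: closed-form parameter updates from the responsibilities.
    w = alpha.sum(axis=0)           # effective counts omega_j
    pi = w / n
    mu = (alpha.T @ X) / w[:, None]
    for j in range(K):
        diff = X - mu[j]
        cov[j] = (alpha[:, j, None] * diff).T @ diff / w[j] + 1e-6 * np.eye(d)

    # Step 3: terminate once the log-likelihood stops improving.
    ll = np.log(beta).sum()
    if abs(ll - prev_ll) < 1e-8:
        break
    prev_ll = ll

print(np.sort(mu[:, 0]))   # recovered component means along x, near -3 and +3
```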

ⓘ EM is sensitive to the initial parameter values, so special care must be taken in the first step. Given “valid” initial values, however, the EM algorithm guarantees that the log-likelihood increases after every iteration.

We provide a basic class structure to encapsulate GMM parameters and some utility methods.

### 2.3 EM Implementation

#### Compute the normalization of the mixture for all the samples

Compute the mixture normalization factors $\boldsymbol{\beta} = \{\beta_0,\ldots,\beta_{n-1}\}$, given the current parameters $\boldsymbol{\theta}$, for all the data samples, as:

$$ \beta_i := \sum_{c=0}^{K-1} \pi_c\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c) . $$

You will use the result of this function in the expectation step and log-likelihood computation.

Deliverable 4 [10%]

☑ Complete the implementation of normalization which takes as input the GMM instance and updates its $\boldsymbol{\beta}$ property gmm.beta.

#### Expectation step

Using the values of $\beta_i$, you have to compute the posterior probabilities $\boldsymbol{\alpha}$ with elements $\alpha_{i,j}$ for the mixture components, as:

$$ \alpha_{i,j} = \frac{\pi_j\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\beta_i} . $$

Deliverable 5 [10%]

☑ Complete the implementation of expectation which takes as input the GMM instance and updates its $\boldsymbol{\alpha}$ property gmm.alpha.

#### Maximization step

Using the values of $\alpha_{i,j}$ you have to update the following quantities:

$$ z_i = \underset{j}{\operatorname{argmax}}\ \alpha_{i,j} \qquad \omega_j = \sum_{i=0}^{n-1} \alpha_{i,j} \qquad \pi_j = \frac{\omega_j}{n} \qquad \boldsymbol{\mu}_j = \frac{1}{\omega_j} \sum_{i=0}^{n-1} \alpha_{i,j}\, \mathbf{x}_i $$

$$ \boldsymbol{\Sigma}_j = \frac{1}{\omega_j} \sum_{i=0}^{n-1} \alpha_{i,j}\, (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^\intercal + \boldsymbol{\Sigma}_{\text{regularization}} $$

ⓘ During EM iterations with GMMs, covariance matrices may become singular. To circumvent this problem, consider regularizing the covariance matrices when necessary; note, however, that using too large a regularization can lead to divergence (e.g., a possible reduction in the log-likelihood estimate between iterations).

ⓘ When completing your implementation, you should only require explicit loops over the mixture components. All other operations can be vectorized with numpy routines, such as sum(, axis=), dot(, axis=), argmax(, axis=) and the component-wise array operators +, -, * and /. We similarly provide you with access to scipy’s multivariate_normal class, whose (efficient and vectorized) pdf method implements the Gaussian density evaluation.

Carefully review the GMM class we provide in the base code.
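The per-component covariance sum in the maximization step is one place where vectorization pays off. One possible approach (an assumption, not the only valid one) removes even the component loop with einsum, checked here against a loop-based reference on random made-up responsibilities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 500, 2, 3
X = rng.normal(size=(n, d))
alpha = rng.random((n, K))
alpha /= alpha.sum(axis=1, keepdims=True)   # valid responsibilities: rows sum to 1

w = alpha.sum(axis=0)                       # omega_j, shape (K,)
mu = (alpha.T @ X) / w[:, None]             # means, shape (K, d)

# Covariances for all K components with no explicit loop at all:
diff = X[:, None, :] - mu[None, :, :]       # (n, K, d) centered samples
cov = np.einsum('nk,nkd,nke->kde', alpha, diff, diff) / w[:, None, None]

# Loop-based reference for one component, to check the einsum.
d0 = X - mu[0]
cov0 = (alpha[:, 0, None] * d0).T @ d0 / w[0]
print(np.allclose(cov[0], cov0))   # → True
```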

#### Computing the log-likelihood

We can finally compute the log-likelihood estimate using the current $\boldsymbol{\beta}$:

$$ \log P(\hat{\mathbf{x}}|\boldsymbol{\theta}) = \sum_{i=0}^{n-1} \log(\beta_i) . $$
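Given per-sample β values, this reduces to one line; the β values below are fabricated for illustration. Summing logs (rather than taking the log of the product) avoids floating-point underflow for large n:

```python
import numpy as np

beta = np.array([0.2, 0.05, 0.6])       # illustrative per-sample mixture densities
log_likelihood = np.sum(np.log(beta))   # log P(x_hat | theta) = sum_i log(beta_i)
print(log_likelihood)
```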

Deliverable 6 [20%]

☑ Complete the implementation of maximization which updates the $z_i$, $\omega_j$, $\pi_j$, $\boldsymbol{\mu}_j$, $\boldsymbol{\Sigma}_j$ in the input GMM instance; the associated member variables are gmm.Z, gmm.weight, gmm.pi, gmm.mu, gmm.cov. Afterwards, you can update the multivariate normal parameters of the mixture elements in gmm.gauss with $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$.

ⓘ Don’t forget to regularize the covariance matrices.
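One common regularization — an assumption; your course may prescribe a specific scheme — adds a small multiple of the identity to each covariance so it stays invertible:

```python
import numpy as np

def regularize(cov, eps=1e-6):
    """Add eps * I to each (K, d, d) covariance so it stays invertible."""
    d = cov.shape[-1]
    return cov + eps * np.eye(d)

# A singular (rank-1) covariance becomes invertible after regularization.
cov = np.array([[[1.0, 1.0],
                 [1.0, 1.0]]])          # det = 0: singular
reg = regularize(cov)
print(np.linalg.det(reg[0]) > 0)        # → True
```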

Deliverable 7 [10%]

☑ Complete the implementation of logLikelihood which computes the log-likelihood $\log P(\hat{\mathbf{x}}|\boldsymbol{\theta})$, returning nothing and updating the GMM instance variable gmm.log_likelihoods.