## Description

## 1 Discrete and Gaussian Variables

### 1.1 MLE for General Discrete Distribution

Consider a density estimation task, where we have two variables ($d = 2$) that can each take one of $k$ discrete values. For example, we could have

$$X = \begin{bmatrix} 1 & 3 \\ 4 & 2 \\ k & 3 \\ 1 & k-1 \end{bmatrix}.$$

The likelihood for example $x^i$ under a general discrete distribution would be

$$p(x^i_1, x^i_2 \mid \Theta) = \theta_{x^i_1, x^i_2},$$

where $\theta_{c_1,c_2}$ gives the probability of $x_1$ being in state $c_1$ and $x_2$ being in state $c_2$, for all the $k^2$ combinations of the two variables. In order for this to define a valid probability, we need all elements $\theta_{c_1,c_2}$ to be non-negative and they must sum to one, $\sum_{c_1=1}^{k}\sum_{c_2=1}^{k} \theta_{c_1,c_2} = 1$.

1. Given $n$ training examples, derive the MLE for the $k^2$ elements of $\Theta$.

2. Because of the sum-to-1 constraint, there are only $(k^2 - 1)$ degrees of freedom in the discrete distribution, and not $k^2$. Derive the MLE for this distribution assuming that

$$\theta_{k,k} = 1 - \sum_{c_1=1}^{k}\sum_{c_2=1}^{k} \mathbb{I}[c_1 \neq k, c_2 \neq k]\,\theta_{c_1,c_2},$$

so that the distribution only has $(k^2 - 1)$ parameters.

3. If we had separate parameters $\theta_{c_1}$ and $\theta_{c_2}$ for each variable, a reasonable choice of prior would be a product of Dirichlet distributions,

$$p(\theta_{c_1}, \theta_{c_2}) \propto \theta_{c_1}^{\alpha_{c_1}-1}\,\theta_{c_2}^{\alpha_{c_2}-1}.$$

For the general discrete distribution, a prior encoding the same assumptions would be

$$p(\theta_{c_1,c_2}) \propto \theta_{c_1,c_2}^{\alpha_{c_1}+\alpha_{c_2}-2}.$$

Derive the MAP estimate under this prior.

Hint: it is convenient to write the likelihood for an example $i$ in the form

$$p(x^i \mid \Theta) = \prod_{c \in [k]^2} \theta_c^{\mathbb{I}[x^i = c]},$$

where $c$ is a vector containing $(c_1, c_2)$, $\mathbb{I}[x^i = c]$ evaluates to 1 if all elements are equal, and $[k]^2$ is all ordered pairs $(c_1, c_2)$. You can use the Lagrangian to enforce the sum-to-1 constraint on the log-likelihood, and you may find it convenient to define $N_c = \sum_{i=1}^{n} \mathbb{I}[x^i = c]$.
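As a numerical sanity check for the derivation (an illustrative Python sketch, not part of the assignment code), the standard MLE result for a categorical distribution is the empirical frequency of each pair, $\hat{\theta}_{c_1,c_2} = N_{c_1,c_2}/n$:

```python
import numpy as np

def discrete_mle(X, k):
    """MLE for the general discrete distribution over pairs (x1, x2):
    the empirical frequency of each pair, theta[c1, c2] = N_{c1,c2} / n."""
    n = X.shape[0]
    theta = np.zeros((k, k))
    for x1, x2 in X:
        theta[x1 - 1, x2 - 1] += 1  # accumulate the counts N_c
    return theta / n

# The example dataset above, instantiated with k = 5 (so k - 1 = 4).
X = np.array([[1, 3], [4, 2], [5, 3], [1, 4]])
theta = discrete_mle(X, 5)  # each observed pair gets probability 1/4
```

The sum-to-1 constraint holds automatically since the counts sum to $n$.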

### 1.2 Generative Classifiers with Gaussian Assumption

Consider the 3-class classification dataset in this image:

In this dataset, we have 2 features and each colour represents one of the classes. Note that the classes are highly structured: the colours each roughly follow a Gaussian distribution plus some noisy samples.

Since we have an idea of what the features look like for each class, we might consider classifying inputs $x$ using a generative classifier. In particular, we are going to use Bayes rule to write

$$p(y = c \mid x, \Theta) = \frac{p(x \mid y = c, \Theta)\, p(y = c \mid \Theta)}{p(x \mid \Theta)},$$

where $\Theta$ represents the parameters of our model. To classify a new example $\hat{x}$, generative classifiers would use

$$\hat{y} = \arg\max_{c \in \{1, 2, \dots, k\}} p(\hat{x} \mid y = c, \Theta)\, p(y = c \mid \Theta),$$

where in our case the total number of classes $k$ is 3. (The denominator $p(\hat{x} \mid \Theta)$ is irrelevant to the classification since it is the same for all $c$.) Modeling $p(y = c \mid \Theta)$ is easy: we can just use a $k$-state categorical distribution,

$$p(y = c \mid \Theta) = \theta_c,$$

where $\theta_c$ is a single parameter for class $c$. The maximum likelihood estimate of $\theta_c$ is given by $n_c/n$, the number of times we have $y^i = c$ (which we've called $n_c$) divided by the total number of data points $n$.

Modeling p(x|y = c, Θ) is the hard part: we need to know the probability of seeing the feature vector x given

that we are in class c. This corresponds to solving a density estimation problem for each of the k possible

classes. To make the density estimation problem tractable, we’ll assume that the distribution of x given that

y = c is given by a N (µc, Σc) Gaussian distribution for a class-specific µc and Σc,

$$p(x \mid y = c, \Theta) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_c|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)\right).$$
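This density can be evaluated as follows (an illustrative Python sketch using a Cholesky factorization for numerical stability; the assignment's own code is in Matlab):

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log of the N(mu, Sigma) density above, via a Cholesky factorization."""
    d = len(mu)
    L = np.linalg.cholesky(Sigma)                 # Sigma = L L^T
    z = np.linalg.solve(L, x - mu)                # z^T z = (x-mu)^T Sigma^{-1} (x-mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log |Sigma|
    return -0.5 * (d * np.log(2 * np.pi) + logdet + z @ z)
```

Working in log space avoids the underflow that evaluating the density directly would cause in high dimensions.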

Since we are distinguishing between the probability under $k$ different Gaussians to make our classification, this is called Gaussian discriminant analysis (GDA). In the special case where we have a constant $\Sigma_c = \Sigma$ across all classes it is known as linear discriminant analysis (LDA), since it leads to a linear classifier between any two classes (while the region of space assigned to each class forms a convex polyhedron, as in k-means clustering). Another common restriction on the $\Sigma_c$ is that they are diagonal matrices, since this only requires $O(d)$ parameters instead of $O(d^2)$ (corresponding to assuming that the features are independent univariate Gaussians given the class label). Given a dataset $\mathcal{D} = \{(x^i, y^i)\}_{i=1}^{n}$, where $x^i \in \mathbb{R}^d$ and $y^i \in \{1, \dots, k\}$, the maximum likelihood estimate (MLE) for the $\mu_c$ and $\Sigma_c$ in the GDA model is the solution to

$$\arg\max_{\mu_1, \mu_2, \dots, \mu_k,\, \Sigma_1, \Sigma_2, \dots, \Sigma_k} \prod_{i=1}^{n} p(x^i \mid y^i, \mu_{y^i}, \Sigma_{y^i}).$$

This means that the negative log-likelihood will be equal to

$$-\log p(X \mid y, \Theta) = -\sum_{i=1}^{n} \log p(x^i \mid y^i, \mu_{y^i}, \Sigma_{y^i}) = \sum_{i=1}^{n} \frac{1}{2}(x^i - \mu_{y^i})^T \Sigma_{y^i}^{-1} (x^i - \mu_{y^i}) + \frac{1}{2}\sum_{i=1}^{n} \log|\Sigma_{y^i}| + \text{const.}$$

1. Derive the MLE for the GDA model under the assumption of common diagonal covariance matrices, $\Sigma_c = D$ ($d$ parameters). (Each class will have its own mean $\mu_c$.)

2. Derive the MLE for the GDA model under the assumption of individual scale-identity matrices, $\Sigma_c = \sigma_c^2 I$ ($k$ parameters).

3. It's painful to derive these from scratch, but you should be able to see a pattern that would allow other common restrictions. Without deriving the result from scratch (hopefully), give the MLE for the case of individual full covariance matrices, $\Sigma_c$ ($O(kd^2)$ parameters).

4. When you run example generative, it loads a variant of the dataset in the figure that has 12 features and 10 classes. This data has been split up into a training and test set, and the code fits a k-nearest neighbour classifier to the training set then reports the accuracy on the test data (∼36%). The k-nearest neighbour model does poorly here since it doesn't take into account the Gaussian-like structure in feature space for each class label. Write a function generativeGaussian that fits a GDA model to this dataset (using individual full covariance matrices). Hand in the function and report the test set accuracy.
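The fit-and-classify logic for part 4 can be sketched as follows. This is an illustrative Python analogue, not the Matlab generativeGaussian function the assignment asks for; the (theta_c, mu_c, Sigma_c) list representation is an assumption of this sketch.

```python
import numpy as np

def generative_gaussian_fit(X, y, k):
    """GDA with individual full covariances: per-class prior, mean, covariance MLEs."""
    n, d = X.shape
    model = []
    for c in range(1, k + 1):
        Xc = X[y == c]
        theta_c = len(Xc) / n                              # MLE of p(y = c)
        mu_c = Xc.mean(axis=0)                             # MLE of class mean
        Sigma_c = (Xc - mu_c).T @ (Xc - mu_c) / len(Xc)    # MLE of class covariance
        model.append((theta_c, mu_c, Sigma_c))
    return model

def generative_gaussian_predict(model, Xhat):
    """Classify by arg max over c of log p(x | y=c) + log p(y=c)."""
    scores = []
    for theta_c, mu_c, Sigma_c in model:
        sign, logdet = np.linalg.slogdet(Sigma_c)
        diff = Xhat - mu_c
        mahal = np.sum(diff @ np.linalg.inv(Sigma_c) * diff, axis=1)
        scores.append(np.log(theta_c) - 0.5 * logdet - 0.5 * mahal)
    return np.argmax(np.array(scores), axis=0) + 1   # classes are 1-indexed
```

The constant $(2\pi)^{d/2}$ term is dropped since it is the same for every class.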

5. In this question we would like to replace the Gaussian distribution of the previous problem with

the more robust multivariate-t distribution so that it isn’t influenced as much by the noisy data.

Unlike the previous case, we don’t have a closed-form solution for the parameters. However, if you

run example tdist it generates random noisy data and fits a multivariate-t model (you will need to

add the minFunc directory to the Matlab path for the demo to work). By using the multivariateT

model, write a new function generativeStudent that implements a generative model that is based on

the multivariate-t distribution instead of the Gaussian distribution. Report the test accuracy with this

model.

Hints: you will be able to substantially simplify the notation in parts 1-3 if you use the notation $\sum_{i \in y_c}$ to mean the sum over all values $i$ where $y^i = c$. Similarly, you can use $n_c$ to denote the number of cases where $y^i = c$, so that we have $\sum_{i \in y_c} 1 = n_c$. Note that the determinant of a diagonal matrix is the product of the diagonal entries, and the inverse of a diagonal matrix is a diagonal matrix with the reciprocals of the original matrix along the diagonal. For part three you can use the result from class regarding the MLE of a general multivariate Gaussian. You may find it helpful to use the included logdet.m function to compute the log-determinant in a more numerically stable way.
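A stable log-determinant in the spirit of logdet.m can be written via a Cholesky factorization (illustrative Python sketch; the assignment provides the Matlab version):

```python
import numpy as np

def logdet_chol(A):
    """Stable log|A| for symmetric positive-definite A:
    since A = L L^T, log|A| = 2 * sum(log(diag(L)))."""
    return 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(A))))

# For a diagonal matrix D, |D| is the product of the diagonal entries,
# so log|D| is the sum of their logs (and D^{-1} holds their reciprocals).
D = np.diag([2.0, 4.0])
assert np.isclose(logdet_chol(D), np.log(2.0) + np.log(4.0))
```

Computing the determinant first and then taking its log would overflow or underflow for large $d$; summing logs of the Cholesky diagonal avoids this.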


### 1.3 Self-Conjugacy for the Mean Parameter

If $x$ is distributed according to a Gaussian with mean $\mu$,

$$x \sim \mathcal{N}(\mu, \sigma^2),$$

and we assume that $\mu$ itself is distributed according to a Gaussian,

$$\mu \sim \mathcal{N}(\alpha, \gamma^2),$$

then the posterior $\mu \mid x$ also follows a Gaussian distribution.¹ Derive the form of the (Gaussian) distribution for $p(\mu \mid x, \alpha, \sigma^2, \gamma^2)$.

Hints: Use Bayes rule and use the $\propto$ sign to get rid of factors that don't depend on $\mu$. You can "complete the square" to make the product look like a Gaussian distribution, e.g. when you have $\exp(ax^2 - bx + \text{const})$ you can factor out an $a$ and add/subtract $(b/2a)^2$ to re-write it as

$$\exp(ax^2 - bx + \text{const}) \propto \exp(ax^2 - bx) = \exp\big(a(x^2 - (b/a)x)\big) \propto \exp\big(a(x^2 - (b/a)x + (b/2a)^2)\big) = \exp\big(a(x - (b/2a))^2\big).$$

Note that multiplying by factors that do not depend on $\mu$ within the exponent does not change the distribution. In this question you will want to complete the square to get the distribution on $\mu$, rather than $x$. You may find it easier to solve this problem if you parameterize the Gaussians in terms of their 'precision' parameters (e.g., $\lambda = 1/\sigma^2$, $\lambda_0 = 1/\gamma^2$) rather than their variances $\sigma^2$ and $\gamma^2$.
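Once you have a candidate answer, you can check it numerically. The well-known conjugacy result is that the posterior precision is $\lambda + \lambda_0$ with mean $(\lambda x + \lambda_0 \alpha)/(\lambda + \lambda_0)$; this sketch verifies that the unnormalized log-posterior differs from that Gaussian's log-density only by a constant in $\mu$:

```python
import numpy as np

x, alpha, sigma2, gamma2 = 1.3, -0.5, 2.0, 0.7
lam, lam0 = 1 / sigma2, 1 / gamma2            # precision parameters
post_prec = lam + lam0                        # claimed posterior precision
post_mean = (lam * x + lam0 * alpha) / post_prec

mu = np.linspace(-3, 3, 7)
# log p(x | mu) + log p(mu), dropping terms that don't depend on mu:
log_joint = -0.5 * lam * (x - mu) ** 2 - 0.5 * lam0 * (mu - alpha) ** 2
log_post = -0.5 * post_prec * (mu - post_mean) ** 2
# If the claim is right, the gap is the same constant at every mu:
gap = log_joint - log_post
assert np.allclose(gap, gap[0])
```

A constant gap means the two exponents agree up to the normalizing factor, which is exactly what completing the square should produce.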

## 2 Mixture Models and Expectation Maximization

### 2.1 Semi-Supervised Gaussian Discriminant Analysis

Consider fitting a GDA model where some of the $y^i$ values are missing at random. In particular, let's assume we have $n$ labeled examples $(x^i, y^i)$ and then another $t$ unlabeled examples $(x^i)$. This is a special case of semi-supervised learning, and fitting generative models with EM is one of the oldest semi-supervised learning techniques. When the classes exhibit clear structure in the feature space, it can be very effective even if the number of labeled examples is very small.

1. Derive the EM update for fitting the parameters of a GDA model (with individual full covariance matrices) in the semi-supervised setting where we have $n$ labeled examples and $t$ unlabeled examples.

2. If you run the demo example SSL, it will load a variant of the dataset from the previous question, but where the number of labeled examples is small and a large number of unlabeled examples are available. The demo first fits a KNN model and then a generative Gaussian model (once you have finished Question 1). Because the number of labeled examples is quite small, the performance is worse than in Question 2. Write a function generativeGaussianSSL that fits the generative Gaussian model of the previous question using EM to incorporate the unlabeled data. Hand in the function and report the test error when training on the full dataset.

3. Repeat the previous part, but using the “hard”-EM algorithm where we explicitly classify all the

unlabeled examples. How does this change the performance and the number of iterations?

¹We say that the Gaussian distribution is the 'conjugate prior' for the Gaussian mean parameter (we'll formally discuss conjugate priors later in the course). Another reason the Gaussian distribution is important is that it is the only (non-trivial) continuous distribution that has this 'self-conjugacy' property.

Hint: for the first question most of the work has been done for you in the EM notes on the course webpage. You can use the result (**) and the update of $\theta_c$ from those notes, but you will need to work out the update of the parameters of the Gaussian distribution $p(x^i \mid y^i, \Theta)$.

Hint: for the second question, although EM often leads to simple updates, implementing them correctly can often be a pain. One way to help debug your code is to compute the observed-data log-likelihood after every iteration. If this number goes down, then you know your implementation has a problem. You can also test your updates of sets of variables in this way. For example, if you hold the $\mu_c$ and $\Sigma_c$ fixed and only update the $\theta_c$, then the log-likelihood should not go down. In this way, you can test each combination of updates on its own to make sure they are correct.
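The observed-data log-likelihood check described above might look like this (an illustrative Python sketch; the `model` list of (theta_c, mu_c, Sigma_c) triples is a representation assumed here, not one defined by the assignment):

```python
import numpy as np

def observed_data_nll(Xu, model):
    """Observed-data NLL of the unlabeled examples,
    -sum_i log sum_c theta_c N(x^i | mu_c, Sigma_c).
    This should never increase across EM iterations."""
    n, d = Xu.shape
    log_probs = []
    for theta_c, mu_c, Sigma_c in model:
        sign, logdet = np.linalg.slogdet(Sigma_c)
        diff = Xu - mu_c
        mahal = np.sum(diff @ np.linalg.inv(Sigma_c) * diff, axis=1)
        log_probs.append(np.log(theta_c)
                         - 0.5 * (d * np.log(2 * np.pi) + logdet + mahal))
    L = np.array(log_probs)                  # k x t matrix of log joint probs
    m = L.max(axis=0)                        # log-sum-exp for stability
    return -np.sum(m + np.log(np.exp(L - m).sum(axis=0)))
```

Printing this value every iteration (plus the labeled-data NLL) gives the monotonicity check the hint describes.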

### 2.2 Mixture of Bernoullis

The function example Bernoulli loads a binarized version of the MNIST dataset and fits a density model that uses an independent Bernoulli to model each feature. It then reports the average NLL on the test data and shows 4 samples generated from the model. Unfortunately, the test NLL is infinity and the samples look terrible.

1. To address the problem that the average NLL is infinity, modify the densityBernoulli function to implement Laplace smoothing based on an extra argument $\alpha$. Hand in the code and report the average NLL with $\alpha = 1$.

2. Write a new function implementing the mixture of Bernoullis model with Laplace smoothing of the $\theta$ values (note that Laplace smoothing only changes the M-step). Hand in the code and report the average NLL with $\alpha = 1$ and $k = 10$ for a particular run of the algorithm, as well as 4 samples from the model and 4 of the cluster images.
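The Laplace-smoothing fix in part 1 amounts to adding $\alpha$ pseudo-counts to each outcome, $\hat{\theta}_j = (\sum_i x^i_j + \alpha)/(n + 2\alpha)$, which keeps every $\theta_j$ strictly inside $(0, 1)$ so no test example gets probability zero. A sketch of what densityBernoulli might do, as an illustrative Python analogue of the Matlab function:

```python
import numpy as np

def density_bernoulli_fit(X, alpha):
    """Laplace-smoothed estimate of independent Bernoulli parameters.
    X is an (n, d) binary matrix; alpha pseudo-counts go to each of the
    two outcomes, so theta_j is strictly between 0 and 1."""
    n, d = X.shape
    return (X.sum(axis=0) + alpha) / (n + 2 * alpha)

def density_bernoulli_nll(X, theta):
    """Average negative log-likelihood; finite because theta avoids 0 and 1."""
    return -np.mean(np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta),
                           axis=1))
```

With $\alpha = 0$ this reduces to the plain MLE, where a pixel that is always 0 in training makes any test image with that pixel on have NLL of infinity.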

## 3 Project Proposal

For the final part of this assignment, you must submit a project proposal for your course project. The proposal should be a maximum of 2 pages (and 1 page or half a page is ok if you can describe your plan concisely). The proposal should be written for me and the TAs, so you don't need to introduce any ML background but you will need to introduce non-ML topics. The projects must be done in groups of 2-3. If you are doing your assignment in a group that is different from your project group, only 1 group member should include the proposal as part of their submission (we'll do the merge across assignments, and this means that assignments could have multiple proposals). Please state clearly who is involved with each project proposal.

There is quite a bit of flexibility in terms of the type of project you do, as I believe there are many ways

that people can make valuable contributions to research. However, note that ultimately the project will have

three parts:

1. A very short paper review summarizing the pros and cons of a particular paper on the topic (due with Assignment 4).

2. A short literature review summarizing at least 10 papers on a particular topic (due with Assignment

5).

3. A final report containing at most 6 pages of text (the actual document can be longer due to figures,

tables, references, and proofs) that emphasizes a particular “contribution” (i.e., what doing the project

has added to the world).

The reason for this, even though it’s strange for some possible projects, is that this is the standard way that

results are communicated to the research community.


The three main ingredients of the project proposal are:

1. What problem you are focusing on.

2. What you plan to do.

3. What will be the “contribution”.

Also, note that for the course project, negative results (i.e., we tried something that we thought would work in a particular setting but it didn't work) are acceptable.

Here are some standard project “templates” that you might want to follow:

• Application bake-off: you pick a specific application (from your research, personal interests, or

maybe from Kaggle) or a small number of related applications, and try out a bunch of techniques (e.g.,

random forests vs. logistic regression vs. generative models). In this case, the contribution would be

showing that some methods work better than others for this specific application (or your contribution

could be that everything works equally well/badly).

• New application: you pick an application where ML methods have not previously been applied, and you test out whether ML methods are effective for the task. In this case, the contribution would be knowing whether ML is suitable for the task.

• Scaling up: you pick a specific machine learning technique, and you try to figure out how to make it

run faster or on larger datasets (for example, how do we apply kernel methods when n is very large).

In this case, the contribution would be the new technique and an evaluation of its performance.

• Improving performance: you pick a specific machine learning technique, and try to extend it in

some way to improve its performance (for example, how can we efficiently use non-linearity within

graphical models). In this case, the contribution would be the new technique and an evaluation of its

performance.

• Generalization to new setting: you pick a specific machine learning technique, and try to extend

it to a new setting (for example, making a graphical-model version of random forests). In this case,

the contribution would be the new technique and an evaluation of its performance.

• Perspective paper: you pick a specific topic in ML, read a larger number of papers on the topic,

then write a report summarizing what has been done on the topic and what are the most promising

directions of future work. In this case, the contribution would be your summary of the relationships

between the existing works, and your insights about where the field is going.

• Coding project: you pick a specific method or set of methods (like independent component analysis),

and build an implementation of them. In this case, the contribution would be the implementation itself.

• Theory: you pick a theoretical topic (like the variance of cross-validation), read what has been

done about it, and try to prove a new result (usually by relaxing existing assumptions or adding new

assumptions). The contribution could be a new analysis of an existing method, or why some approaches

to analyzing the method will not work.

The above are just suggestions, and many projects will mix several of these templates together, but if you

are having trouble getting going then it’s best to stick with one of the above templates. Also note that the

above includes topics not covered in the course (like random forests), so there is flexibility in the topic, but

the topic should be closely-related to ML.

This question is mandatory but will not be formally marked: it's just a sanity check that you have at least one project idea that fits within the scope of the 540 course project, and it's an excuse for you to allocate some time to thinking about the project. Also, there is flexibility in the choice of project topics even after the proposal: if you want to explore different topics you can ultimately choose to do a project that is unrelated to the one in your proposal/paper-review/literature-review, although it will likely be easier to do all 4 parts on the same topic.
