1 Discrete and Gaussian Variables
1.1 MLE for General Discrete Distribution
Consider a density estimation task, where we have two variables (d = 2) that can each take one of k discrete
values. For example, we could have
\[
X = \begin{bmatrix} 1 & 3 \\ 4 & 2 \\ k & 3 \\ 1 & k-1 \end{bmatrix}.
\]
The likelihood for example $x^i$ under a general discrete distribution would be
\[
p(x_1^i, x_2^i \mid \Theta) = \theta_{x_1^i, x_2^i},
\]
where $\theta_{c_1,c_2}$ gives the probability of $x_1$ being in state $c_1$ and $x_2$ being in state $c_2$, for all the $k^2$ combinations of the two variables. In order for this to define a valid probability, we need all elements $\theta_{c_1,c_2}$ to be non-negative and they must sum to one, $\sum_{c_1=1}^{k}\sum_{c_2=1}^{k} \theta_{c_1,c_2} = 1$.
1. Given $n$ training examples, derive the MLE for the $k^2$ elements of $\Theta$.
2. Because of the sum-to-1 constraint, there are only $(k^2 - 1)$ degrees of freedom in the discrete distribution, and not $k^2$. Derive the MLE for this distribution assuming that
\[
\theta_{k,k} = 1 - \sum_{c_1=1}^{k}\sum_{c_2=1}^{k} \mathcal{I}[(c_1, c_2) \neq (k,k)]\,\theta_{c_1,c_2},
\]
so that the distribution only has $(k^2 - 1)$ parameters.
3. If we had separate parameters $\theta_{c_1}$ and $\theta_{c_2}$ for each variable, a reasonable choice of prior would be a product of Dirichlet distributions,
\[
p(\theta_{c_1}, \theta_{c_2}) \propto \theta_{c_1}^{\alpha_{c_1}-1}\,\theta_{c_2}^{\alpha_{c_2}-1}.
\]
For the general discrete distribution, a prior encoding the same assumptions would be
\[
p(\theta_{c_1,c_2}) \propto \theta_{c_1,c_2}^{\alpha_{c_1}+\alpha_{c_2}-2}.
\]
Derive the MAP estimate under this prior.
Hint: it is convenient to write the likelihood for an example $i$ in the form
\[
p(x^i \mid \Theta) = \prod_{c \in [k]^2} \theta_c^{\mathcal{I}[x^i = c]},
\]
where $c$ is a vector containing $(c_1, c_2)$, $\mathcal{I}[x^i = c]$ evaluates to 1 if all elements are equal, and $[k]^2$ is all ordered pairs $(c_1, c_2)$. You can use the Lagrangian to enforce the sum-to-1 constraint on the log-likelihood, and you may find it convenient to define $N_c = \sum_{i=1}^{n} \mathcal{I}[x^i = c]$.
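As a concrete starting point for part 1 (this is only the set-up the hint suggests, not the full derivation), the Lagrangian for the constrained log-likelihood in this notation is
\[
\mathcal{L}(\Theta, \lambda) = \sum_{c \in [k]^2} N_c \log \theta_c + \lambda\Bigl(1 - \sum_{c \in [k]^2} \theta_c\Bigr),
\]
and setting its partial derivatives with respect to each $\theta_c$ and $\lambda$ to zero gives the stationarity conditions you need.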
1.2 Generative Classifiers with Gaussian Assumption
Consider the 3-class classification dataset in this image:
In this dataset, we have 2 features and each colour represents one of the classes. Note that the classes are
highly structured: the colours each roughly follow a Gaussian distribution, plus some noisy samples.
Since we have an idea of what the features look like for each class, we might consider classifying inputs x
using a generative classifier. In particular, we are going to use Bayes rule to write
\[
p(y = c \mid x, \Theta) = \frac{p(x \mid y = c, \Theta)\, p(y = c \mid \Theta)}{p(x \mid \Theta)},
\]
where $\Theta$ represents the parameters of our model. To classify a new example $\hat{x}$, generative classifiers would use
\[
\hat{y} = \mathop{\mathrm{arg\,max}}_{c \in \{1,2,\dots,k\}} \; p(\hat{x} \mid y = c, \Theta)\, p(y = c \mid \Theta),
\]
where in our case the total number of classes $k$ is 3. (The denominator $p(\hat{x} \mid \Theta)$ is irrelevant to the classification since it is the same for all $y$.) Modeling $p(y = c \mid \Theta)$ is easy: we can just use a $k$-state categorical distribution,
\[
p(y = c \mid \Theta) = \theta_c,
\]
where $\theta_c$ is a single parameter for class $c$. The maximum likelihood estimate of $\theta_c$ is given by $n_c/n$, the number of times we have $y^i = c$ (which we’ve called $n_c$) divided by the total number of data points $n$.
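To see where this estimate comes from (a short sketch of the standard argument, using the same Lagrangian idea as the hint in the previous section), write the likelihood of the labels as $\prod_{i=1}^{n} \theta_{y^i} = \prod_{c=1}^{k} \theta_c^{n_c}$ and maximize the log-likelihood subject to the sum-to-one constraint:
\[
\max_{\theta}\; \sum_{c=1}^{k} n_c \log \theta_c \quad \text{subject to} \quad \sum_{c=1}^{k} \theta_c = 1
\qquad \Longrightarrow \qquad \hat{\theta}_c = \frac{n_c}{n}.
\]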
Modeling p(x|y = c, Θ) is the hard part: we need to know the probability of seeing the feature vector x given
that we are in class c. This corresponds to solving a density estimation problem for each of the k possible
classes. To make the density estimation problem tractable, we’ll assume that the distribution of x given that
$y = c$ is given by a $\mathcal{N}(\mu_c, \Sigma_c)$ Gaussian distribution for a class-specific $\mu_c$ and $\Sigma_c$,
\[
p(x \mid y = c, \Theta) = \frac{1}{(2\pi)^{d/2}|\Sigma_c|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1}(x - \mu_c)\right).
\]
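As an implementation aside (a hedged illustration only: this helper is not part of the provided assignment code, and its name and interface are made up for the example), the log of this density can be evaluated in a numerically stable way through a Cholesky factorization rather than by explicitly forming $\Sigma_c^{-1}$ and $|\Sigma_c|$:

function logp = gaussianLogDensity(x, mu, Sigma)
% Log of the multivariate Gaussian density above, evaluated via a Cholesky
% factorization Sigma = R'*R (assumes Sigma is positive definite).
d = length(x);
R = chol(Sigma);                      % upper-triangular factor
z = R' \ (x(:) - mu(:));              % solves R'*z = (x - mu)
logp = -(d/2)*log(2*pi) - sum(log(diag(R))) - 0.5*(z'*z);
end

Here $-\frac{1}{2}\log|\Sigma_c|$ becomes $-\sum_j \log R_{jj}$, since $|\Sigma_c| = \prod_j R_{jj}^2$.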
Since we are distinguishing between the probability under $k$ different Gaussians to make our classification, this is called Gaussian discriminant analysis (GDA). In the special case where we have a constant $\Sigma_c = \Sigma$ across all classes it is known as linear discriminant analysis (LDA), since it leads to a linear classifier between any two classes (while the region of space assigned to each class forms a convex polyhedron, as in k-means clustering). Another common restriction on the $\Sigma_c$ is that they are diagonal matrices, since this only requires $O(d)$ parameters instead of $O(d^2)$ (corresponding to assuming that the features are independent univariate Gaussians given the class label). Given a dataset $\mathcal{D} = \{(x^i, y^i)\}_{i=1}^{n}$, where $x^i \in \mathbb{R}^d$ and $y^i \in \{1, \dots, k\}$, the maximum likelihood estimate (MLE) for the $\mu_c$ and $\Sigma_c$ in the GDA model is the solution to
\[
\mathop{\mathrm{arg\,max}}_{\mu_1, \mu_2, \dots, \mu_k, \Sigma_1, \Sigma_2, \dots, \Sigma_k} \; \prod_{i=1}^{n} p(x^i \mid y^i, \mu_{y^i}, \Sigma_{y^i}).
\]
This means that the negative log-likelihood will be equal to
\[
-\log p(X \mid y, \Theta) = -\sum_{i=1}^{n} \log p(x^i \mid y^i, \mu_{y^i}, \Sigma_{y^i})
= \sum_{i=1}^{n} \frac{1}{2}(x^i - \mu_{y^i})^T \Sigma_{y^i}^{-1} (x^i - \mu_{y^i}) + \frac{1}{2}\sum_{i=1}^{n} \log|\Sigma_{y^i}| + \text{const.}
\]
1. Derive the MLE for the GDA model under the assumption of common diagonal covariance matrices, $\Sigma_c = D$ ($d$ parameters). (Each class will have its own mean $\mu_c$.)
2. Derive the MLE for the GDA model under the assumption of individual scale-identity matrices, $\Sigma_c = \sigma_c^2 I$ ($k$ parameters).
3. It’s painful to derive these from scratch, but you should be able to see a pattern that would allow other common restrictions. Without deriving the result from scratch (hopefully), give the MLE for the case of individual full covariance matrices, $\Sigma_c$ ($O(kd^2)$ parameters).
4. When you run example generative it loads a variant of the dataset in the figure that has 12 features and 10 classes. This data has been split up into a training and test set, and the code fits a k-nearest neighbour classifier to the training set then reports the accuracy on the test data (∼36%). The k-nearest neighbour model does poorly here since it doesn’t take into account the Gaussian-like structure in feature space for each class label. Write a function generativeGaussian that fits a GDA model to this dataset (using individual full covariance matrices); a minimal sketch of such a function appears after the hints below. Hand in the function and report the test set accuracy.
5. In this question we would like to replace the Gaussian distribution of the previous problem with
the more robust multivariate-t distribution so that it isn’t influenced as much by the noisy data.
Unlike the previous case, we don’t have a closed-form solution for the parameters. However, if you
run example tdist it generates random noisy data and fits a multivariate-t model (you will need to
add the minFunc directory to the Matlab path for the demo to work). By using the multivariateT
model, write a new function generativeStudent that implements a generative model that is based on
the multivariate-t distribution instead of the Gaussian distribution. Report the test accuracy with this
model.
Hints: you will be able to substantially simplify the notation in parts 1-3 if you use the notation $\sum_{i \in y_c}$ to mean the sum over all values $i$ where $y^i = c$. Similarly, you can use $n_c$ to denote the number of cases where $y^i = c$, so that we have $\sum_{i \in y_c} 1 = n_c$. Note that the determinant of a diagonal matrix is the product of the diagonal entries, and the inverse of a diagonal matrix is a diagonal matrix with the reciprocals of the original matrix along the diagonal. For part three you can use the result from class regarding the MLE of a general multivariate Gaussian. You may find it helpful to use the included logdet.m function to compute the log-determinant in a more numerically stable way.
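As referenced in part 4, the following is a minimal sketch of the kind of function that part asks for, under the assumption that the training data arrives as an $n \times d$ matrix X and an $n \times 1$ label vector y with values in $\{1, \dots, k\}$. The struct fields and interface here are illustrative assumptions, not the course's actual API, and the prediction step (which would combine these parameters with the class-conditional log-densities, e.g. using the logdet.m helper) is omitted.

function model = generativeGaussian(X, y)
% Fit a GDA model with individual full covariance matrices by maximum likelihood.
[n, d] = size(X);
k = max(y);
model.theta = zeros(k, 1);      % class priors
model.mu    = zeros(k, d);      % class means
model.Sigma = zeros(d, d, k);   % class covariance matrices
for c = 1:k
    Xc = X(y == c, :);          % examples with label c
    nc = size(Xc, 1);
    model.theta(c) = nc / n;                        % MLE of p(y = c)
    model.mu(c, :) = mean(Xc, 1);                   % MLE of the class mean
    Xc = Xc - repmat(model.mu(c, :), nc, 1);        % center the class examples
    model.Sigma(:, :, c) = (Xc' * Xc) / nc;         % MLE of the class covariance
end
end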
1.3 Self-Conjugacy for the Mean Parameter
If $x$ is distributed according to a Gaussian with mean $\mu$,
\[
x \sim \mathcal{N}(\mu, \sigma^2),
\]
and we assume that $\mu$ itself is distributed according to a Gaussian,
\[
\mu \sim \mathcal{N}(\alpha, \gamma^2),
\]
then the posterior $\mu \mid x$ also follows a Gaussian distribution. (We say that the Gaussian distribution is the ‘conjugate prior’ for the Gaussian mean parameter; we’ll formally discuss conjugate priors later in the course. Another reason the Gaussian distribution is important is that it is the only non-trivial continuous distribution that has this ‘self-conjugacy’ property.) Derive the form of the (Gaussian) distribution for $p(\mu \mid x, \alpha, \sigma^2, \gamma^2)$.
Hints: Use Bayes rule and use the $\propto$ sign to get rid of factors that don’t depend on $\mu$. You can “complete the square” to make the product look like a Gaussian distribution, e.g. when you have $\exp(ax^2 - bx + \text{const})$ you can factor out an $a$ and add/subtract $(b/2a)^2$ to re-write it as
\[
\exp\left(ax^2 - bx + \text{const}\right) \propto \exp\left(ax^2 - bx\right) = \exp\left(a(x^2 - (b/a)x)\right) \propto \exp\left(a(x^2 - (b/a)x + (b/2a)^2)\right) = \exp\left(a(x - b/2a)^2\right).
\]
Note that multiplying by factors that do not depend on $\mu$ within the exponent does not change the distribution. In this question you will want to complete the square to get the distribution on $\mu$, rather than $x$. You may find it easier to solve this problem if you parameterize the Gaussians in terms of their ‘precision’ parameters (e.g., $\lambda = 1/\sigma^2$, $\lambda_0 = 1/\gamma^2$) rather than their variances $\sigma^2$ and $\gamma^2$.
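As a starting point (only the set-up, written with the precision parameters suggested above; completing the square is left to you), Bayes rule gives
\[
p(\mu \mid x, \alpha, \sigma^2, \gamma^2) \propto p(x \mid \mu)\, p(\mu) \propto \exp\Bigl(-\tfrac{\lambda}{2}(x - \mu)^2 - \tfrac{\lambda_0}{2}(\mu - \alpha)^2\Bigr),
\]
and collecting the terms in $\mu$ inside the exponent yields the Gaussian form of the posterior.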
2 Mixture Models and Expectation Maximization
2.1 Semi-Supervised Gaussian Discriminant Analysis
Consider fitting a GDA model where some of the $y^i$ values are missing at random. In particular, let’s assume we have $n$ labeled examples $(x^i, y^i)$ and then another $t$ unlabeled examples $(x^i)$. This is a special case of semi-supervised learning, and fitting generative models with EM is one of the oldest semi-supervised learning techniques. When the classes exhibit clear structure in the feature space, it can be very effective even if the number of labeled examples is very small.
1. Derive the EM update for fitting the parameters of a GDA model (with individual full covariance matrices) in the semi-supervised setting where we have $n$ labeled examples and $t$ unlabeled examples.
2. If you run the demo example SSL, it will load a variant of the dataset from the previous question, but where the number of labeled examples is small and a large number of unlabeled examples are available. The demo first fits a KNN model and then a generative Gaussian model (once you are finished Question 1). Because the number of labeled examples is quite small, the performance is worse than in Question 2. Write a function generativeGaussianSSL that fits the generative Gaussian model of the previous question, using EM to incorporate the unlabeled data. Hand in the function and report the test error when training on the full dataset.
3. Repeat the previous part, but using the “hard”-EM algorithm where we explicitly classify all the unlabeled examples. How does this change the performance and the number of iterations?
Hint: for the first question most of the work has been done for you in the EM notes on the course webpage. You can use the result (**) and the update of $\theta_c$ from those notes, but you will need to work out the update of the parameters of the Gaussian distribution $p(x^i \mid y^i, \Theta)$.
Hint: for the second question, although EM often leads to simple updates, implementing them correctly can often be a pain. One way to help debug your code is to compute the observed-data log-likelihood after every iteration. If this number goes down, then you know your implementation has a problem. You can also test your updates of sets of variables in this way. For example, if you hold the $\mu_c$ and $\Sigma_c$ fixed and only update the $\theta_c$, then the log-likelihood should not go down. In this way, you can test each combination of updates on its own to make sure they are correct.
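For reference, a sketch of what this debugging check would compute (assuming the unlabeled examples are indexed $i = n+1, \dots, n+t$; the indexing in the provided code may differ): the observed-data log-likelihood combines the labeled and unlabeled examples as
\[
\ell(\Theta) = \sum_{i=1}^{n} \log\bigl[\theta_{y^i}\, p(x^i \mid y^i, \Theta)\bigr] + \sum_{i=n+1}^{n+t} \log\Bigl[\sum_{c=1}^{k} \theta_c\, p(x^i \mid y^i = c, \Theta)\Bigr],
\]
where the first sum is over the labeled examples and the second marginalizes over the unknown labels.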
2.2 Mixture of Bernoullis
The function example Bernoulli loads a binarized version of the MNIST dataset and fits a density model that uses an independent Bernoulli to model each feature. It then reports the average NLL on the test data and shows 4 samples generated from the model. Unfortunately, the test NLL is infinity and the samples look terrible.
1. To address the problem that the average NLL is infinity, modify the densityBernoulli function to implement Laplace smoothing based on an extra argument α (one standard form of the smoothed estimate is sketched after this list). Hand in the code and report the average NLL with α = 1.
2. Write a new function implementing the mixture of Bernoullis model with Laplace smoothing of the θ values (note that Laplace smoothing only changes the M-step). Hand in the code and report the average NLL with α = 1 and k = 10 for a particular run of the algorithm, as well as 4 samples from the model and 4 of the cluster images.
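As referenced in part 1, one standard form of the Laplace-smoothed estimate for a Bernoulli parameter $\theta_j$ (a sketch assuming a symmetric pseudo-count $\alpha$ on both outcomes; the convention expected by the assignment code may differ) is
\[
\hat{\theta}_j = \frac{\alpha + \sum_{i=1}^{n} x_j^i}{2\alpha + n},
\]
which keeps every estimated probability strictly between 0 and 1 and so avoids the infinite test NLL.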
3 Project Proposal
For the final part of this assignment, you must submit a project proposal for your course project. The proposal should be a maximum of 2 pages (and 1 page or half a page is ok if you can describe your plan concisely). The proposal should be written for me and the TAs, so you don’t need to introduce any ML background but you will need to introduce non-ML topics. The projects must be done in groups of 2-3. If you are doing your assignment in a group that is different from your project group, only 1 group member should include the proposal as part of their submission (we’ll do the merge across assignments, and this means that assignments could have multiple proposals). Please state clearly who is involved with each project proposal.
There is quite a bit of flexibility in terms of the type of project you do, as I believe there are many ways
that people can make valuable contributions to research. However, note that ultimately the project will have
three parts:
1. A very short paper review summarizing the pros and cons of a particular paper on the topic (due with Assignment 4).
2. A short literature review summarizing at least 10 papers on a particular topic (due with Assignment
5).
3. A final report containing at most 6 pages of text (the actual document can be longer due to figures,
tables, references, and proofs) that emphasizes a particular “contribution” (i.e., what doing the project
has added to the world).
The reason for this, even though it’s strange for some possible projects, is that this is the standard way that
results are communicated to the research community.
The three main ingredients of the project proposal are:
1. What problem you are focusing on.
2. What you plan to do.
3. What will be the “contribution”.
Also, note that for the course project, negative results (i.e., we tried something that we thought would work in a particular setting but it didn’t work) are acceptable.
Here are some standard project “templates” that you might want to follow:
• Application bake-off: you pick a specific application (from your research, personal interests, or
maybe from Kaggle) or a small number of related applications, and try out a bunch of techniques (e.g.,
random forests vs. logistic regression vs. generative models). In this case, the contribution would be
showing that some methods work better than others for this specific application (or your contribution
could be that everything works equally well/badly).
• New application: you pick an application where ML methods have not previously been applied, and you test out whether ML methods are effective for the task. In this case, the contribution would be knowing whether ML is suitable for the task.
• Scaling up: you pick a specific machine learning technique, and you try to figure out how to make it
run faster or on larger datasets (for example, how do we apply kernel methods when n is very large).
In this case, the contribution would be the new technique and an evaluation of its performance.
• Improving performance: you pick a specific machine learning technique, and try to extend it in
some way to improve its performance (for example, how can we efficiently use non-linearity within
graphical models). In this case, the contribution would be the new technique and an evaluation of its
performance.
• Generalization to new setting: you pick a specific machine learning technique, and try to extend
it to a new setting (for example, making a graphical-model version of random forests). In this case,
the contribution would be the new technique and an evaluation of its performance.
• Perspective paper: you pick a specific topic in ML, read a larger number of papers on the topic,
then write a report summarizing what has been done on the topic and what are the most promising
directions of future work. In this case, the contribution would be your summary of the relationships
between the existing works, and your insights about where the field is going.
• Coding project: you pick a specific method or set of methods (like independent component analysis),
and build an implementation of them. In this case, the contribution would be the implementation itself.
• Theory: you pick a theoretical topic (like the variance of cross-validation), read what has been
done about it, and try to prove a new result (usually by relaxing existing assumptions or adding new
assumptions). The contribution could be a new analysis of an existing method, or why some approaches
to analyzing the method will not work.
The above are just suggestions, and many projects will mix several of these templates together, but if you
are having trouble getting going then it’s best to stick with one of the above templates. Also note that the
above includes topics not covered in the course (like random forests), so there is flexibility in the topic, but
the topic should be closely-related to ML.
This question is mandatory but will not be formally marked: it’s just a sanity check that you have at least one project idea that fits within the scope of the 540 course project, and it’s an excuse for you to allocate some time to thinking about the project. Also, there is flexibility in the choice of project topics even after the proposal: if you want to explore different topics you can ultimately choose to do a project that is unrelated to the one in your proposal/paper-review/literature-review, although it will likely be easier to do all 4 parts on the same topic.