Description
In this assignment, we’ll fit both generative and discriminative models to the MNIST dataset of handwritten
numbers. Each datapoint in the MNIST [http://yann.lecun.com/exdb/mnist/] dataset is a 28×28 black-and-white image of a number in {0 . . . 9}, and a label indicating which number.
MNIST is the ‘fruit fly’ of machine learning – a simple standard problem useful for comparing the properties of
different algorithms. Python code for loading and plotting MNIST is attached.
You can use whichever programming language you like, and libraries for loading and plotting data. You’ll need
to write your own initialization, fitting, and prediction code. You can use automatic differentiation in your code,
but must still answer the gradient questions.
For this assignment, we’ll binarize the dataset, converting the grey pixel values to either black or white (0 or 1)
with > 0.5 being the cutoff. When comparing models, we’ll need a training and test set. Use the first 10000 samples
for training, and another 10000 for testing. This is all done for you in the starter code. Hint: Also build a dataset of
only 100 training samples to use when debugging, to make loading and training faster.
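For reference, the binarization and split described above amount to something like the sketch below; the starter code already does this, and the array names and shapes here are assumptions rather than the starter code’s actual interface.

import numpy as np

def binarize_and_split(images, labels, n_train=10000, n_test=10000):
    """Binarize grey pixel values at the 0.5 cutoff and split into train/test sets."""
    x = (images > 0.5).astype(np.float64)  # pixels become 0 or 1
    train_images, train_labels = x[:n_train], labels[:n_train]
    test_images, test_labels = x[n_train:n_train + n_test], labels[n_train:n_train + n_test]
    return train_images, train_labels, test_images, test_labels

# For debugging, a tiny subset keeps loading and training fast:
# debug_images, debug_labels = train_images[:100], train_labels[:100]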
Problem 1 (Basic Naïve Bayes, 10 points)
In this question, we’ll fit a naïve Bayes model to the MNIST digits using maximum likelihood. Naïve Bayes defines the joint probability of each datapoint x and its class label c as follows:
p(x, c | θ, π) = p(c | π) p(x | c, θ_c) = p(c | π) ∏_{d=1}^{784} p(x_d | c, θ_{cd})    (1)
For binary data, we can use the Bernoulli likelihood:
p(x_d | c, θ_{cd}) = Ber(x_d | θ_{cd}) = θ_{cd}^{x_d} (1 − θ_{cd})^{(1 − x_d)}    (2)
This is just a way of expressing that p(x_d = 1 | c, θ_{cd}) = θ_{cd}.
For p(c|π), we can just use a categorical distribution:
p(c | π) = Cat(c | π) = π_c    (3)
Note that we need ∑_{i=0}^{9} π_i = 1.
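To make the factorization in Equations (1)–(3) concrete, here is a minimal sketch (not part of the starter code) that evaluates the log of the joint probability for a single binarized image; it assumes θ is stored as a 10×784 array of Bernoulli means and π as a length-10 vector.

import numpy as np

def log_joint(x, c, theta, pi, eps=1e-12):
    """log p(x, c | theta, pi) for one binary image x of shape (784,),
    label c, theta of shape (10, 784), and pi of shape (10,)."""
    log_prior = np.log(pi[c])                      # log p(c | pi), Equation (3)
    # Bernoulli log-likelihood summed over the 784 pixels, Equations (1)-(2)
    log_lik = np.sum(x * np.log(theta[c] + eps) + (1 - x) * np.log(1 - theta[c] + eps))
    return log_prior + log_lik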
(a) Derive the maximum likelihood estimate (MLE) for the class-conditional pixel means θ. Hint: We saw in lecture that MLE can be thought of as ‘counts’ for the data, so what should θ̂_{cd} be counting?
(b) Derive the maximum a posteriori (MAP) estimate for the class-conditional pixel means θ, using a Beta(2, 2)
prior on each θ. Hint: it has a simple final form, and you can ignore the Beta normalizing constant.
(c) Fit θ to the training set using the MAP estimator. Plot θ as 10 separate greyscale images, one for each
class.
(d) Derive the predictive log-likelihood log p(c|x, θ, π) for a single training image.
(e) Given parameters fit to the training set, and π_c = 1/10, report both the average predictive log-likelihood per datapoint, (1/N) ∑_{i=1}^{N} log p(c_i | x_i, θ, π), and the predictive accuracy on both the training and test set. The predictive accuracy is defined as the fraction of examples where the true class t = argmax_c p(c | x, θ, π). (A code sketch touching on parts (c) and (e) appears after this list.)
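The following sketch shows one way parts (c) and (e) might fit together. It assumes the MAP estimate from part (b) takes a pseudo-count form under the Beta(2, 2) prior; check it against your own derivation rather than taking it as given.

import numpy as np
from scipy.special import logsumexp

def fit_theta_map(train_images, train_labels, n_classes=10):
    """Class-conditional pixel means under a Beta(2, 2) prior
    (pseudo-count form assumed; verify against part (b))."""
    theta = np.zeros((n_classes, train_images.shape[1]))
    for c in range(n_classes):
        xc = train_images[train_labels == c]
        theta[c] = (xc.sum(axis=0) + 1.0) / (xc.shape[0] + 2.0)
    return theta

def predictive_log_probs(x, theta, pi):
    """log p(c | x, theta, pi) for a batch of binary images x of shape (N, 784)."""
    # log p(x, c | theta, pi) for every class, shape (N, 10)
    log_joint = np.log(pi)[None, :] + x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T
    # normalize over classes to get the predictive log-likelihood
    return log_joint - logsumexp(log_joint, axis=1, keepdims=True)

# Average predictive log-likelihood and accuracy on a labelled set:
# log_probs = predictive_log_probs(images, theta, np.full(10, 1.0 / 10))
# avg_ll = np.mean(log_probs[np.arange(len(labels)), labels])
# accuracy = np.mean(np.argmax(log_probs, axis=1) == labels)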
The takeaway of this question is that we can automatically derive a learning algorithm just by first defining a
joint probability!
Problem 2 (Advanced Naïve Bayes, 10 points)
One of the advantages of generative models is that they can handle missing data, or be used to answer different
sorts of questions about the model.
(a) True or false: Given our model’s assumptions, any two pixels x_i and x_j, where i ≠ j, are independent given c.
(b) True or false: Given our model’s assumptions, any two pixels x_i and x_j, where i ≠ j, are independent when marginalizing over c.
(c) Using the parameters fit in question 1, produce random image samples from the model. That is, randomly sample and plot 10 binary images from the marginal distribution p(x|θ, π). Hint: Use ancestral sampling; a minimal sketch appears after this list.
(d) Derive p(x_bottom | x_top, θ, π), the joint distribution over the bottom half of an image given the top half, conditioned on your fit parameters.
(e) Derive p(x_{i∈bottom} | x_top, θ, π), the marginal distribution of a single pixel in the bottom half of an image given the top half, conditioned on your fit parameters.
(f) For 20 images from the training set, plot the top half of the image concatenated with the marginal distribution over each pixel in the bottom half.
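For part (c), the ancestral-sampling hint amounts to first drawing a class from Cat(π) and then drawing each pixel from its Bernoulli. A minimal sketch, assuming θ and π are the parameters fit in Problem 1:

import numpy as np

def sample_images(theta, pi, n_samples=10, rng=None):
    """Ancestral sampling from p(x | theta, pi): draw c ~ Cat(pi),
    then draw each pixel x_d ~ Ber(theta[c, d])."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.zeros((n_samples, theta.shape[1]))
    for i in range(n_samples):
        c = rng.choice(len(pi), p=pi)                        # sample a class label
        samples[i] = rng.random(theta.shape[1]) < theta[c]   # sample each pixel
    return samples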
Problem 3 (Logistic Regression, 10 points)
Now, we’ll fit a simple predictive model using gradient descent. Our model will be multiclass logistic regression:
p(c | x, w) = exp(w_c^T x) / ∑_{c'=0}^{9} exp(w_{c'}^T x)    (4)
You can ignore biases for this question.
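Equation (4) is a softmax over the ten scores w_c^T x. Computing its logarithm directly can overflow or underflow, which is what the logsumexp hint in part (c) below addresses; a minimal sketch, assuming w is stored as a 10×784 array:

import numpy as np
from scipy.special import logsumexp

def log_predictive(w, x):
    """log p(c | x, w) for all classes (Equation 4), computed stably.
    w has shape (10, 784); x is a batch of images with shape (N, 784)."""
    logits = x @ w.T                                    # scores w_c^T x, shape (N, 10)
    return logits - logsumexp(logits, axis=1, keepdims=True)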
(a) How many parameters does this model have?
(b) Derive the gradient of the predictive log-likelihood w.r.t. w: ∇_w log p(c|x, w)
(c) Code up a gradient-based optimizer of your choosing (it can be just vanilla gradient descent) and use it to optimize w to maximize the log-likelihood of the training set, and plot the resulting parameters using one image per class. Since this objective is concave, you can initialize at all zeros. Using automatic differentiation is permitted, so you can use autograd to get gradients for use by your optimizer, and using minibatches is optional. However, you are not permitted to use optimizers that come built in to packages! Hint: Use scipy.special.logsumexp or its equivalent to make your code more numerically stable. (A sketch of a vanilla gradient-descent loop appears after this list.)
(d) Given parameters fit to the training set, report both the average predictive log-likelihood per datapoint, and the predictive accuracy on both the training and test set. How does it compare to Naïve Bayes?
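A minimal sketch of a hand-rolled optimizer for part (c), using autograd for the gradients as permitted above; the step size and iteration count are arbitrary placeholders, and the log-softmax is written out by hand rather than relying on a library routine.

import autograd.numpy as np
from autograd import grad

def log_softmax(logits):
    # Stable row-wise equivalent of logits - logsumexp(logits).
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

def neg_avg_log_likelihood(w, x, labels_one_hot):
    """Negative average log p(c | x, w) over the training set (Equation 4)."""
    log_probs = log_softmax(np.dot(x, w.T))              # shape (N, 10)
    return -np.mean(np.sum(log_probs * labels_one_hot, axis=1))

def fit_weights(x, labels, n_iters=200, step_size=0.1):
    """Plain gradient descent on the negative log-likelihood; zero
    initialization is fine because the objective is concave in w."""
    labels_one_hot = np.eye(10)[labels]
    w = np.zeros((10, x.shape[1]))
    objective_grad = grad(neg_avg_log_likelihood)
    for _ in range(n_iters):
        w = w - step_size * objective_grad(w, x, labels_one_hot)
    return w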
Problem 4 (Unsupervised Learning, 10 points)
Another advantage of generative models is that they can be trained in an unsupervised or semi-supervised
manner. In this question, we’ll fit the Naïve Bayes model without using labels. Since we don’t observe labels,
we now have a latent variable model. The probability of an image under this model is given by the marginal
likelihood, integrating over c:
p(x | θ, π) = ∑_{c=1}^{K} p(x, c | θ, π) = ∑_{c=1}^{K} p(c | π) ∏_{d=1}^{784} p(x_d | c, θ_{cd}) = ∑_{c=1}^{K} Cat(c | π) ∏_{d=1}^{784} Ber(x_d | θ_{cd})    (5)
It turns out that this gives us a mixture model! This model is sometimes called a “mixture of Bernoullis”,
although it would be clearer to say “mixture of products of Bernoullis”. Again, this is just the same Naïve Bayes model as before, but where we haven’t observed the class labels c. In fact, we are free to choose K, the
number of mixture components.
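The training objective in part (c) below is the log of Equation (5) averaged over images. A minimal sketch of that objective, written with autograd.numpy so it can be differentiated, assuming π_c = 1/K and that θ is stored as a K×784 array (in practice you may want to parameterize θ through a sigmoid so it stays strictly inside (0, 1)):

import autograd.numpy as np

def avg_log_marginal_likelihood(theta, x, K=30):
    """Average log p(x | theta, pi) from Equation (5), with pi_c = 1/K.
    theta has shape (K, 784); x is a batch of binary images, shape (N, 784)."""
    # log p(x, c | theta, pi) for every image and mixture component: shape (N, K)
    log_joint = (np.dot(x, np.log(theta).T)
                 + np.dot(1 - x, np.log(1 - theta).T)
                 - np.log(K))
    # logsumexp over the K components, done stably by hand
    m = np.max(log_joint, axis=1, keepdims=True)
    log_marginal = m[:, 0] + np.log(np.sum(np.exp(log_joint - m), axis=1))
    return np.mean(log_marginal)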
(a) Given K, how many parameters does this model have?
(b) Derive the gradient of the log marginal likelihood with respect to θ: ∇_θ log p(x|θ, π)
(c) For a fixed π_c = 1/K and K = 30, fit θ on the training set using gradient-based optimization. Note: you can’t initialize at all zeros – you need to break symmetry somehow, which is done for you in the starter code. The starter code reduces this problem to correctly coding the optimization objective. Plot the learned θ. How do these cluster means compare to the supervised model?
(d) For 20 images from the training set, plot the top half of the image concatenated with the marginal distribution over each pixel in the bottom half. Hint: You can re-use the formula for p(x_{i∈bottom} | x_top, θ, π) from before. How do these compare with the image completions from the supervised model?