1 Naïve Bayes over Multinomial Distribution [30 pts]
In this question, we will look into training a naïve Bayes classifier with a model that uses a multinomial distribution to represent documents. Assume that all the documents are written in a language which has only three words a, b, and c. All the documents have exactly n words (each word can be either a, b, or c). We are given a labeled document collection {D1, D2, . . . , Dm}. The label yi of document Di is 1 or 0, indicating whether Di is “good” or “bad”.
This model uses the multinomial distribution in the following way: given the i-th document Di, we denote by ai (respectively, bi, ci) the number of times that word a (respectively, b, c) appears in Di. Therefore, ai + bi + ci = |Di| = n. We define

Pr(Di | y = 1) = (n! / (ai! bi! ci!)) · α1^ai · β1^bi · γ1^ci,
where α1 (respectively, β1, γ1) is the probability that word a (respectively, b, c) appears in a “good”
document. Therefore, α1 + β1 + γ1 = 1. Similarly,
Pr(Di | y = 0) = (n! / (ai! bi! ci!)) · α0^ai · β0^bi · γ0^ci,
where α0 (respectively, β0, γ0) is the probability that word a (respectively, b, c) appears in a “bad”
document. Therefore, α0 + β0 + γ0 = 1.
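To make the model concrete, the short Python sketch below (not part of the assignment's starter code) evaluates Pr(Di | y) from the word counts (ai, bi, ci) and a set of class-conditional word probabilities. The counts and probabilities in the example call are made-up illustrative values, not numbers from this problem.

    from math import factorial

    def multinomial_likelihood(a_i, b_i, c_i, alpha, beta, gamma):
        # P(D_i | y) under the word-count model above:
        # (a_i, b_i, c_i) are the counts of words a, b, c in D_i, and
        # (alpha, beta, gamma) are the class-conditional word probabilities
        # (they must sum to 1).
        n = a_i + b_i + c_i
        coeff = factorial(n) // (factorial(a_i) * factorial(b_i) * factorial(c_i))
        return coeff * alpha ** a_i * beta ** b_i * gamma ** c_i

    # Made-up counts and parameters, for illustration only:
    print(multinomial_likelihood(2, 1, 1, 0.5, 0.3, 0.2))  # ≈ 0.18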
(a) (3 pts) What information do we lose when we represent documents using the aforementioned
model?
(b) (7 pts) Write down the expression for the log likelihood of the document Di, i.e., log Pr(Di, yi). Assume that the prior probability Pr(yi = 1) is θ.
(c) (20 pts) Derive the expression for the maximum likelihood estimates for parameters α1, β1,
γ1, α0, β0, and γ0.
Submission note: You need not show the derivation of all six parameters separately. Some
parameters are symmetric to others, and so, once you derive the expression for one, you can
directly write down the expression for others.
Grading note: 5 points for the derivation of one of the parameters, 3 points each for the
remaining five parameter expressions.
2 Hidden Markov Models [15 pts]
Consider a Hidden Markov Model with two hidden states, {1, 2}, and two possible output symbols,
{A, B}. The initial state probabilities are
π1 = P(q1 = 1) = 0.49 and π2 = P(q1 = 2) = 0.51,
the state transition probabilities are
q11 = P(qt+1 = 1|qt = 1) = 1 and q12 = P(qt+1 = 1|qt = 2) = 1,
and the output probabilities are
e1(A) = P(Ot = A|qt = 1) = 0.99 and e2(B) = P(Ot = B|qt = 2) = 0.51.
Throughout this problem, make sure to show your work to receive full credit.
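(Aside: once you have determined the missing probabilities in part (a), you can sanity-check hand calculations with a generic forward-algorithm sketch like the one below, which computes the probability of an output sequence for any fully specified discrete HMM. The numeric parameters shown are placeholders for illustration, not the values from this problem.)

    import numpy as np

    def sequence_probability(pi, A, E, obs):
        # Forward algorithm: P(O_1, ..., O_T) for a discrete HMM.
        #   pi  : (S,)   initial state probabilities
        #   A   : (S, S) transition probabilities, A[i, j] = P(q_{t+1}=j | q_t=i)
        #   E   : (S, V) emission probabilities,   E[i, o] = P(O_t=o | q_t=i)
        #   obs : list of observation indices (e.g. 0 for A, 1 for B)
        alpha = pi * E[:, obs[0]]           # joint P(q_1, O_1)
        for o in obs[1:]:
            alpha = (alpha @ A) * E[:, o]   # joint P(q_t, O_1, ..., O_t)
        return alpha.sum()

    # Placeholder parameters (NOT the values given in this problem):
    pi = np.array([0.5, 0.5])
    A = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    E = np.array([[0.7, 0.3],    # state 1: P(A), P(B)
                  [0.4, 0.6]])   # state 2: P(A), P(B)
    print(sequence_probability(pi, A, E, [0, 1, 0]))  # P(O = A, B, A)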
(a) (5 pts) There are two unspecified transition probabilities and two unspecified output probabilities. What are the missing probabilities, and what are their values?
(b) (5 pts) What is the most frequent output symbol (A or B) to appear in the first position of
sequences generated from this HMM?
(c) (5 pts) What is the sequence of three output symbols that has the highest probability of
being generated from this HMM model?
3 Facial Recognition Using K-Means and K-Medoids [55 pts]
Machine learning techniques have been applied to a variety of image interpretation problems. In
this problem, you will investigate facial recognition, which can be treated as a clustering problem
(“separate these pictures of Joe and Mary”).
For this problem, we will use a small part of a huge database of faces of famous people (Labeled
Faces in the Wild [LFW] people dataset^1). The images have already been cropped out of the original
image, and scaled and rotated so that the eyes and mouth are roughly in alignment; additionally,
we will use a version that is scaled down to a manageable size of 50 by 37 pixels (for a total of 1850
“raw” features). Our dataset has a total of 1867 images of 19 different people. You will explore
clustering methods such as k-means and k-medoids to the problem of facial recognition on this
dataset.
Download the starter files from the course website. It contains the following source files:
• util.py – Utility methods for manipulating data.
• cluster.py – Code for the Point, Cluster, and ClusterSet classes.
• faces.py – Main code
Please note that you do not necessarily have to follow the skeleton code perfectly. We encourage
you to include your own additional methods and functions. However, you are not allowed to use
any scikit-learn classes or functions other than those already imported in the skeleton code.
^1 http://vis-www.cs.umass.edu/lfw/
We will explore clustering algorithms in detail by applying them to a toy dataset. In particular,
we will investigate k-means and k-medoids (a slight variation on k-means).
(a) (5 pts) In k-means, we attempt to find k cluster centers µj ∈ R^d, j ∈ {1, . . . , k}, and n cluster assignments c^(i) ∈ {1, . . . , k}, i ∈ {1, . . . , n}, such that the total distance between each data point and the nearest cluster center is minimized. In other words, we attempt to find µ1, . . . , µk and c^(1), . . . , c^(n) that minimize

J(c, µ) = Σ_{i=1}^{n} ||x^(i) − µ_{c^(i)}||².

To do so, we iterate between assigning x^(i) to the nearest cluster center c^(i) and updating each cluster center µj to the average of all points assigned to the j-th cluster (a generic code sketch of this iteration is given after this part).
Instead of holding the number of clusters k fixed, one can think of minimizing the objective
function over µ, c, and k. Show that this is a bad idea. Specifically, what is the minimum
possible value of J(c, µ, k)? What values of c, µ, and k result in this value?
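For reference, the assignment-and-update iteration described above can be written generically as the following sketch. It operates directly on a NumPy array rather than on the Point/Cluster/ClusterSet classes you are asked to use, so treat it only as a reminder of the algorithm, not as the implementation for part (c); the function and variable names are made up for illustration.

    import numpy as np

    def kmeans_sketch(X, k, n_iters=100, seed=0):
        # Generic k-means on an (n, d) array X; illustration only.
        rng = np.random.default_rng(seed)
        # Random initialization: pick k distinct data points as initial centers.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(n_iters):
            # Assignment step: each point goes to its nearest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # Update step: each center becomes the mean of its assigned points.
            new_centers = centers.copy()
            for j in range(k):
                members = X[assign == j]
                if len(members) > 0:        # keep the old center if a cluster empties
                    new_centers[j] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break                       # converged
            centers = new_centers
        return centers, assign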
(b) (10 pts) To implement our clustering algorithms, we will use Python classes to help us define
three abstract data types: Point, Cluster, and ClusterSet (available in cluster.py). Read
through the documentation for these classes. (You will be using these classes later, so make
sure you know what functionality each class provides!) Some of the class methods are already
implemented, and other methods are described in comments. Implement all of the methods
marked TODO in the Cluster and ClusterSet classes.
(c) (20 pts) Next, implement random_init(…) and kMeans(…) based on the provided specifications.
(d) (5 pts) Now test the performance of k-means on a toy dataset.
Use generate_points_2d(…) to generate three clusters each containing 20 points. (You
can modify generate_points_2d(…) to test different inputs while debugging your code,
but be sure to return to the initial implementation before creating any plots for submission.)
You can plot the clusters for each iteration using the plot_clusters(…) function.
In your writeup, include plots for the k-means cluster assignments and corresponding cluster
“centers” for each iteration when using random initialization.
(e) (10 pts) Implement kMedoids(…) based on the provided specification.
Hint: Since k-means and k-medoids are so similar, you may find it useful to refactor your code
to use a helper function kAverages(points, k, average, init=’random’, plot=True),
where average is a method that determines how to calculate the average of points in a
cluster (so it can take on values ClusterSet.centroids or ClusterSet.medoids).^2
As before, include plots for k-medoids clustering for each iteration when using random initialization.
(f) (5 pts) Finally, we will explore the effect of initialization. Implement cheat_init(…).
Now compare the clusterings you obtain when initializing with cheat_init(…). Include plots for k-means and k-medoids for each iteration.
^2 In Python, if you have a function stored in the variable func, you can apply it to a parameter arg by calling func(arg). This works even if func is a class method and arg is an object that is an instance of the class.
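A tiny illustration of this (with made-up names, not the assignment's classes):

    def shout(text):
        return text.upper()

    def apply_func(func, arg):
        # func can be any callable: a plain function or a method of a class.
        return func(arg)

    print(apply_func(shout, "hello"))      # HELLO
    print(apply_func(str.upper, "hello"))  # HELLO (str.upper applied to a str instance)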