CM 146

1 Naïve Bayes over Multinomial Distribution [30 pts]

In this question, we will look into training a naïve Bayes classifier with a model that uses a multinomial distribution to represent documents. Assume that all the documents are written in a language which has only three words a, b, and c. All the documents have exactly n words (each word can be either a, b, or c). We are given a labeled document collection {D1, D2, . . . , Dm}. The label yi of document Di is 1 or 0, indicating whether Di is “good” or “bad”.

This model uses the multinomial distribution in the following way: Given the i-th document Di, we denote by ai (respectively, bi, ci) the number of times that word a (respectively, b, c) appears in Di. Therefore, ai + bi + ci = |Di| = n. We define

Pr(Di | y = 1) = ( n! / (ai! bi! ci!) ) · α1^ai · β1^bi · γ1^ci

where α1 (respectively, β1, γ1) is the probability that word a (respectively, b, c) appears in a “good” document. Therefore, α1 + β1 + γ1 = 1. Similarly,

Pr(Di | y = 0) = ( n! / (ai! bi! ci!) ) · α0^ai · β0^bi · γ0^ci

where α0 (respectively, β0, γ0) is the probability that word a (respectively, b, c) appears in a “bad” document. Therefore, α0 + β0 + γ0 = 1.
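For concreteness, the class-conditional likelihood defined above can be evaluated directly from a document's word counts. The sketch below is our own illustration (the function name and example numbers are not part of the assignment):

```python
from math import factorial

def multinomial_likelihood(a_i, b_i, c_i, alpha, beta, gamma):
    """Pr(D_i | y) for a document with word counts (a_i, b_i, c_i)
    under class parameters (alpha, beta, gamma), alpha + beta + gamma = 1."""
    n = a_i + b_i + c_i
    # Multinomial coefficient n! / (a_i! b_i! c_i!) is always an integer.
    coeff = factorial(n) // (factorial(a_i) * factorial(b_i) * factorial(c_i))
    return coeff * alpha**a_i * beta**b_i * gamma**c_i

# Example: a 4-word document containing a twice, b once, c once,
# under illustrative parameters alpha = 0.5, beta = 0.25, gamma = 0.25.
p = multinomial_likelihood(2, 1, 1, 0.5, 0.25, 0.25)  # 12 * 0.25 * 0.25 * 0.25
```

A quick sanity check on such a model is that the likelihoods of all possible count vectors (ai, bi, ci) with ai + bi + ci = n sum to 1.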

(a) (3 pts) What information do we lose when we represent documents using the aforementioned model?

(b) (7 pts) Write down the expression for the log likelihood of the document Di, log Pr(Di, yi). Assume that the prior probability Pr(yi = 1) is θ.

(c) (20 pts) Derive the expressions for the maximum likelihood estimates of the parameters α1, β1, γ1, α0, β0, and γ0.

Submission note: You need not show the derivation for all six parameters separately. Some parameters are symmetric to others, so once you derive the expression for one, you can directly write down the expressions for the others.

Grading note: 5 points for the derivation of one of the parameters, 3 points each for the remaining five parameter expressions.


2 Hidden Markov Models [15 pts]

Consider a Hidden Markov Model with two hidden states, {1, 2}, and two possible output symbols, {A, B}. The initial state probabilities are

π1 = P(q1 = 1) = 0.49 and π2 = P(q1 = 2) = 0.51,

the state transition probabilities are

q11 = P(qt+1 = 1|qt = 1) = 1 and q12 = P(qt+1 = 1|qt = 2) = 1,

and the output probabilities are

e1(A) = P(Ot = A|qt = 1) = 0.99 and e2(B) = P(Ot = B|qt = 2) = 0.51.

Throughout this problem, make sure to show your work to receive full credit.
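As a numerical sanity check (not a substitute for the written work the problem asks for), the marginal probability of the first output symbol can be computed by summing over the hidden state: P(O1 = o) = Σ_q π_q · e_q(o). The sketch below assumes the unspecified emission entries are recovered by normalization, so e1(B) = 1 − e1(A) and e2(A) = 1 − e2(B); the function name is ours.

```python
def first_symbol_prob(pi, emit, symbol):
    """P(O_1 = symbol) = sum over hidden states q of pi[q] * e_q(symbol)."""
    return sum(pi[q] * emit[q].get(symbol, 0.0) for q in pi)

# HMM parameters in the shape of the problem's; the entries marked
# "filled in" are assumed to follow from each distribution summing to 1.
pi = {1: 0.49, 2: 0.51}
emit = {1: {"A": 0.99, "B": 0.01},   # e_1(B) filled in by normalization
        2: {"A": 0.49, "B": 0.51}}   # e_2(A) filled in by normalization

p_A = first_symbol_prob(pi, emit, "A")
p_B = first_symbol_prob(pi, emit, "B")
```

The two marginals must sum to 1, which gives a cheap check on any hand-derived answer to part (b).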

(a) (5 pts) There are two unspecified transition probabilities and two unspecified output probabilities. What are the missing probabilities, and what are their values?

(b) (5 pts) What is the most frequent output symbol (A or B) to appear in the first position of sequences generated from this HMM?

(c) (5 pts) What is the sequence of three output symbols that has the highest probability of being generated from this HMM?

3 Facial Recognition by using K-Means and K-Medoids [55 pts]

Machine learning techniques have been applied to a variety of image interpretation problems. In this problem, you will investigate facial recognition, which can be treated as a clustering problem (“separate these pictures of Joe and Mary”).

For this problem, we will use a small part of a huge database of faces of famous people (the Labeled Faces in the Wild [LFW] people dataset [1]). The images have already been cropped out of the original image, and scaled and rotated so that the eyes and mouth are roughly in alignment; additionally, we will use a version that is scaled down to a manageable size of 50 by 37 pixels (for a total of 1850 “raw” features). Our dataset has a total of 1867 images of 19 different people. You will apply clustering methods such as k-means and k-medoids to the problem of facial recognition on this dataset.

Download the starter files from the course website. It contains the following source files:

• util.py – Utility methods for manipulating data.

• cluster.py – Code for the Point, Cluster, and ClusterSet classes.

• faces.py – Main code

Please note that you do not necessarily have to follow the skeleton code perfectly. We encourage you to include your own additional methods and functions. However, you are not allowed to use any scikit-learn classes or functions other than those already imported in the skeleton code.

[1] http://vis-www.cs.umass.edu/lfw/


We will explore clustering algorithms in detail by applying them to a toy dataset. In particular, we will investigate k-means and k-medoids (a slight variation on k-means).

(a) (5 pts) In k-means, we attempt to find k cluster centers µj ∈ R^d, j ∈ {1, . . . , k}, and n cluster assignments c^(i) ∈ {1, . . . , k}, i ∈ {1, . . . , n}, such that the total distance between each data point and the nearest cluster center is minimized. In other words, we attempt to find µ1, . . . , µk and c^(1), . . . , c^(n) that minimize

J(c, µ) = Σ_{i=1}^{n} ||x^(i) − µ_{c^(i)}||²

To do so, we iterate between assigning each x^(i) to the nearest cluster center c^(i) and updating each cluster center µj to the average of all points assigned to the j-th cluster.

Instead of holding the number of clusters k fixed, one can think of minimizing the objective function over µ, c, and k. Show that this is a bad idea. Specifically, what is the minimum possible value of J(c, µ, k)? What values of c, µ, and k result in this value?
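The two alternating steps described in part (a) can be sketched on a tiny one-dimensional example. This is illustrative scaffolding only, not the `cluster.py`/`faces.py` skeleton; all names here are ours.

```python
import random

def kmeans_1d(xs, k, iters=20, seed=0):
    """Plain k-means on a list of floats: alternate assignment and update."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)  # random initialization from the data
    for _ in range(iters):
        # Assignment step: send each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[j].append(x)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated blobs: the centers converge to the blob means.
centers = kmeans_1d([0.0, 0.1, 0.2, 9.9, 10.0, 10.1], k=2)
```

On data this well separated, any pair of distinct initial centers converges to the two blob means within a few iterations, which is why a fixed small iteration count suffices here.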

(b) (10 pts) To implement our clustering algorithms, we will use Python classes to help us define three abstract data types: Point, Cluster, and ClusterSet (available in cluster.py). Read through the documentation for these classes. (You will be using these classes later, so make sure you know what functionality each class provides!) Some of the class methods are already implemented, and other methods are described in comments. Implement all of the methods marked TODO in the Cluster and ClusterSet classes.

(c) (20 pts) Next, implement random_init(…) and kMeans(…) based on the provided specifications.

(d) (5 pts) Now test the performance of k-means on a toy dataset.

Use generate_points_2d(…) to generate three clusters, each containing 20 points. (You can modify generate_points_2d(…) to test different inputs while debugging your code, but be sure to return to the initial implementation before creating any plots for submission.) You can plot the clusters for each iteration using the plot_clusters(…) function.

In your writeup, include plots of the k-means cluster assignments and corresponding cluster “centers” for each iteration when using random initialization.

(e) (10 pts) Implement kMedoids(…) based on the provided specification.

Hint: Since k-means and k-medoids are so similar, you may find it useful to refactor your code to use a helper function kAverages(points, k, average, init='random', plot=True), where average is a method that determines how to calculate the average of points in a cluster (so it can take on the values ClusterSet.centroids or ClusterSet.medoids). [2]

As before, include plots of k-medoids clustering for each iteration when using random initialization.
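The hint's pattern of passing a class method as a value (explained in footnote [2]) can be illustrated with a small hypothetical class; none of these names come from the skeleton code.

```python
class Stats:
    """Toy stand-in for ClusterSet: two 'average' methods with one signature."""
    def __init__(self, xs):
        self.xs = xs

    def mean(self):
        return sum(self.xs) / len(self.xs)

    def median(self):
        s = sorted(self.xs)
        return s[len(s) // 2]

def summarize(obj, average):
    # 'average' is an unbound method, e.g. Stats.mean; calling average(obj)
    # passes obj as self, just like kAverages can call average(cluster_set).
    return average(obj)

s = Stats([1.0, 2.0, 9.0])
m1 = summarize(s, Stats.mean)    # 4.0
m2 = summarize(s, Stats.median)  # 2.0
```

This is why a single kAverages helper can serve both k-means and k-medoids: only the average argument changes.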

(f) (5 pts) Finally, we will explore the effect of initialization. Implement cheat_init(…).

Now compare the clusterings obtained when initializing with cheat_init(…). Include plots for k-means and k-medoids for each iteration.

[2] In Python, if you have a function stored in the variable func, you can apply it to parameters arg by calling func(arg). This works even if func is a class method and arg is an object that is an instance of the class.
