1 Naive Bayes Classification
In class, we saw how we can build a naive Bayes classifier for discrete variables. In this question, you will explore the case when features are not discrete. Suppose instances, represented by $x$, are $d$-dimensional real vectors and the labels, represented by $y$, are either 0 or 1.
Recall from class that the naive Bayes classifier assumes that all features are conditionally independent of each other, given the class label. That is,
\[
p(x \mid y) = \prod_{j} p(x_j \mid y).
\]
Now, each $x_j$ is a real-valued feature. Suppose we assume that these are drawn from a class-specific normal distribution. That is,

1. when $y = 0$, each $x_j$ is drawn from a normal distribution with mean $\mu_{0,j}$ and standard deviation $\sigma_{0,j}$, and

2. when $y = 1$, each $x_j$ is drawn from a normal distribution with mean $\mu_{1,j}$ and standard deviation $\sigma_{1,j}$.
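Written out, this assumption says that the class-conditional density of each feature is the usual normal pdf: for $c \in \{0, 1\}$,
\[
p(x_j \mid y = c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,j}} \exp\!\left(-\frac{(x_j - \mu_{c,j})^2}{2\sigma_{c,j}^2}\right).
\]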
Now, suppose we have a training set $S = \{(x_i, y_i)\}_{i=1}^{m}$ with $m$ examples and we wish to learn the parameters of the classifier, namely the prior $p = P(y = 1)$ and the $\mu$'s and the $\sigma$'s. For brevity, let the symbol $\theta$ denote all these parameters together.
(a) Write down $P(S \mid \theta)$, the likelihood of the data in terms of the parameters. Write down the log-likelihood.
(b) What is the prior probability $p$? You can derive this by taking the derivative of the log-likelihood with respect to the prior and setting it to zero.
(c) By taking the derivative of the log-likelihood with respect to the $\mu_j$'s and the $\sigma_j$'s, derive expressions for the $\mu_j$'s and the $\sigma_j$'s.
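To make the setup concrete, here is a minimal Python sketch (with hypothetical names such as `fit_gaussian_nb`) of fitting this model by maximum likelihood. The closed-form estimates it uses, class frequencies and per-class sample means and standard deviations, are the quantities parts (b) and (c) ask you to derive.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Fit the class-conditional Gaussian naive Bayes model by maximum
    likelihood. X is an (m, d) array of real-valued features, y is an
    (m,) array of 0/1 labels. Returns the prior p = P(y = 1) and, for
    each class c and feature j, the mean mu[c, j] and standard
    deviation sigma[c, j]."""
    m, d = X.shape
    p = y.mean()                      # MLE of the prior P(y = 1)
    mu = np.zeros((2, d))
    sigma = np.zeros((2, d))
    for c in (0, 1):
        Xc = X[y == c]
        mu[c] = Xc.mean(axis=0)       # per-feature sample mean
        sigma[c] = Xc.std(axis=0)     # per-feature sample std (MLE, ddof=0)
    return p, mu, sigma
```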
2 Logistic Regression
We looked at maximum a posteriori learning of the logistic regression classifier in class. In
particular, we showed that learning the classifier is equivalent to the following optimization
problem:
\[
\min_{w} \; \sum_{i=1}^{m} \log\left(1 + \exp\left(-y_i w^T x_i\right)\right) + \frac{1}{\sigma^2} w^T w
\]
In this question, you will derive the stochastic gradient descent algorithm for the logistic
regression classifier.
(a) What is the derivative of the function $\log\left(1 + \exp(-y_i w^T x_i)\right)$ with respect to the weight vector?
(b) The innermost step in the SGD algorithm is the gradient update, where we use a single example instead of the entire dataset to compute the gradient. Write down the objective where the entire dataset is composed of a single example, say $(x_i, y_i)$. Derive the gradient of this objective with respect to the weight vector.
(c) Write down the pseudocode for the stochastic gradient descent algorithm using the gradient from part (b) above.
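For reference, below is a minimal sketch of one such SGD loop, assuming labels $y_i \in \{-1, +1\}$ (as the loss $\log(1 + \exp(-y_i w^T x_i))$ suggests) and hypothetical hyperparameters `lr` and `epochs`. It is one possible answer to part (c), not the only one.

```python
import numpy as np

def sgd_logistic(X, y, sigma2, lr=0.01, epochs=100, seed=0):
    """Stochastic gradient descent for the regularized logistic loss
    above. X is (m, d); y is (m,) with labels in {-1, +1}; sigma2 is
    the prior variance sigma^2."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):   # visit examples in random order
            margin = y[i] * (X[i] @ w)
            # gradient of log(1 + exp(-y_i w^T x_i)) + (1/sigma^2) w^T w
            grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + (2.0 / sigma2) * w
            w -= lr * grad
    return w
```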
3 The EM algorithm
The two local newspapers The Times and The Gazette publish $n$ articles every day. The article lengths in each newspaper are distributed according to the exponential distribution with parameter $\lambda$. That is, for a non-negative integer $x$,
\[
P(\text{wordcount} = x \mid \lambda) = \lambda e^{-\lambda x},
\]
with parameters $\lambda_T$ and $\lambda_G$ for the Times and the Gazette, respectively.
(Note: Technically, using the exponential distribution is not correct here, because the exponential distribution applies to real-valued random variables, whereas here the word counts can only be integers. However, for simplicity, we will use the exponential distribution instead of, say, a Poisson.)
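Concretely, under this model an issue of a given paper is just $n$ i.i.d. draws from that paper's exponential distribution. A minimal sketch, with hypothetical names and real-valued lengths per the simplification above:

```python
import numpy as np

def sample_issue(lam, n, rng=np.random.default_rng()):
    """Draw one issue: n article lengths, i.i.d. Exponential(lam).
    NumPy parameterizes the exponential by its scale 1/lambda."""
    return rng.exponential(scale=1.0 / lam, size=n)
```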
(a) Given an issue of one of the newspapers, $(x_1, \dots, x_n)$, where $x_i$ denotes the length of the $i$th article, what is the most likely value of $\lambda$?
(b) Assume now that you are given a collection of $m$ issues $\{(x_1, \dots, x_n)\}_{1}^{m}$, but you do not know which is a Times issue and which is a Gazette issue. Assume that the probability that an issue is generated from the Times is $\eta$; in other words, the probability that an issue is generated from the Gazette is $1 - \eta$.

Explain the generative model that governs the generation of this data collection. In doing so, name the parameters that are required in order to fully specify the model.
(c) Assume that you are given the parameters of the model described above. How would you use them to cluster the issues into two groups, the Times issues and the Gazette issues?
(d) Given the collection of $m$ issues without labels of which newspaper they came from, derive the update rule of the EM algorithm. Show all of your work.
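For orientation, here is a minimal sketch of one EM loop for this model; the updates it uses are one standard answer to part (d), and the E-step responsibilities are the posteriors you would threshold for the clustering in part (c). The initialization, `n_iters`, and all names are hypothetical choices.

```python
import numpy as np

def em_exponential_mixture(X, n_iters=100, seed=0):
    """EM for the two-newspaper model: X is an (m, n) array of m
    issues, each with n article lengths. Returns estimates of
    (eta, lam_T, lam_G)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    S = X.sum(axis=1)                  # total article length per issue
    eta = 0.5
    lam_T, lam_G = rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5)
    for _ in range(n_iters):
        # E-step: posterior probability that each issue is a Times
        # issue, computed in log space for numerical stability.
        # log P(issue | Times) = n log(lam_T) - lam_T * S
        log_T = np.log(eta) + n * np.log(lam_T) - lam_T * S
        log_G = np.log(1 - eta) + n * np.log(lam_G) - lam_G * S
        r = 1.0 / (1.0 + np.exp(log_G - log_T))
        # M-step: re-estimate the parameters from the soft assignments.
        eta = r.mean()
        lam_T = n * r.sum() / (r * S).sum()
        lam_G = n * (1 - r).sum() / ((1 - r) * S).sum()
    return eta, lam_T, lam_G
```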