COMS W4903: Machine Learning for Data Science Homework 2

Problem 1 (written) – 20 points
In this problem you will derive a naive Bayes classifier that you will implement in Problem 2 below.
Recall that for a labeled set of data $(y_1, x_1), \ldots, (y_n, x_n)$, where for this problem $y \in \{0, 1\}$ and $x$ is a $D$-dimensional vector not necessarily in $\mathbb{R}^D$, the Bayes classifier observes a new $x_0$ and predicts $y_0$ as

$$y_0 = \arg\max_y \; p(y_0 = y \mid \pi) \prod_{d=1}^{D} p_d\!\left(x_{0,d} \mid \theta_y^{(d)}\right).$$
The distribution $p(y_0 = y \mid \pi) = \mathrm{Bernoulli}(y \mid \pi)$. What is “naive” about this classifier is the assumption that all $D$ dimensions of $x$ are independent. Observe that you can pick any distribution $p_d$ you think appropriate for the $d$th dimension. In this problem, assume that $D = 2$, $p_1$ is a Bernoulli distribution, and $p_2$ is a Pareto distribution. That is,

$$p_1\!\left(x_{0,1} \mid \theta_y^{(1)}\right) = \left(\theta_y^{(1)}\right)^{x_{0,1}} \left(1 - \theta_y^{(1)}\right)^{1 - x_{0,1}}, \qquad p_2\!\left(x_{0,2} \mid \theta_y^{(2)}\right) = \theta_y^{(2)} \, (x_{0,2})^{-\left(\theta_y^{(2)} + 1\right)}.$$
The parameter $\theta_y^{(1)} \in [0, 1]$ and $\theta_y^{(2)} > 0$. For the class prior Bernoulli distribution, $\pi \in [0, 1]$. Given some data $(y_1, x_1), \ldots, (y_n, x_n)$, derive the maximum likelihood solution for $\pi$, $\theta_y^{(1)}$ and $\theta_y^{(2)}$, i.e.,

$$\hat{\pi},\ \hat{\theta}_y^{(1)},\ \hat{\theta}_y^{(2)} \;=\; \arg\max_{\pi,\ \theta_y^{(1)},\ \theta_y^{(2)}} \ \sum_{i=1}^{n} \ln p(y_i \mid \pi) \;+\; \sum_{i=1}^{n} \ln p\!\left(x_{i,1} \mid \theta_{y_i}^{(1)}\right) \;+\; \sum_{i=1}^{n} \ln p\!\left(x_{i,2} \mid \theta_{y_i}^{(2)}\right).$$
(To keep the notation compact, the objective above does not write out $\theta$ separately for $y = 1$ and $y = 0$.) Notice that this can be viewed as three independent subproblems. Please separate your derivations as follows:
(a) Derive $\hat{\pi}$ using the objective above.
(b) Derive $\hat{\theta}_y^{(1)}$ using the objective above. Derive this leaving $y$ arbitrary.
(c) Derive $\hat{\theta}_y^{(2)}$ using the objective above. Derive this leaving $y$ arbitrary.
Problem 2 (coding) – 40 points
In this problem you will implement the naive Bayes classifier derived in Problem 1, as well as the k-NN and logistic regression algorithms. The data consists of examples of spam and non-spam emails, of which there are 4508 training examples and 93 testing examples. The feature vector x is a 57-dimensional vector extracted from the email, and y = 1 indicates a spam email. The data has been preprocessed by me such that the first 54 dimensions of each observation are binary and the last three dimensions are positive numbers. As with the first homework, you would ideally use cross-validation on multiple partitions, but I am keeping it simple here with one training and one testing set.
For the naive Bayes classifier, use 54 Bernoulli distributions for the 54 binary dimensions and 3 Pareto
distributions for the last 3 positive dimensions. I chose the Pareto distribution because it models outliers more easily, and the data seems to contain many of them. There are other “heavy-tail” distributions we could have chosen as well. (The reason for the Bernoulli distribution should be clear.)
(a) Implement the naive Bayes classifier described above on the training data and make predictions on the testing data. In a 2 × 2 table, write the number of times that you predicted a class y data point (ground truth) as a class y′ data point (model prediction) in the (y, y′)-th cell of the table, where y and y′ can be either 0 or 1. There should be four values written in the table in your PDF. Next to your table, write the prediction accuracy, i.e., the sum of the diagonal divided by 93.
(b) In one figure, show a stem plot (stem() in Matlab) of the 54 Bernoulli parameters for each class.
Use the file “spambase.names” to make an observation about dimensions 16 and 52.
(c) Implement the k-NN algorithm for k = 1, . . . , 20. Plot the prediction accuracy as a function of k.
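One way part (c) might be organized is sketched below; the ℓ1 distance is an illustrative choice (the text here does not fix a metric), ties are broken toward class 0, and the array names are placeholders for however the data was loaded.

```python
import numpy as np
import matplotlib.pyplot as plt

def knn_accuracies(X_train, y_train, X_test, y_test, k_max=20):
    """Test-set accuracy of k-NN for k = 1, ..., k_max (l1 distance, assumed)."""
    # Distance from every test point to every training point, one test point at a time.
    dists = np.array([np.abs(X_train - xt).sum(axis=1) for xt in X_test])
    order = np.argsort(dists, axis=1)                 # nearest training points first
    accs = []
    for k in range(1, k_max + 1):
        votes = y_train[order[:, :k]].mean(axis=1)    # fraction of spam neighbors
        preds = (votes > 0.5).astype(int)             # ties (0.5) go to class 0 here
        accs.append((preds == y_test).mean())
    return accs

# Example plotting call (placeholder array names):
# accs = knn_accuracies(X_train, y_train, X_test, y_test)
# plt.plot(range(1, 21), accs, marker="o"); plt.xlabel("k"); plt.ylabel("accuracy")
```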
Finally, you will run logistic regression on the same data set. Set every $y_i = 0$ to $y_i = -1$ for this part.
(d) Implement the steepest ascent algorithm given in Lecture 9. Use an iteration-dependent step size $\eta_t$, where $\eta_t = \frac{1}{10^5 \sqrt{t+1}}$. Run your algorithm for 10,000 iterations and plot the logistic regression objective function L on the training data per iteration. (The pattern should look strange.)
(e) Finally, implement a gradient method called “Newton’s method” as follows: At iteration t, set

$$w^{(t+1)} = w^{(t)} - \eta_t \left(\nabla_w^2 L\right)^{-1} \nabla_w L, \qquad \nabla_w^2 L = -\sum_{i=1}^{n} \sigma\!\left(x_i^T w\right)\left(1 - \sigma\!\left(x_i^T w\right)\right) x_i x_i^T \ \leftarrow\ \text{a matrix}.$$

Set $\eta_t = \frac{1}{\sqrt{t+1}}$ for this problem. Plot the objective function L on the training data as a function of $t = 1, \ldots, 100$. Below the plot, give the prediction accuracy on the testing data. We don’t have time to go through Newton’s method, but hopefully this will motivate you to study it more!
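To make the two update rules concrete, here is a minimal NumPy sketch of parts (d) and (e). It assumes the labels have already been mapped to ±1, that any intercept handling (such as appending a constant-1 feature) happens beforehand, and that the zero initialization and array names are placeholder choices rather than conventions from the assignment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def steepest_ascent(X, y, iters=10000):
    """Gradient ascent on L(w) = sum_i ln sigma(y_i x_i^T w), with y_i in {-1, +1}."""
    w = np.zeros(X.shape[1])
    objective = []
    for t in range(iters):
        s = sigmoid(y * (X @ w))
        grad = X.T @ (y * (1.0 - s))                  # gradient of L at the current w
        w += grad / (1e5 * np.sqrt(t + 1))            # eta_t from part (d)
        objective.append(np.log(s).sum())             # training objective per iteration
    return w, objective

def newton_method(X, y, iters=100):
    """Newton updates w <- w - eta_t (Hessian)^{-1} gradient, as written in part (e)."""
    w = np.zeros(X.shape[1])
    objective = []
    for t in range(iters):
        p = sigmoid(X @ w)                            # sigma(x_i^T w)
        s = sigmoid(y * (X @ w))
        grad = X.T @ (y * (1.0 - s))
        hess = -(X.T * (p * (1.0 - p))) @ X           # the d x d matrix from part (e)
        w -= np.linalg.solve(hess, grad) / np.sqrt(t + 1)
        objective.append(np.log(s).sum())
    return w, objective
```

Plotting the recorded objective values against the iteration index gives the two requested figures, and the test-set prediction accuracy follows from thresholding $\sigma(x_i^T w)$ at 1/2, i.e., taking the sign of $x_i^T w$.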