Homework 1 571: Probabilistic Machine Learning


1 Perceptron Algorithm and Convergence Analysis (40 points)
1. Perceptrons can be used to represent many Boolean functions $f : \{0, 1\}^n \to \{0, 1\}$ in $n$-dimensional space.
(a) Provide a two-input Boolean function $y = f(x_1, x_2)$, where $y, x_1, x_2 \in \{0, 1\}$, that can be represented by a single perceptron. Plot the points $(x_1, x_2)$ that represent all possible pairs of values (e.g. $(0, 0), (0, 1), (1, 0), (1, 1)$) and provide a separating hyperplane on the same figure.
(b) Provide a two-input Boolean function $y = f(x_1, x_2)$ that cannot be represented by a single perceptron. Briefly explain why.
(c) Provide a three-input Boolean function $y = f(x_1, x_2, x_3)$, where $y, x_1, x_2, x_3 \in \{0, 1\}$, that can be represented by a single perceptron. Plot the points $(x_1, x_2, x_3)$ that represent all possible triples of values and provide a separating hyperplane on the same figure.
2. For a response variable $y \in \{-1, 1\}$ and a linear classification function $f(x) = \beta_0 + \beta^T x$, suppose that we classify according to $\mathrm{sign}(f(x))$. Show that the signed Euclidean distance of the point $x$ with label $y$ to the decision boundary is given by
$$\frac{1}{\|\beta\|_2} \, y f(x).$$
3. Suppose we have $n$ points $x_i \in \mathbb{R}^p$ with class labels $y_i \in \{-1, 1\}$. The points are correctly classified by a hyperplane $w_{sep}^T x = 0$. Assuming $y_i w_{sep}^T x_i \ge 1$ for all $i$ and $\max_i \|x_i\|_2 = 1$, show that the perceptron algorithm converges to a separating hyperplane in no more than
$$\|w^{(0)} - w_{sep}\|_2^2$$
steps.
2 Programming Assignment (60 points)
In this problem set, you will implement two algorithms, Perceptron and Balanced Winnow, in order to classify the handwritten digits 4 and 9 from the MNIST dataset. Each algorithm will have 784 inputs (image pixels represented as a column vector) and one output.
For the MNIST dataset (http://yann.lecun.com/exdb/mnist/), filter out all observations that are not labeled 4 or 9. In order to keep the weights small, make sure the data values are scaled between 0 and 1. If you are using a random number generator, set its seed to 2018. Split the data into training and testing datasets.
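For concreteness, here is one way the preprocessing could look in Python. This is a minimal sketch: the use of scikit-learn's fetch_openml mirror of MNIST and the 75/25 split below are assumptions, not requirements of the assignment.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load MNIST (70,000 images of 784 pixels each); fetch_openml is one
# convenient mirror of the dataset linked above.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Keep only the 4s and 9s and map the labels to {-1, +1}.
mask = (y == "4") | (y == "9")
X, y = X[mask], np.where(y[mask] == "9", 1, -1)

# Scale pixel values from [0, 255] to [0, 1] to keep the weights small.
X = X / 255.0

# Fix the seed at 2018 as required, then split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2018
)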
1. Write a function called perceptron that takes as input a training set $S = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$ for $i = 1, \dots, n$, and a maximum number of epochs $I$. The weights $w$ are initialized as $w = (0, \dots, 0)$. The output of the function should be the weight vector $w$ after convergence to a separating hyperplane or after the maximum number of epochs is reached. Algorithm 1 contains pseudocode for the perceptron algorithm.
(a). Run the function perceptron on the training set and plot the evolution of the accuracy
versus the epoch counter. What do you observe? What will happen if you increase or decrease
the maximum number of epochs?
(b). Plot the evolution of testing dataset accuracy versus the epoch counter (use the same
figure as in part (a)). What do you observe? How does it compare to the previous curve?
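Parts (a) and (b) only require evaluating the weight vector at the end of every epoch. A minimal sketch, assuming a hypothetical list ws holding the weight vector after each epoch and the train/test split from the preprocessing sketch above:

import numpy as np
import matplotlib.pyplot as plt

def accuracy(w, X, y):
    # Fraction of points with y * (w^T x) > 0.
    return np.mean(y * (X @ w) > 0)

# ws is a hypothetical list holding w at the end of each epoch.
train_acc = [accuracy(w, X_train, y_train) for w in ws]
test_acc = [accuracy(w, X_test, y_test) for w in ws]

epochs = range(1, len(ws) + 1)
plt.plot(epochs, train_acc, label="training accuracy")
plt.plot(epochs, test_acc, label="testing accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()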
(c). The confusion matrix is a table that allows one to visualize the performance of a classifier on a testing dataset; it reports the number of false positives, false negatives, true positives, and true negatives. In particular, the entries of a confusion matrix are arranged as follows:

                          Actual yes: y = +1    Actual no: y = −1
Predicted yes: ŷ = +1             TP                    FP
Predicted no:  ŷ = −1             FN                    TN
Report the accuracy and confusion matrix of the perceptron algorithm on the testing set
after the last epoch.
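A minimal sketch of how the four counts can be computed by hand (assuming NumPy arrays with labels in {−1, +1}):

import numpy as np

def confusion_counts(y_true, y_pred):
    # Entries of the confusion matrix for labels in {-1, +1}.
    tp = np.sum((y_true == 1) & (y_pred == 1))    # true positives
    fp = np.sum((y_true == -1) & (y_pred == 1))   # false positives
    fn = np.sum((y_true == 1) & (y_pred == -1))   # false negatives
    tn = np.sum((y_true == -1) & (y_pred == -1))  # true negatives
    return tp, fp, fn, tn

# Accuracy follows directly: (tp + tn) / (tp + fp + fn + tn).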
(d). Start running the perceptron algorithm and stop after one third of the dataset has been processed. Save the current weight vector $w^0$. Continue running the algorithm until convergence or until the maximum number of epochs is reached, and save the weight vector $w^*$. Use $w^0$ and $w^*$ to create two ROC curves on the training data and display them on the same figure. Based on the ROC curves, provide an argument to support the claim that the weight vector $w^*$ leads to a better decision boundary.
You may create the ROC curve by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings $b$, where
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}.$$
To compute the $(FPR, TPR)$ pairs you should modify the classification rule. In particular, an algorithm with weight vector $w$ and threshold parameter $b$ classifies a data point $(x, y)$ correctly if $y(w^T x - b) > 0$.
Please do not use built-in functions for the ROC computations.
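A sketch of how the $(FPR, TPR)$ pairs could be computed with the modified rule above (the number of thresholds, 200, is an arbitrary choice):

import numpy as np

def roc_points(w, X, y, n_thresholds=200):
    # Sweep the threshold b over the range of raw scores w^T x and
    # record one (FPR, TPR) pair per threshold; a point is predicted
    # positive when w^T x - b > 0, matching the modified rule above.
    scores = X @ w
    points = []
    for b in np.linspace(scores.min(), scores.max(), n_thresholds):
        y_pred = np.where(scores - b > 0, 1, -1)
        tp = np.sum((y == 1) & (y_pred == 1))
        fp = np.sum((y == -1) & (y_pred == 1))
        fn = np.sum((y == 1) & (y_pred == -1))
        tn = np.sum((y == -1) & (y_pred == -1))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return sorted(points)  # sorted by FPR for plotting and AUC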
(e). Use any approximation technique (e.g. a Riemann sum or the trapezoidal rule) to approximate the AUC. Report the AUC values for $w^*$ and $w^0$. How do the values you obtain correspond to the ROC curves in part (d)?
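With the $(FPR, TPR)$ pairs from the previous sketch, the trapezoidal rule mentioned in the prompt amounts to summing trapezoid areas between consecutive FPR values:

import numpy as np

def auc_trapezoid(points):
    # points: (FPR, TPR) pairs sorted by FPR, e.g. from roc_points above.
    fpr = np.array([p[0] for p in points])
    tpr = np.array([p[1] for p in points])
    # Trapezoidal rule: each strip's area is its width times mean height.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))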
2. Write a function called Balanced Winnow that takes as input a training set $S = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$ for $i = 1, \dots, n$, and a maximum number of epochs $I$. The weights $w = [w_p, w_n]$ are initialized as $w = (\frac{1}{2p}, \dots, \frac{1}{2p})$. The output of the function should be the weight vector $w$ after convergence to a separating hyperplane or after the maximum number of epochs is reached. Algorithm 2 contains pseudocode for the Balanced Winnow algorithm.

Algorithm 1 Perceptron
Input: $\{(x_i, y_i)\}_{i=1}^{n}$
Initialize: $w = (0, \dots, 0)$, $I = 100$
for e = 1 to I do
    for i = 1 to n do
        if $y_i w^T x_i \le 0$ then
            $w = w + y_i x_i$
        end if
    end for
end for
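A direct translation of Algorithm 1 into Python might look as follows (a minimal sketch; the early stop when an epoch produces no mistakes implements the convergence condition described above):

import numpy as np

def perceptron(X, y, max_epochs=100):
    # X: (n, d) data matrix, y: labels in {-1, +1},
    # max_epochs: the epoch budget I from Algorithm 1.
    n, d = X.shape
    w = np.zeros(d)  # w = (0, ..., 0)
    for epoch in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:  # misclassified (or on the boundary)
                w = w + y[i] * X[i]     # perceptron update
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: w separates the data
            break
    return w

For parts (a)-(d) you would additionally record the weight vector at the end of each epoch rather than only returning the final one.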
Algorithm 2 Balanced Winnow
Input: $\{(x_i, y_i)\}_{i=1}^{n}$
Initialize: $w_p = (\frac{1}{2p}, \dots, \frac{1}{2p})$, $w_n = (\frac{1}{2p}, \dots, \frac{1}{2p})$, $I$
for e = 1 to I do
    for i = 1 to n do
        if $y_i (w_p^T x_i - w_n^T x_i) \le 0$ then
            $w_p = w_p \, e^{\eta y_i x_i}$ (elementwise product)
            $w_n = w_n \, e^{-\eta y_i x_i}$ (elementwise product)
            $s = \sum_j (w_n^j + w_p^j)$
            $w_p = w_p / s$
            $w_n = w_n / s$
        end if
    end for
end for
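Similarly, Algorithm 2 translates almost line for line (a minimal sketch; eta is left as a parameter since part (b) below asks you to tune it):

import numpy as np

def balanced_winnow(X, y, eta=0.1, max_epochs=100):
    # X: (n, p) data matrix scaled to [0, 1], y: labels in {-1, +1},
    # eta: learning rate, max_epochs: the epoch budget I from Algorithm 2.
    n, p = X.shape
    w_pos = np.full(p, 1.0 / (2 * p))  # w_p = (1/2p, ..., 1/2p)
    w_neg = np.full(p, 1.0 / (2 * p))  # w_n = (1/2p, ..., 1/2p)
    for epoch in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # Mistake-driven update; prediction uses w_p^T x - w_n^T x.
            if y[i] * (w_pos @ X[i] - w_neg @ X[i]) <= 0:
                mistakes += 1
                w_pos = w_pos * np.exp(eta * y[i] * X[i])   # elementwise promotion
                w_neg = w_neg * np.exp(-eta * y[i] * X[i])  # elementwise demotion
                s = np.sum(w_pos + w_neg)                   # renormalization constant
                w_pos, w_neg = w_pos / s, w_neg / s
        if mistakes == 0:  # converged to a separating hyperplane
            break
    return w_pos, w_neg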
(a). Run the function Balanced Winnow on the training set and plot the evolution of the
accuracy versus the epoch counter. Report the accuracy and confusion matrix on the test set.
(b). Based on your experiments, is there a value of the parameter η that is better than the
others? What technique would you use to tune the value of the parameter η?