CS 4780/5780 Assignment 3: Perceptron and SVM

$30.00

Category: You will Instantly receive a download link for .zip solution file upon Payment || To Order Original Work Click Custom Order?

Description

5/5 - (3 votes)

Problem 1: Perceptron and Margins [35 points]
In this problem we will be working exclusively with a small dataset S in Figure 1. Consider all red (diamond) training examples as negative instances (y = −1) and all blue
(cross) training examples as positive instances (y× = +1). We will explore two variations
on the standard perceptron algorithm you learned in class, and relate the perceptron to
a hard-margin SVM.
(a) In class you learned the perceptron algorithm for finding an unbiased hyperplane that
separates two classes with 0 training error. There is no such hyperplane, however, for
the dataset S in Figure 1. A simple way to incorporate a bias term b is to augment
the weight vector w, such that wbiased = hw, bi, and augment all training examples
such that xi
:= hxi
, 1i. Implement a biased perceptron learning algorithm and train
it on S, following the order of training examples given in Figure 1. Compute wbiased.
Plot the resulting hyperplane. Does the resulting hyperplane maximize the margin
between the two classes? (Note for all perceptron implementations in this problem:
3-1
initialize your weight vector w to the zero vector.)
(b) Sketch a hyperplane that achieves the largest hard margin in S without using an
SVM package. Clearly mark all support vectors in your diagram. What is the
weight vector wopt and bias bopt of the maximum margin hyperplane that achieves a
functional margin of 1 for the support vectors? What is the geometric margin γopt?
Also indicate which dual variables αi are non-zero. Show all work leading you to
your answer.
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
8
9
10
1 2
3
4
5
6
Figure 1: Training set S
(c) All of a sudden you get a strong and unstoppable urge to implement your own SVM.
Unfortunately you have no background in convex optimization, so you decide to try
something very simple instead. Being clever, you think up a way to modify the
biased perceptron algorithm to allow you to ‘tune’ its margin. One simple way to
accomplish this is to update the weight vector whenever a training example is within
a user-specified fraction β ∈ [0, 1) of the maximum margin γ.
yi(wT xi + b)
kwk
< βγ (1)
Set β = 0.5 and run your modified perceptron algorithm until convergence (train
3-2
on the examples in the same order as indicated in Figure 1). Plot the resulting
hyperplane (you found γ in part b). What do you observe?
(d) Rerun your algorithm until convergence for a range of values β ∈{0.1, 0.2, 0.3, 0.4,
0.5, 0.6, 0.7, 0.8, 0.9, 0.95}. Plot (on one graph) a hyperplane for each value of β
(10 values total) overlaying the optimal hyperplane you found in part b.
(e) What do you observe? What do you expect to happen if you rerun this modified
perceptron learning algorithm with different initializations of the w vector, and different ordering of the training examples, for different values of β, as β approaches 1?
You don’t need to perform the experiments, but justify your answer.
In the rest of the problem we will try to prove the mistake bound of this modified
perceptron algorithm with β = 0.5. Each remaining part of the problem is designed to
lead you to the final bound. If you are unable to complete any single part, you can use
the provided result as a given for the parts that follow.
Assume that all datapoints are rescaled into a ball of radius 1, so that R2 = 1 in all of
the bounds you have seen in class. This does not affect the generality of the proofs.
(i) You have already seen in class, that we can bound the squared length of the weight
vector in the standard perceptron on every mistake update as follows:
||wk+1||2 ≤ ||wk||2 + 1
Show that this bound implies the following bound on the length of the weight
vector in the standard perceptron:
||wk+1|| ≤ ||wk|| +
1
2||wk||
(ii) Now let’s modify the bound above to account for a modified perceptron learning
rule specified in part c of this problem. Show that for this modified perceptron
learning algorithm, the following bound on the length of the weight vector holds:
||wk+1|| ≤ ||wk|| +
1
2||wk|| +
γ
2
Hint 1 It may be helpful to recall that ||wk+1||2 = ||wk + yx||2 and that a squared
norm of any vector ||v||2
can be written as v · v. Additionally, the following inequality may be useful: √
1 + x ≤ 1 + x
2
. In contrast to the perceptron update rule
you saw in class, you need to consider how the new update rule (that incorporates
the margin) will affect the upper bound.
3-3
Hint 2 You may also use geometric intuition instead, and show that the growth in
||wk+1|| is also affected by a nonzero projection of yx on w under the new update
rule, and use that to help you derive the upper bound.
(iii) Show that the following simpler bound holds if ||wk|| ≥ 2
γ
:
||wk+1|| ≤ ||wk|| +

4
(iv) Finally show that the total number of updates (mistakes) K for this modified
perceptron algorithm can be bounded by K ≤ 8/γ2
. Use the lower bound on ||wk||
together with the upper bound above to arrive to the desired upper bound on K.
Notice that the lower bound on ||wk|| is the same as for the perceptron algorithm
you analyzed in class.
(v) Using the same approach for obtaining the bound on the number of mistakes for
β = 0.5, what happens to the bound as β approaches 1? You will need to rework the
bound for a general user-specified β in this modified perceptron algorithm. Does
this agree/disagree with your findings in part d?
Problem 2: Multi-class classification with SVMs [35 points]
In this problem, you will use binary SVMs to classify online newsgroup posts into
1 of 4 categories: AUTO, COMPUTERS, RELIGION, SPORTS. The two datafiles
groups2.train and groups2.test have been slightly modified from HW2. All training examples are the same, except that all features were binarized, i.e., were set to 1 if their
value was greater than zero. Again, each of the files contains 2000 examples with feature
size 2000.
Since we have more than two classes, using a single binary SVM is not sufficient. Instead,
we are going to combine several binary SVM classifiers to tackle this problem. Thus, our
goal is to train multiple linear SVM classifiers predicting each class in isolation, and
combine them to properly predict one of four classes.
We train linear SVM classifiers with SV Mlight (http://svmlight.joachims.org). It provides you with both learning and predicting modules with instructions on the website. Please make sure that what you are using is SV Mlight instead of SV Mstruct
,
SV Mmulticlass, or other variations.
To do multi-class classification with SV Mlight, first train four different soft-margin linear
classifiers. Note that you have to replace the target class number to 1 and all others
3-4
to -1 before using the library. For instance, if you learn a classifier for RELIGION, you
use 1,000 articles about politics as positive samples and 3,000 others as negative samples
for training. Once you learned four classifiers, the output label of a test example x is
determined choosing the classifier that yields the greatest margin:
h(x) = argmax
y∈{1,2,3,4}
wy · x + by
where wy is the learned weight vector and by is the biased term for class y. Break ties by
choosing the class with the smaller number. Note also, that svm_classify will output
exactly these values for each test example, so you do not need to compute them by
hand.
a) Randomly partition the training set into 5 equal-sized folds. Then, do 5-fold crossvalidation for C ∈ {0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 100.0}. Use the same C-value
and folds for each binary SVM. Make sure that the only parameter you use with
svm_learn is -c.
If the prediction h(x) is different from the ground-truth class, we count it as an error.
Plot a graph showing both training and validation accuracies (averaged across the 5
folds) together with C in log-scale (i.e., x-axis: log2 C, y-axis: accuracy). Provide also
a table showing these numbers. What value of C gives the best validation accuracy?
b) Now, re-train your four binary SVMs on the complete training set using the optimal
C from a). Report the resulting training and test accuracies.
c) A standard preprocessing technique often used in practice is length normalization.
There, each feature vector x gets rescaled so that kxk2 = 1. Now, normalize all
training and test instances and repeat part a) and b) for the new data. Provide the
same plots and tables.
d) Compare the test error from c) to previous test error from b) without normalization,
and explain why normalization makes a difference (Hint: Think about what happens
to the slack variables.). Why did we have to reestimate the optimal value for C?
Relate your answers to the SVM objective.
Problem 3: Skewed datasets [30 points]
In practice, you will often have to deal with datasets that do not contain an equal number
of positive and negative training examples. These skewed (or unbalanced) datasets can,
for example, occur in clinical studies where patients with a rare form of cancer are the
positive examples and patients with no abnormalities are the negative examples. In such
situations, it is virtually impossible to obtain more positive examples. Another example
is spam classification where users usually receive more ham than spam.
3-5
In this problem, you will investigate various problems with skewed datasets are and learn
a way of mitigating them.
a) Suppose you know for a binary classification problem that P(Y = 1) = 0.07 in both
the test and training set. Describe a simple rule that would give you at least 90%
classification accuracy on the test set. What is the major problem with that rule?
b) You are given two synthetic data files boxes.train and boxes.test, for training and
testing respectively. The training dataset contains 200 examples, 100 positive and
100 negative ones, the test dataset ten times as many samples. The positive examples
stem from a uniform distribution f(x, y) with 0 ≤ x, y ≤ 0.5 and the negative ones
from a uniform distribution g(x, y) with 0.5 < x, y ≤ 1.0.
Let Snneg denote the dataset containing all positive training examples but only the
first nneg negative training examples (in the order they appear in the file). For
example, S30 would contain 130 examples in total, 100 positive examples and the first
30 negative ones.
Train a soft-margin linear classifier with SV Mlight on Snneg for all combinations of
nneg ∈ {100, 50, 10} and C ∈ {1, 1000}. Make sure that the only parameter you set
for svm_learn is C (option -c). For each combination, plot the resulting dataset and
the solution of the SVM, i.e. the optimal hyperplane. Use plus signs and circles as
markers for positive and negative instances respectively.
c) Look at the plots of part b). How does the location of the optimal hyperplane change
for C = 1 as we decrease the number of negative instances? How does it change for
C = 1000? Provide an explanation for both observations. Relate your answers to the
primal optimization objective of soft-margin SVMs.
d) One way of overcoming this issue is using custom weights in the optimization objective:
min
w,b,ξ
1
2
w · w + C · j
X
M
i:yi=1
ξi + C
X
M
i:yi=−1
ξi (2)
subject to ∀k : yk(w · xk + b) ≥ 1 − ξk , ξk ≥ 0 (3)
Suppose we set
j =
# negative training examples
# positive training examples .
Assume that we always have at least one positive training example so that j is welldefined. Give a quantitative argument for why this choice is reasonable given no other
information about the classes.
e) Run SV Mlight on S10 using C = 1 and j ∈ {0.5, 0.1, 0.05} and give plots (same as in
b)) for each value of j.
How does the position of the hyperplane change when j gets decreased? Why does
that happen? Relate your answer to the optimization objective in d).
3-6
f) Now run each of the three models of part e) on boxes.test and report overall accuracy,
as well as the number of false positives and false negatives for each model.
How does the number of false negatives / false positives behave when j gets decreased?
Relate your findings to the plots in e).
3-7