## Description

1. You have two feature vectors $x_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $x_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$ and corresponding labels $d_1 = -1$, $d_2 = 1$. The linear classifier is $\operatorname{sign}\{x_i^T w\}$, where $w = \begin{bmatrix} -1 \\ 0.5 \end{bmatrix}$.

a) Find the squared error loss for this classifier.

b) Find the hinge loss for this classifier.
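   Both losses can be checked numerically. A minimal sketch, assuming the squared-error loss is $\sum_i (d_i - x_i^T w)^2$ and the hinge loss is $\sum_i \max\{0,\, 1 - d_i\, x_i^T w\}$ (summed over the points rather than averaged):

   ```python
   import numpy as np

   # Feature vectors as rows, labels, and the given weight vector.
   X = np.array([[1.0, 1.0],
                 [-1.0, 1.0]])
   d = np.array([-1.0, 1.0])
   w = np.array([-1.0, 0.5])

   scores = X @ w                                    # x_i^T w for each point
   squared_error = np.sum((d - scores) ** 2)         # sum of (d_i - x_i^T w)^2
   hinge = np.sum(np.maximum(0.0, 1.0 - d * scores))
   print(squared_error, hinge)                       # both evaluate to 0.5
   ```

   Note that the second point is classified correctly with margin greater than 1, so it contributes nothing to the hinge loss but still contributes to the squared error.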

2. You have four data points $x_1 = 2$, $x_2 = 1.5$, $x_3 = 1/2$, $x_4 = -1/2$ and corresponding labels $y_1 = 1$, $y_2 = 1$, $y_3 = -1$, $y_4 = -1$.

a) Find a maximum margin linear classifier for this data. Hint: Graph the data.

b) Use squared-error loss to train the classifier (with the help of Python). Does this classifier make any errors?

c) Find a classifier with zero hinge loss. Hint: Use what you’ve learned about hinge loss, not computation. Does this classifier make any errors?

d) Now suppose $x_4 = -5$. Use squared-error loss to find the classifier (with the help of Python). Does this classifier make any errors?

e) Can you still find a classifier with zero hinge loss when $x_4 = -5$? Does it make any errors?
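   For parts (b) and (d), the least-squares fit is straightforward in Python. A minimal sketch, assuming the classifier has the form $\operatorname{sign}\{w_1 x + w_0\}$, so that each scalar data point is augmented with a constant feature before fitting:

   ```python
   import numpy as np

   def lstsq_classifier(x, y):
       """Least-squares weights for a 1-D classifier with a bias feature."""
       X = np.column_stack([x, np.ones_like(x)])  # rows are [x_i, 1]
       w = np.linalg.solve(X.T @ X, X.T @ y)      # w_opt = (X^T X)^{-1} X^T y
       return w, np.sign(X @ w)

   x = np.array([2.0, 1.5, 0.5, -0.5])
   y = np.array([1.0, 1.0, -1.0, -1.0])

   w_b, pred_b = lstsq_classifier(x, y)
   errors_b = int(np.sum(pred_b != y))  # part (b): count of misclassified points

   x[3] = -5.0                          # part (d): move x4 far from the boundary
   w_d, pred_d = lstsq_classifier(x, y)
   errors_d = int(np.sum(pred_d != y))
   ```

   Comparing `errors_b` and `errors_d` shows how moving a single correctly labeled point far from the boundary changes the least-squares solution.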

3. Previously, we examined the performance of classifiers trained using the squared-error loss function (i.e., trained using least squares). This problem uses an off-the-shelf linear Support Vector Machine to train a binary linear classifier.

   The data set is divided into training and test data sets. In order to represent a decision boundary that may not pass through the origin, we can consider the feature vector $x^T = \begin{bmatrix} x_1 & x_2 & 1 \end{bmatrix}$.
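   As a sketch, the augmentation amounts to appending a column of ones (the feature matrix below is hypothetical):

   ```python
   import numpy as np

   # Two hypothetical 2-D feature vectors as rows.
   X = np.array([[0.5, 2.0],
                 [-1.0, 0.3]])

   # Append a constant 1 to each row: the weight learned on that feature
   # acts as a bias, so the boundary need not pass through the origin.
   X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
   ```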

a) Classifier using an off-the-shelf SVM. Code is provided to train a classifier using an off-the-shelf SVM with hinge loss. Run the code to find the linear classifier weights. Next, use the weights to predict the class of the test data. How many classification errors occur?

b) Comment out the code that trains the classifier using the linear SVM, and uncomment the code that trains the classifier using least squares (i.e., $w_{\text{opt}} = (X^T X)^{-1} X^T y$). How many errors occur on the test set?

c) Training a classifier using the squared error as a loss function can fail when correctly labeled data points lie far from the decision boundary. Linear SVMs trained with hinge loss are not susceptible to the same problem. A new dataset consisting of the first dataset, plus 1000 (correctly labeled) data points at $x_1 = 0$, $x_2 = 10$, is created. What happens to the decision boundary when these new data points are included in training the linear SVM?

d) How does this compare with the error rate of the linear classifier trained with the new data points? Why is there such a difference in performance?
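The dataset for this problem is not included here, but the effect described in part (c) can be illustrated on a small hypothetical separable 2-D set: the 1000 correctly labeled points at $(x_1, x_2) = (0, 10)$ drag the least-squares boundary toward them, while a max-margin (hinge-loss) solution would be unaffected by points already classified with margin greater than 1. A sketch:

```python
import numpy as np

def fit_and_count_errors(X, y):
    """Least-squares fit with x^T = [x1, x2, 1]; returns weights and errors."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w = np.linalg.solve(A.T @ A, A.T @ y)   # w_opt = (X^T X)^{-1} X^T y
    return w, int(np.sum(np.sign(A @ w) != y))

# Hypothetical separable data standing in for the original training set.
X0 = np.array([[0.0, 1.0], [1.0, 0.5], [-1.0, 0.5],
               [0.0, -1.0], [1.0, -0.5], [-1.0, -0.5]])
y0 = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w0, err0 = fit_and_count_errors(X0, y0)     # separable, so no errors

# Add 1000 correctly labeled points far from the boundary, at (0, 10).
X1 = np.vstack([X0, np.tile([0.0, 10.0], (1000, 1))])
y1 = np.concatenate([y0, np.ones(1000)])
w1, _ = fit_and_count_errors(X1, y1)

# Errors the new least-squares boundary makes on the ORIGINAL six points.
A0 = np.hstack([X0, np.ones((len(X0), 1))])
err_after = int(np.sum(np.sign(A0 @ w1) != y0))
```

With the extra points, `err_after` becomes nonzero even though the data are still separable: the squared-error objective pays a penalty for large (correct) margins, so the boundary shifts toward the distant cluster. A hinge-loss SVM assigns those points zero loss and leaves the boundary where it is.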