Homework 4 – Machine Learning CS4342

$30.00

Category: Tags: , , , , You will Instantly receive a download link for .zip solution file upon Payment || To Order Original Work Click Custom Order?

Description

5/5 - (3 votes)

1. Support Vector Machines: Implementation [40 points]: In this problem you will implement a
class called SVM4342 that supports both training and testing of a linear, hard-margin support vector
machine (SVM). In particular, you should flesh out the two methods fit and predict that have the
same API as the other machine learning tools in the sklearn package.
(a) fit: Given a matrix X consisting of n rows (examples) by m columns (features)1 as well as a
vector y ∈ {−1, 1}
n consisting of labels, optimize the hyperplane normal vector w and bias term
b to maximize the margin between the two classes, subject to the constraints that each data point
be on the correct side of the hyperplane (hard-margin SVM). To do so, you should harness an
off-the-shelf quadratic programming (QP) solver from the cvxopt Python package. The template
file already includes code showing how to use it – just pass in the relevant matrices and vectors
as numpy arrays. The trick is in setting up the variables to encode the SVM objective and
constraints correctly, as described in lecture. Then, store the optimized values in self.w and
self.b because you’ll need them later.
(b) predict: Given a matrix X of test examples, and given the pre-trained self.w and self.b,
compute and return the corresponding test labels.
Your code will be evaluated based both on the accuracy of the predicted test labels in several tests
(you can see the exact test code we’ll be running in the template file), as well as the difference between
the hyperplane parameters compared to their ideal values.
Note: as shown in the template, you can calculate the ideal values using the off-the-shelf sklearn SVM
solver. Note that, since the SVM in this package is, by default, a soft-margin SVM, we have to “force”
it to be hard-margin by making the cost penalty C very large.
2. Support Vector Machines: Application [30 points]: Register for the following Kaggle competition about predicting which bank customers will make a transaction:
https://www.kaggle.com/c/santander-customer-transaction-prediction Read the Overview section to learn what the competition is about. Download the training data and associated labels from
the Data tab. Train SVMs using two different kernels: linear and polynomial (I suggest degree 3, but
it’s your choice). Use sklearn’s implementation of the SVC class. Use half the training data (randomly
selected from the entire training set) for training, and use the remaining half to estimate the performance of your classifiers. You are not required to submit your guesses for the test examples; however,
if you want to, you can earn a spot on the competition’s leaderboard. Important note: You should
compute the SVM’s predictions using svm.decision function, not svm.predict. The former gives
w>x + b, i.e., a real number, whereas the latter gives sgn(w>x + b), i.e., a binary number. When
using AUC as the accuracy metric, we typically use real-valued predictions because they have more
information.
Bootstrap aggregation (bagging): In order to train an SVM with a non-linear kernel on a dataset
of this size (the whole training dataset contains 200K examples; hence, your training data will consist
of 100K examples) on an ordinary desktop/laptop computer (i.e., without terabytes of RAM), you will
need to use an approach such as bagging: (a) split the data into multiple subsets; (b) train an SVM
on each subset; and (c) create an ensemble classifier that averages the predictions of the individual
classifiers in the ensemble. Report your estimated accuracy (AUC) using both your SVMs; use the
sklearn.metrics.roc auc score function. You can just include the accuracy as a comment at the
top of your Python file. We might run your code to reproduce this result; hence, make sure you only
use sklearn, pandas, and numpy – no other packages.
1To be consistent with the notation I’ve used in the lecture slides, X in this assignment would actually need to be called
X>. The reason I call it X here is to be consistent with the sklearn package.
1
Submit your Python code in a file called homework4 WPIUSERNAME1.py
(or homework4 WPIUSERNAME1 WPIUSERNAME2.py for teams).
2