Homework 2, CSCI 5525


1 Support vector machine implementation using Convex Optimization
1. (a) (15 pts) Implement a linear SVM with slack variables in dual form. See the
separate file optimizers.txt for how to install and call the optimizers. Write
a prediction function that can be used to "predict" an input point, that is,
compute the functional margin of the point. Look in the file svmtest for a
simple skeleton that shows how to read files and plot results. (A sketch of the
dual QP setup appears after part (c) below.)
(b) (15 pts) Run your SVM implementation on the MNIST-13 dataset. The dataset
contains two classes, labeled 1 and 3 in the first column of the csv file we
provided; all other columns are data values. Compute test performance using
10-fold cross-validation on random 80-20 splits. Report your results and explain
what you observe for the following manipulations.
i. Try C = 0.01, 0.1, 1, 10, 100 and show your results, both graphically and by
reporting the number of mistakes on the training and validation data sets.
ii. What is the impact on test performance as C increases?
iii. What happens to the geometric margin 1/||w|| as C increases?
iv. What happens to the number of support vectors as C increases?
(c) (10 pts) Answer the following questions about the dual SVM approach:
i. The value of C will typically change the resulting classifier and therefore
also affect the accuracy on test examples.
ii. The quadratic programming problem involves both w, b and the slack variables ξi.
We can rewrite the optimization problem in terms of w, b alone.
This is done by explicitly solving for the optimal values of the slack variables
ξi = ξi(w, b) as functions of w, b. Then the resulting minimization problem
over w, b can be formally written as
\[
\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i(w, b),
\]
where the first (regularization) term biases our solution towards zero in the
absence of any data, and the remaining terms give rise to the loss functions,
one loss function per training point, encouraging correct classification. The
values of these slack variables, as functions of w, b, are "loss functions".
a) Derive the functions ξi(w, b) that determine the optimal ξi. The equations
should take on a familiar form.
b) Are all the margin constraints satisfied with these expressions for the
slack variables?
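As an illustration of part (a), here is a minimal sketch of how the dual QP with slack
variables could be set up, assuming cvxopt is one of the optimizers described in
optimizers.txt; the function names, tolerances, and bias recovery are assumptions of
this sketch, not a required interface.

import numpy as np
from cvxopt import matrix, solvers

def dual_svm_fit(X, y, C):
    """Soft-margin linear SVM, solved in the dual with a QP solver.

    Dual problem:  min_a  1/2 a^T Q a - 1^T a
                   s.t.   0 <= a_i <= C,   y^T a = 0,
    where Q_ij = y_i y_j x_i^T x_j.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                                   # Q_ij = y_i y_j <x_i, x_j>

    P = matrix(Q)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))  # -a <= 0  and  a <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, n))                     # equality constraint y^T a = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    # Recover w from the dual variables and b from a margin support vector.
    w = (alpha * y) @ X
    on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)
    b_val = np.mean(y[on_margin] - X[on_margin] @ w) if on_margin.any() else 0.0
    return w, b_val, alpha

def predict_margin(X, w, b_val):
    """Functional margin w^T x + b of each input point."""
    return np.asarray(X, dtype=float) @ w + b_val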
1.1 Procedure
Use the code templates provided on Moodle, and follow any additional instructions
posted by the TA.
1.2 Reporting
Code: You will have to submit code for myDualSVM(filename, C) (main file). This
main file has input: (1) a filename (including extension and absolute path) containing
the dataset and (2) the value of the C parameter, and output: (1) the average test
error and standard deviation printed to the terminal (stdout). The function must take
the inputs in this order and display the output via the terminal. The filename will
correspond to a plain text file for a dataset, with each line corresponding to a data
point: the first entry will be the label and the rest of the entries will be feature
values of the data point. The data for the plots can be written to a file tmp in the
same directory, but you do not have to explain how you did the plots from this data,
and can use an iPython notebook to do so. You can submit additional files/functions
(as needed) which will be used by the main file. Put comments in your code so that one
can follow the key parts and steps in your code.
Summary of methods and results: Describe your algorithm, and create plots that
show the number of support vectors vs. C and test performance vs. C, with standard
deviations.
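For part (b) and the reporting interface above, one possible skeleton for the myDualSVM
main file is sketched below, assuming a comma-separated data file with the label in the
first column (as in the provided MNIST-13 csv) and the dual solver sketched after
Problem 1(c); the split logic, label mapping, and helper names are illustrative only.

import sys
import numpy as np

def my_dual_svm(filename, C, n_splits=10, test_frac=0.2, seed=0):
    """Train/evaluate the dual SVM on repeated random 80-20 splits of the dataset."""
    data = np.loadtxt(filename, delimiter=',')
    y = np.where(data[:, 0] == 1, 1.0, -1.0)   # MNIST-13 labels are 1 and 3
    X = data[:, 1:]

    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        n_test = int(test_frac * len(y))
        test, train = perm[:n_test], perm[n_test:]
        w, b_val, _ = dual_svm_fit(X[train], y[train], C)   # dual QP sketch above
        pred = np.sign(predict_margin(X[test], w, b_val))
        errors.append(np.mean(pred != y[test]))

    print("average test error: %.4f, std: %.4f" % (np.mean(errors), np.std(errors)))

if __name__ == "__main__":
    my_dual_svm(sys.argv[1], float(sys.argv[2]))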
2 Support vector machine implementation on the primal problem using Stochastic Gradient Descent
The goal is to evaluate the performance of optimization algorithms for SVMs which work on
the primal. For both of these problems, I EXPECT you to work in groups of two – please name
your teammate and include their name in your submission. The first part of this problem is
based on the following paper: "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM," by
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Please read Section 2 of the paper. The algorithm
discussed in the paper is a (stochastic) mini-batch (projected) sub-gradient descent method
for the SVM non-smooth convex optimization problem. For the experiments, we will
assume b = 0 for the affine form f(x) = w^T x + b (you can see related discussions in
Sections 4 and 5 of the paper).
The second part of the homework is an experimental attempt to use a softened hinge
loss called "soft-plus" to create an analytically computable gradient. Recall that the primal
problem has the form
\[
L(w) = \frac{1}{N} \sum_{i=1}^{N} \max(0,\, 1 - y_i w^T x_i) + \lambda \|w\|^2 .
\]
We can use the softplus function softplus_a(x) = a log(1 + exp(x/a)), for which
max(0, x) = lim_{a→0} softplus_a(x). The softplus function has an analytic derivative, which
will allow deriving a straightforward gradient descent algorithm.
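For reference, the derivative of the softplus itself follows directly from the definition
above (a standard fact, stated here only as a sanity check for your chain-rule derivation):
\[
\frac{d}{dx}\,\mathrm{softplus}_a(x)
= \frac{\exp(x/a)}{1 + \exp(x/a)}
= \frac{1}{1 + \exp(-x/a)},
\]
which approaches the step function (a sub-gradient of max(0, x)) as a → 0.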
For the homework, do the following:
1. (30 points) Implement the Pegasos algorithm and evaluate its performance on the
MNIST-13 dataset. The evaluation will be based on the optimization performance,
and not classification accuracy, so we will use the entire dataset for training
(optimization). For the runtime results, run the code 5 times and report the average
runtime and standard deviation. The runs will be repeated with different choices of
mini-batch size (see below). (A sketch of the mini-batch update appears after item 2.)
2. (30 points) Implement the gradient descent algorithm and evaluate its performance
on the MNIST-13 dataset. You will need to do two things. First, derive the gradient
∇L(w) of the softened objective
\[
L(w) = \frac{1}{N} \sum_{i=1}^{N} a \log\left(1 + \exp\left((1 - y_i w^T x_i)/a\right)\right) + \lambda \|w\|^2
\]
with respect to w using the chain rule. Second, plug your gradient into stochastic
gradient descent code.
Evaluation will be based on the optimization performance, and not classification
accuracy, so we will use the entire dataset for training (optimization). For the
runtime results, run the code 5 times and report the average runtime and standard
deviation. The runs will be repeated with different choices of mini-batch size (see below).
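As a rough illustration of item 1, the following is a minimal sketch of the mini-batch
Pegasos update from Section 2 of the paper, with b = 0 as stated above. The step size
η_t = 1/(λt), the margin-violator set, and the optional projection follow the paper; the
choice of λ, the random seeding, and the per-iteration objective bookkeeping are
assumptions of this sketch.

import numpy as np

def pegasos(X, y, lam, k, n_iters, seed=0):
    """Mini-batch Pegasos (projected stochastic sub-gradient descent) with b = 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    objective = []
    for t in range(1, n_iters + 1):
        batch = rng.choice(n, size=k, replace=False)   # mini-batch A_t of size k (k <= n)
        viol = batch[y[batch] * (X[batch] @ w) < 1]    # A_t^+: points violating the margin
        eta = 1.0 / (lam * t)                          # step size eta_t = 1/(lambda * t)
        w = (1 - eta * lam) * w + (eta / k) * (y[viol][:, None] * X[viol]).sum(axis=0)
        # optional projection onto the ball of radius 1/sqrt(lambda)
        w *= min(1.0, 1.0 / (np.sqrt(lam) * (np.linalg.norm(w) + 1e-12)))
        # track the primal objective L(w) for the required plots
        objective.append(np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + lam * np.dot(w, w))
    return w, objective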
You will have to submit (i) a summary of methods and results report, and (ii) code for
each algorithm:
Summary of methods and results: Pegasos works with a mini-batch A_t of size
k in iteration t. When k = 1, we get (projected) stochastic gradient descent (SGD),
and when k = n, we get (projected) gradient descent (GD), since the entire dataset
is considered in each iteration. For the runs, we will consider the following values of
k: (a) k = 1, (b) k = 20 (1% of data), (c) k = 200 (10% of data), (d) k = 1000 (50%
of data), and (e) k = 2000 (100% of data). For fixed-percent mini-batches, pick the
corresponding percentages from each class, e.g., for 10%, pick 10% of the samples from
each of the two classes. For each choice of k as above, report the average runtime
of Pegasos over 5 runs on the entire dataset, along with the standard deviation.
For each choice of k, plot the primal objective function value for each run against
the number of iterations. Thus, there will be a figure with 5 plots (different
runs) overlaid for each choice of k, and a separate figure for each k. For your
softplus code, match the mini-batch structure of your Pegasos code. For the softplus
runs, use the same values of k: (a) k = 1, (b) k = 20 (1% of data), (c) k = 200 (10% of
data), (d) k = 1000 (50% of data), and (e) k = 2000 (100% of data), again picking
fixed-percent mini-batches as the corresponding percentages from each class. For each
choice of k as above, report the average runtime of softplus over 5 runs on the entire
dataset, along with the standard deviation, and plot the primal objective function value
for each run against the number of iterations. Again, there will be a figure with 5
plots (different runs) overlaid for each choice of k, and a separate figure for each k.
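One possible way to produce the required figure for a single value of k (five runs
overlaid, primal objective value versus iteration), assuming the pegasos sketch above
returns per-iteration objective values; the matplotlib calls and output file names are
illustrative only.

import numpy as np
import matplotlib.pyplot as plt

def plot_runs_for_k(X, y, lam, k, n_iters, numruns=5):
    """Overlay the primal objective curves of numruns runs for one mini-batch size k."""
    plt.figure()
    for run in range(numruns):
        _, obj = pegasos(X, y, lam, k, n_iters, seed=run)   # Pegasos sketch above
        plt.plot(np.arange(1, len(obj) + 1), obj, label="run %d" % (run + 1))
    plt.xlabel("iteration")
    plt.ylabel("primal objective value")
    plt.title("Pegasos, k = %d" % k)
    plt.legend()
    plt.savefig("pegasos_k%d.png" % k)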
Termination condition: Please use k_tot – the total number of gradient computations – as
the termination condition. For this question, set k_tot = 100n, where n is
your data size (number of data points). The algorithm terminates once k_tot is reached.
Code: You will have to submit code for (1) myPegasos(filename, k, numruns) (main
file), and (2) mySoftplus(filename, m, numruns) (main file). The main files have
input: (1) a filename (including extension and absolute path) containing the dataset,
(2) the mini-batch size (k for myPegasos, m for mySoftplus), and (3) the number of runs,
and output: (1) the average runtime and standard deviation printed to the terminal
(stdout). The functions must take the inputs in this order and display the output via
the terminal. The filename will correspond to a plain text file for a dataset, with each
line corresponding to a data point: the first entry will be the label and the rest of the
entries will be feature values of the data point. The data for the plots can be written
to a file tmp in the same directory, but (i) we will not be using this data for doing
plots, and (ii) you do not have to explain how you did the plots from this data. You
can submit additional files/functions (as needed) which will be used by the main file.
Put comments in your code so that one can follow the key parts and steps in your
code.
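A possible skeleton for the myPegasos main file that matches the input order and stdout
output described above. The label mapping, the choice of λ, and reading the k_tot = 100n
budget as 100n per-example gradient evaluations (roughly 100n/k iterations) are
assumptions of this sketch; mySoftplus would follow the same pattern with its own update
rule.

import sys
import time
import numpy as np

def my_pegasos(filename, k, numruns, lam=1e-3):
    """Run mini-batch Pegasos numruns times and print mean/std runtime to stdout."""
    data = np.loadtxt(filename, delimiter=',')
    y = np.where(data[:, 0] == 1, 1.0, -1.0)     # MNIST-13 labels are 1 and 3
    X = data[:, 1:]
    n = X.shape[0]
    n_iters = (100 * n) // k                     # one reading of the k_tot = 100n budget

    runtimes = []
    for run in range(numruns):
        start = time.time()
        pegasos(X, y, lam, k, n_iters, seed=run) # Pegasos sketch above
        runtimes.append(time.time() - start)

    print("average runtime: %.3f s, std: %.3f s" % (np.mean(runtimes), np.std(runtimes)))

if __name__ == "__main__":
    my_pegasos(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))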
Instructions: Follow the rules strictly. If we cannot run your functions, you get 0
points.
Things to submit:
hw2.pdf: The report that contains the solutions to Problems 1, 2, and 3, including
the summary of methods and results.
myDualSVM: Code for Question 1.
myPegasos and mySoftplus: Code for Question 2.
README.txt: A README file that contains your name, student ID, your partner's name
and ID, your email, instructions on how to run your program, any assumptions you're
making, and any other necessary details.
Any other files, except the data, which are necessary for your program.