CSE512 – Homework 3 Primal and Dual of Kernel SVM


1.1 Question 1: Primal and Dual of Kernel SVM
Quadratic programs refer to optimization problems in which the objective function is quadratic and the
constraints are linear. Quadratic programs are well studied in optimization literature, and there are many
efficient solvers. Many Machine Learning algorithms are reduced to solving quadratic programs, as you
already saw in the previous assignments. In this question, we will use the quadratic program solver of
Matlab to optimize the dual objective of a kernel SVM.
The primal objective of a kernel SVM without the bias $b$ can be written as
$$\min_{w \in \mathcal{H}} \; \frac{1}{2}\|w\|_2^2 + C \sum_{j=1}^{m} \ell(w, \phi(x_j), y_j) . \tag{1}$$
Here $\ell(w, x_j, y_j)$ is the Hinge loss of the $j$-th instance:
$$\ell(w, x_j, y_j) = \max\left(1 - y_j \langle w, \phi(x_j)\rangle,\, 0\right) .$$
The corresponding dual objective is
$$\max_{\alpha} \; \sum_{j=1}^{m} \alpha_j - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i \alpha_i y_j \alpha_j k(x_i, x_j) \tag{2}$$
$$0 \le \alpha_j \le C \quad \forall j . \tag{3}$$
(a) Write the SVM dual objective as a quadratic program. Look at the quadprog function of Matlab, and
write down what H, f, A, b, Aeq, beq, lb, ub are. (A minimal example of the quadprog interface is given
after part (b).)
(b) From the Lagrangian of the above problem, show that $w = \sum_{i=1}^{m} \alpha_i y_i \phi(x_i)$.
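For reference only (this is not the answer to part (a)), here is a minimal toy illustration of Matlab's
quadprog interface; the numbers below define a generic two-variable QP, not the SVM dual.

% Toy quadratic program: minimize (1/2)*x'*H*x + f'*x
% subject to A*x <= b, Aeq*x = beq, lb <= x <= ub.
H   = [2 0; 0 2];        % quadratic term (symmetric)
f   = [-2; -4];          % linear term
A   = [];  b   = [];     % no inequality constraints in this toy problem
Aeq = [];  beq = [];     % no equality constraints
lb  = [0; 0];            % elementwise lower bounds
ub  = [10; 10];          % elementwise upper bounds
x   = quadprog(H, f, A, b, Aeq, beq, lb, ub);   % returns x = [1; 2]

Your task in part (a) is to identify which of these inputs are nonempty for the dual (2)-(3) and what
they contain.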
1.2 Question 2: Implement a kernel SVM using Quadratic Programming
(a) Using the answer to part (a) in Question 1, use quadratic programming to write a function that solves
the dual SVM objective. In Matlab, you can use the function quadprog. The prototype must be
alpha = train_ksvm_dual(X, y, C, kernel, gamma)
where kernel is ‘gaussian’ for Gaussian kernels and ‘linear’ for linear kernels, and gamma is the
parameter of the Gaussian kernel. (A sketch of a kernel-matrix helper is given after part (d).)
(b) Using the answer to part (b) in Question 1, write a function to test the solution obtained above,
producing a list of predictions from the inputs α, the training data, and the test samples. The
prototype must be
ypredicted = test_ksvm_dual(alpha, Xtr, ytr, Xte, kernel, gamma)
where kernel is ‘gaussian’ for Gaussian kernels and ‘linear’ for linear kernels, and gamma is the
parameter of the Gaussian kernel.
(c) Set C = 10, then train and test your SVM implementation with the linear and Gaussian kernels on the
provided dataset, and report the accuracy, the objective value, and the number of support vectors.
(d) Repeat the above question with C = 0.1.
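Both prototypes above take kernel and gamma, so a shared kernel-matrix helper is convenient. Below is a
minimal sketch of such a helper; the name ker_matrix and the convention k(x, z) = exp(-gamma*||x - z||^2)
for the Gaussian kernel are assumptions, not something prescribed by the assignment.

function K = ker_matrix(X1, X2, kernel, gamma)
% Kernel matrix between the rows of X1 (n1 x d) and X2 (n2 x d).
switch kernel
    case 'linear'
        K = X1 * X2';                                      % k(x, z) = <x, z>
    case 'gaussian'
        % squared Euclidean distances via implicit expansion (R2016b+)
        sq = sum(X1.^2, 2) + sum(X2.^2, 2)' - 2*(X1*X2');
        K  = exp(-gamma * max(sq, 0));                     % clamp tiny negative values
    otherwise
        error('Unknown kernel: %s', kernel);
end
end

With such a helper, the matrix of values k(xi, xj) appearing in (2) is simply ker_matrix(X, X, kernel, gamma).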
1.3 Question 3: Implement Linear SVM using Sub-Gradient Descent
In class we saw that Sub-Gradient Descent can be used to optimize a convex, non-differentiable objective
function. Here, we will use it to optimize the SVM primal objective in (1) for the linear kernel.
We can use sub-gradient descent to optimize this objective. Denote by $L(w)$ the function
$\frac{1}{2}\|w\|_2^2 + C \sum_{j=1}^{m} \ell(w, x_j, y_j)$. The update rule for $w$ at iteration $t$ is
$$w_{t+1} \leftarrow w_t - \eta_t \partial_w L(w_t), \tag{4}$$
where $\partial_w L(w_t)$ denotes the sub-gradient of $L$ w.r.t. $w$ evaluated at $w_t$. The pseudo-code is in Algorithm 1.
Algorithm 1 Sub-gradient descent for linear SVM
  $w_1 = 0$
  for $t = 1, 2, \dots, T$ do
    $\eta_t \leftarrow a / t$
    Update $w$ using Eq. (4)
  end for
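As one concrete, non-authoritative reading of Algorithm 1 for the linear kernel, the sketch below uses a
standard subgradient of $L$, namely $w - C \sum_{j:\, y_j \langle w, x_j\rangle < 1} y_j x_j$; part (a)
below asks you to derive and justify your own choice, which may differ.

function w = train_svm_sgd(X, y, C, a, T)
% Sub-gradient descent for the linear SVM primal (1) without bias.
% X is m x d, y is m x 1 with entries in {-1, +1}.
d = size(X, 2);
w = zeros(d, 1);
for t = 1:T
    eta    = a / t;                              % step size eta_t = a/t
    margin = y .* (X * w);                       % y_j * <w, x_j>
    viol   = margin < 1;                         % instances with nonzero hinge loss
    g      = w - C * (X(viol, :)' * y(viol));    % one subgradient of L at w
    w      = w - eta * g;                        % update rule (4)
end
end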
(a) Write the sub-gradient descent update rule for w for linear SVMs.
(b) Implement Algorithm 1 for linear SVMs. a is a tunable parameter. The prototype must be
[w] = train_svm_sgd(X, y, C, a, T)
(c) The theory in [1] tells us that for this optimization problem the optimal (worst-case) setting of the
learning rate is $\eta_t = 1/t$, that is, setting a = 1, but you are free to choose another learning rate
if you want. Using the provided dataset as your training set, run T = 10000 iterations using C = 10. Be
patient: it might take a while to compute.
(d) Plot the value of the objective function in Eq. (1) after each iteration in a log-log plot, to better
show the behaviour (a plotting snippet is given after part (g)). Compare with the objective value obtained
in Question 2.
(e) Plot the training error after each iteration, on a normal plot, and compare it with the results in
Question 2.
(f) Plot the test error after each iteration, on a normal plot, and compare it with the results in
Question 2.
(g) Change C to 0.1 and repeat what you did in the previous four points.
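As an example of the plots requested in parts (d)-(f), assuming the objective values were recorded in a
vector obj_vals of length T (the variable name is only an illustration):

loglog(1:T, obj_vals);            % log-log plot of the primal objective (1)
xlabel('iteration');
ylabel('objective value');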
1.4 Question 4: Invariance to Additive Constants in Kernels
Consider the soft-margin kernel SVM formulation, which includes the bias term $b$:
$$\min_{w, b} \; \frac{1}{2}\|w\|_2^2 + C \sum_{j=1}^{m} \max\left(1 - y_j \left(\langle w, \phi(x_j)\rangle + b\right),\, 0\right) .$$
Its dual formulation is
$$\max_{\alpha} \; \sum_{j=1}^{m} \alpha_j - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i \alpha_i y_j \alpha_j k(x_i, x_j) \tag{5}$$
$$\sum_{j=1}^{m} y_j \alpha_j = 0 \tag{6}$$
$$0 \le \alpha_j \le C \quad \forall j . \tag{7}$$
In this question we will investigate the effect of adding a constant to the kernel function. In particular,
prove that the optimal values of both objective functions and the optimal solution $w^*$ do not change if we
add a constant $c$ to the kernel, i.e. $k(\cdot, \cdot) \to k(\cdot, \cdot) + c$.
Hint: Start from the dual formulation and see what happens when the kernel changes as described. The
proof is short…
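Following the hint, one possible first step is to expand the double sum in the dual objective (5) with the
shifted kernel:
$$\sum_{i=1}^{m}\sum_{j=1}^{m} y_i \alpha_i y_j \alpha_j \left(k(x_i, x_j) + c\right)
= \sum_{i=1}^{m}\sum_{j=1}^{m} y_i \alpha_i y_j \alpha_j\, k(x_i, x_j)
+ c \left(\sum_{j=1}^{m} y_j \alpha_j\right)^{2} .$$
It remains to see what constraint (6) does to the extra term, and what that implies for the optimal $w^*$.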
2 Kernel Ridge Regression
Consider solving the ridge regression problem with a training set
$$S = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m) \in \mathbb{R}^d \times \mathbb{R}\} .$$
Let $w_\lambda$ be the ridge regression solution with regularization parameter $\lambda$, i.e.
$$w_\lambda = \arg\min_{w \in \mathbb{R}^d} \; \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{m}\|y - Xw\|_2^2 .$$
In class and in HW2, we derived a closed form expression for wλ in terms of the input matrix X and label
vector y, defined as
$$X = \begin{bmatrix} \text{—}\; x_1 \;\text{—} \\ \text{—}\; x_2 \;\text{—} \\ \vdots \\ \text{—}\; x_m \;\text{—} \end{bmatrix} \in \mathbb{R}^{m \times d}, \qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \in \mathbb{R}^{m} .$$
This closed form expression turns out to be
$$w_\lambda = (X^\top X + m\lambda I)^{-1} X^\top y .$$
We will derive an alternative form for $w_\lambda$ without using duality or the Representer Theorem.
2.1 Question 5: Alternative Formulation for the Solution of Ridge Regression
First, note that the above closed form expression comes from solving the optimality equation for $w_\lambda$, i.e.
$$(X^\top X + m\lambda I_d)\, w_\lambda = X^\top y,$$
where $I_d$ is the identity matrix of size $d \times d$. The above equation can be rewritten as
$$m\lambda w_\lambda = X^\top (y - X w_\lambda) . \tag{8}$$
Now, define $\alpha := y - X w_\lambda$.
(a) Using equation (8) and the definition of $\alpha$, show that $\alpha$ satisfies the following equation:
$$\frac{1}{m\lambda} X X^\top \alpha + \alpha = y . \tag{9}$$
(b) Solve equation (9) for $\alpha$ and find an explicit expression of $\alpha$ as a function of $y$, $X$, $m$, and $\lambda$.
(c) Use the expression of $\alpha$ from the previous point to show that the following alternative expression for $w_\lambda$
is valid:
$$w_\lambda = X^\top (X X^\top + m\lambda I_m)^{-1} y . \tag{10}$$
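A quick numerical sanity check of (10) against the original closed form can be done in Matlab on random
data (the variable names are only illustrative):

m = 50; d = 5; lambda = 0.1;
X = randn(m, d);  y = randn(m, 1);
w1 = (X'*X + m*lambda*eye(d)) \ (X'*y);      % original closed form
w2 = X' * ((X*X' + m*lambda*eye(m)) \ y);    % alternative form (10)
norm(w1 - w2)                                % should be tiny (up to round-off)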
2.2 Question 6: Implement Kernel Ridge Regression
Let wλ be the solution of the ridge regression problem with regularization parameter λ when we use a kernel
k corresponding to a feature mapping φ.
(a) Implement a function that finds α for the Kernel Ridge Regression formulation with kernels, using the
expression found in the previous question. The prototype must be
alpha = train_krr(X, y, lambda, kernel, gamma)
where kernel is ‘gaussian’ for Gaussian kernels and ‘linear’ for linear kernels, and gamma is the
parameter of the Gaussian kernel. (A minimal sketch is given after part (d).)
Hint: Observe that the matrix $XX^\top$ can be computed using the kernel function $k$.
(b) Given a matrix of test points $X_{te} \in \mathbb{R}^{m_{te} \times d}$, use equation (10) to implement a Matlab function that
computes the prediction using the ridge regression solution expressed in terms of α. The prototype
must be
ypredicted = test_krr(alpha, Xtr, ytr, Xte, lambda, kernel, gamma)
where kernel is ‘gaussian’ for Gaussian kernels and ‘linear’ for linear kernels, and gamma is the
parameter of the Gaussian kernel.
(c) Train and test your implementations of Kernel Ridge Regression on the provided dataset with
λ = 2e−05, using both the Gaussian and the linear kernel. Report the training and test error.
(d) Repeat the above with λ = 0.002.
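A minimal sketch of part (a), under the assumption that α is taken to be the solution of equation (9) with
$XX^\top$ replaced by the kernel matrix $K$, $K_{ij} = k(x_i, x_j)$, and reusing the hypothetical ker_matrix
helper sketched in Question 2:

function alpha = train_krr(X, y, lambda, kernel, gamma)
% Solve (1/(m*lambda))*K*alpha + alpha = y, i.e. equation (9) with X*X'
% replaced by the kernel matrix K (see the hint in part (a)).
m = size(X, 1);
K = ker_matrix(X, X, kernel, gamma);        % hypothetical helper from Question 2
alpha = (K / (m*lambda) + eye(m)) \ y;
end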
References
[1] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proc.
of ICML, pages 807–814, 2007.