EECS545 Machine Learning Homework #3

$30.00

Category: Tags: , , , , You will Instantly receive a download link for .zip solution file upon Payment || To Order Original Work Click Custom Order?

Description

5/5 - (3 votes)

1 [15 points] MAP estimates and weight decay
Consider using a logistic regression model p(y = 1|x; w) = g(wT x) where g is the sigmoid function, and let a
training set {(x
(i)
, y(i)
);i = 1, …, N} be given as usual. The maximum likelihood estimate of the parameters
w is given by
wML = arg max
w
Y
N
i=1
p(y
(i)
|x
(i)
; w). (1)
If we wanted to regularize logistic regression, then we might put a prior on the parameters. Suppose we
chose the prior w ∼ N (0, τ 2
I) (here, τ > 0, and I is the M + 1-by-M + 1 identity matrix), and then found
the MAP estimate of w as:
wMAP = arg max
w
p(w)
Y
N
i=1
p(y
(i)
|x
(i)
; w). (2)
Prove that
kwMAPk2 ≤ kwMLk2 (3)
2 [25 + 4 points] Direct construction of valid kernels
In class, we saw that by choosing a kernel k(x, z) = φ(x)
T φ(z), we can implicitly map data to a high
dimensional space, and have the SVM algorithm work in that space. One way to generate kernels is to
explicitly define the mapping φ to a higher dimensional space, and then work out the corresponding k.
However in this question we are interested in direct construction of kernels. I.e., suppose we have a
function k(x, z) that we think gives an appropriate similarity measure for our learning problem, and we are
considering plugging k into a kernelized algorithm (like SVM) as the kernel function. However for k(x, z) to
be a valid kernel, it must correspond to an inner product in some higher dimensional space resulting from
some feature mapping φ. Mercer’s theorem tells us that k(x, z) is a (Mercer) kernel if and only if for any
finite set
x
(1)
, . . . , x
(N)

, the matrix K is symmetric and positive semi-definite, where the square matrix
K ∈ R
N×N is given by Kij = k

x
(i)
, x
(j)

Now here comes the question: Let k1, k2 be kernels over R
D × R
D, let a ∈ R
+ be a positive real number,
let f : R
D → R be a real-valued function, let φ : R
D → RM be a function mapping from R
D to RM, let k3
be a kernel over RM × RM, and let p : R → R be a polynomial with positive coefficients.
For each of the functions k below, state whether it is necessarily a kernel. If you think it is, please prove
it; if you think it is not, please give a counterexample.
(a) [3 points] k(x, z) = k1(x, z) + k2(x, z)
(b) [3 points] k(x, z) = k1(x, z) − k2(x, z)
(c) [3 points] k(x, z) = ak1(x, z)
(d) [3 points] k(x, z) = −ak1(x, z)
(e) [4 points] k(x, z) = k1(x, z)k2(x, z)
(f) [3 points] k(x, z) = f(x)f(z)
(g) [3 points] k(x, z) = k3(φ(x), φ(z))
(h) [3 points] k(x, z) = p (k1(x, z))
(i) [4 points extra credit] Prove that the gaussian Kernel, k(x, z) = exp
−kx − zk
2/2σ
2

can be expressed
as φ(x)
T φ(z), where φ(.) is an infinite-dimensional vector.
[Hint: kx − zk
2 = x
T x + z
T
z − 2x
T
z. Consider using Power Series.]
3 [20 points] Kernelizing the perceptron
Let there be a binary classification problem with y ∈ {0, 1}. The perceptron uses hypotheses of the form
h(x; w) = f

wT x

, where f(z) = I[z ≥ 0]. In this problem we will consider a stochastic gradient descentlike implementation of the perceptron algorithm where each update to the parameters w is made using only
one training example. However, unlike stochastic gradient descent, the perceptron algorithm will only make
one pass through the entire training set. The update rule for this version of the perceptron algorithm is
given by
w(i+1) := w(i) + α
h
y
(i+1) − h

x
(i+1); w(i)
i x
(i+1)
where w(i)
is the value of the parameters after the algorithm has seen the first i training examples. Prior to
seeing any training examples, w(0) is initialized to 0.
Let K be a Mercer kernel corresponding to some very high-dimensional feature mapping φ. Suppose φ is so
high-dimensional (say, infinite-dimensional) that it’s infeasible to ever represent φ(x) explicitly. Describe how
you would apply the “kernel trick” to the perceptron to make it work in the high-dimensional feature space
φ, but without ever explicitly computing φ(x). [Note: You don’t have to worry about the intercept term. If
you like, think of φ as having the property that φ0(x) = 1 so that this is taken care of. ] Your description
should specify:
(a) [6 points] How you will represent the high-dimensional parameter vector w(i) using the feature vectors
φ

x
(i)

including how the initial value w(0) = 0 is represented;
(b) [7 points] How you will efficiently make a prediction on a new input x
(i+1)
. I.e., how you will compute
h

φ

x
(i+1)
; w(i)

= f

w(i)T φ

x
(i+1) , using your representation of w(i)
; [Note: In Kernelizing, we
should show that all the occurrences of feature φ(x) are in the form of inner products. And then we can
replace them with a kernel to avoid explicit computation on the features when learning model parameters
as well as making predictions.]
(c) [7 points] How you will modify the update rule given above to perform an update to w on a new training
example
x
(i+1), y(i+1)
; i.e., using the update rule corresponding to the feature mapping φ :
w(i+1) := w(i) + α
h
y
(i+1) − h

φ

x
(i+1)
; w(i)
i φ

x
(i+1)
[Note: If you prefer, you are also welcome to do this problem using the convention of labels y ∈ {−1, 1},
and f(z) = sign(z) = 1 if z ≥ 0, −1 otherw
4 [25 points] Implementing Soft Margin SVM by Optimizing Primal Objective
Support Vector Machines (SVM) is a discriminative model for classification. Although it is possible to develop
SVMs that do K class classifications, we are going to restrict ourselves to binary classification in this question,
where the class label is either +1 (positive) or −1 (negative). SVM is not a probabilistic algorithm. In other
words, in its usual construction, it does not optimize a probability measure as a likelihood. SVM tries to
find the “best” hyperplane that maximally1
separates the positive class from the negative class.
Recall that the objective function for maximizing the soft margin is equivalent to the following minimization problem:
min
w,b
1
2
kwk
2 + C
X
N
i=1
ξi (4)
subject to y
(i)

wT x
(i) + b

≥ 1 − ξi
, ∀i = 1, . . . , N (5)
ξi ≥ 0, ∀i = 1, . . . , N (6)
The above is known as the Primal Objective of SVM. Notice the two constraints on ξi
. It means that ξi must
satisfy both of those conditions, while minimizing the sum of ξi
’s times C. The constrained minimization is
equivalent to following minimization:
min
w,b
1
2
kwk
2 + C
X
N
i=1
max
0, 1 − y
(i)

wT x
(i) + b

You will be working with the above minimization in this problem. However, please think to yourself why
combining the minimization of ξ
0
i
s and the two constraints became a max as shown above. To shorten
notation, lets define our loss function E(w, b) as:
E(w, b) = 1
2
kwk
2 + C
X
N
i=1
max
0, 1 − y
(i)

wT x
(i) + b
(7)
(a) [6 points] Find the derivatives of the loss function with respect to our parameters. Show that:
∇wE(w, b) = w − C
X
N
i=1
I
h
y
(i)

wT x
(i) + b

< 1
i
y
(i)x
(i)
(8)

∂bE(w, b) = −C
X
N
i=1
I
h
y
(i)

wT x
(i) + b

< 1
i
y
(i)
, (9)
where I[.] is the indicator function.
(b) [8 points] Implement the SVM algorithm using batch gradient descent. In previous assignments, you
have implemented gradient descent while minimizing over one variable. Minimizing over two variables
(w, b) is not different. Both gradients are computed from current parameter values, and then parameters are simultanously updated. Note: it is a common mistake to implement gradient descent over
multiple parameters by updating the first parameter, then computing the derivative w.r.t second parameter using the updated value of the first parameter2
. The pseudo code for optimizing w and b is given by:
1maximally means increasing the geometric margin. Review lecture slides if you need a refresher.
2
In fact, updating one parameter then computing the gradient of the second parameter using the updated value of the first
parameter, is a different optimization algorithm, known as Coordinate Descent.
4
Algorithm 1: SVM Batch Gradient Descent
w∗ ← 0;
b
∗ ← 0;
for j = 1 to NumIterations do
wgrad ← ∇wE (w∗
, b∗
);
bgrad ← ∂
∂bE (w∗
, b∗
);
w∗ ← w∗ − α(j)wgrad;
b
∗ ← b
∗ − 0.01α(j)bgrad;
end
return w∗
The learning rate α(.) is now a function of time (iteration number) rather than a constant. This allows
us to define a decaying learning rate as:
α(j) = η0
1 + jη0
Throughout this question, set η0 = 0.5 and the slack cost C = 5. Loading the file q4 data.npy loads the
arrays q4x train, q4y train, q4x test, and q4y test. Run your implementation of SVM Batch Gradient. Descent over the training data 6 times, once for each NumIterations= {5, 50, 100, 1000, 5000, 6000}
For each run, report your trained parameters (w, b) and the test classification accuracy.
(c) [3 points] In Stochastic Gradient Descent, we compute the gradients and update our parameters for
every training example (rather than batch updating). To do stochastic gradient descent properly, we
need to define a loss function per example (the loss function E(w, b), above, is defined over the entire
training set). We can rewrite the loss function above as:
E(w, b) = 1
2
kwk
2 + C
X
N
i=1
max
0, 1 − y
(i)

wT x
(i) + b

=
X
N
i=1

1
2N
kwk
2 + C max
0, 1 − y
(i)

wT x
(i) + b

Expressing the error function as one summation allows us to define an error term per training example
like:
E
(i)
(w, b) = 1
2N
kwk
2 + C max
0, 1 − y
(i)

wT x
(i) + b

then , E(w, b) = X
N
i=1
E
(i)
(w, b)
Find the expressions for ∇wE(i)
(w, b) and ∂
∂bE(i)
(w, b)
(d) [5 points] Implement the SVM algorithm using stochastic gradient descent (SGD). The pseudo-code
5
looks like:
Algorithm 2: SVM Batch Gradient Descent
w∗ ← 0;
b
∗ ← 0;
for j = 1 to NumIterations do
for i = 1 to N do
wgrad ← ∇wE(i)
(w∗
, b∗
);
bgrad ← ∂
∂bE(i)
(w∗
, b∗
);
w∗ ← w∗ − α(j)wgrad;
b
∗ ← b
∗ − 0.01α(j)bgrad;
end
end
return w∗
Use the same α(.), η0, C as part (b). Run your implementation of SVM Stochastic Gradient Descent
over the training data 4 times, once for each NumIterations= {5, 50, 100, 1000, 5000, 6000} For each run,
report your trained parameters (w, b) and the test classification accuracy.
(e) [3 points] What can you conclude about the convergence rate of Stochastic gradient descent (SGD)
versus gradient descent? How did you make this conclusion?
5 [15 points] Scikit-learn SVM for classifying SPAM
Recall the Q4 in HW2. We will repeat it using scikit-learn SVM, and compare Naive-Bayes and SVM.
NOTE: If you are already familiar with the data used for Question 4 in HW2, you can skip the paragraphs
below and start from (a).
In our data, the text emails have been pre-processed so that they can be used for naive Bayes. The preprocessing ensures that only the email body and subject remain in the dataset; email addresses (EMAILADDR),
web addresses (HTTPADDR), currency (DOLLAR) and numbers (NUMBER) were also replaced by the special tokens to allow them to be considered properly in the classification process. (In this problem, we’ll going
to call the features “tokens”.)
We have done the feature extraction works for you, so you can just load the data matrices (called
document-term matrices in text classification) which contain all the data. In a document-term matrix, the
i
th row represents the i
th document/email, and the j
th column represents the j
th distinct token. Thus, the
(i, j) entry of this matrix represents the number of occurrences of the jth token in the ith document.
For this problem, we chose the set of tokens (vocabulary) to only contain the medium frequency tokens,
as the tokens that occur too often or too rarely do not have much classification value. (Examples: tokens
that occur very often are terms like “the,” “and,” and “of,” which occur in any spam and non-spam emails.)
Also, terms were stemmed using a standard stemming algorithm; basically, this means that “price,” “prices”
and “priced” have all been replaced with “price,” so that they can be treated as the same token.
Since the document-term matrix is sparse (has lots of zero entries), we store it in an efficient format to
save space. We provide a starter code q5.ipynb which contains the readMatrix function that reads in the
document-term matrix, the correct class labels for all emails, and the full list of tokens.
(a) [7 points] Implement an SVM classifier with linear kernel for spam classification using a python
package named scikit-learn3
. You can install this package by running the first code cell in the q5.ipynb
file. You should use the LinearSVC class4
in this package to train your parameters with the training
dataset MATRIX.TRAIN. Then, use these parameters to classify the test dataset MATRIX.TEST
3See https://scikit-learn.org/stable/ for more information about the package.
4See https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html for more information about the
LinearSVC class.
6
and report the error using the evaluate function. Implement your code in q5.ipynb.
Note: Make sure to use a linear kernel-SVM; use the LinearSVC class already imported in q5.ipynb.
(b) [4 points] Repeat part (a), but with training sets of size ranging from 50, 100, 200, . . . , up to 1400,
by using the files MATRIX.TRAIN.*. Plot the test error each time (use MATRIX.TEST as the test data) to
obtain a learning curve (test set error vs. training set size). Which training-set size gives the best test
set error?
(c) [4 points] Recall the result you got for Q4 in HW2. How do naive Bayes and Support Vector Machines
compare (in terms of generalization error) as a function of the training set size?
Update log
• (Feb/16th/21:00 Piazza@162): Fixed typos in Q3. t → y and y(x; w) → h(x; w).
• (Feb/17th/20:00 Piazza@162): Fixed typo in Q3 (b). h

x
(i+1); w(i)

→ h

φ

x
(i+1)
; w(i)

.
• (Feb/18th/9:00 Piazza@162): Standardize numIter in Q4 to 6 to avoid possible confusion. Q4(b):
{5, 50, 100, 1000} → {5, 50, 100, 1000, 5000, 6000}.
• (Minor. Feb/19th/19:00): In Q5(c), removed the following sentence since it is already implemented in
the starter code. Note that this is just to avoid possible confusion. Your answer is not supposed to be
changed. ”You may need to change the call to readMatrix…” → deleted.
Credits
Some questions adopted/adapted from http://www.stanford.edu/class/cs229/ps/ps3.pdf and from Bishop
PRML.