Description

5/5 - (4 votes)

1 Question 1 – Ridge Regression and LOOCV (20 points)
In class, you learned about using cross validation as a way to estimate the true error of a learning algorithm.
The preferred solution is Leave-One-Out Cross Validation (LOOCV), which provides an almost unbiased
estimate of this true error, but it can take a really long time to compute. In this problem, you will derive a
formula for efficiently computing the LOOCV error for ridge regression.
Given a set of n data points and associated labels {xi
, yi
|xi ∈ <k
, yi ∈ <}n
i=1. Ridge regression find the
weight vector w and a bias term b to optimize the following:
minimize
w,b
λ||w||2 +
Xn
i=1
(wT xi + b − yi)
2
. (1)
1.1 (4 points)
Let w = [w; b], X = [X; 1
T
n
],
¯I = [Ik, 0k; 0
T
k
, 0], C = XX
T
+ λ¯I, and d = Xy. Show that the solution of
Ridge regression is:
w = C−1d (2)
1.2 (3 points)
Now suppose we remove xi from the training data, let C(i)
, d(i)
, w(i) be the corresponding matrices for
removing xi
. Express C(i)
in terms of C and xi
. Express d(i)
in terms of d and xi
.
1.3 (4 points)
Express C
−1
(i)
in terms of C−1
and xi
. Hint: use the Sherman-Morrison formula:
(A + uvT
)
−1 = A−1 −
A−1uvT A−1
1 + vT A−1u
(3)
1.4 (3 points)
Show that
w(i) = w + (C−1xi)
−yi + x
T
i w
1 − x
T
i C−1xi
(4)
1.5 (3 points)
Show that the leave-one-out error for removing the i
th training data is:
wT
(i)xi − yi =
wT xi − yi
1 − x
T
i C−1xi
(5)
1
1.6 (3 points)
The LOOCV is defined as: Pn
i=1(wT
(i)
xi − yi)
2
. What is the algorithmic complexity of computing LOOCV
error using the formula given in Question 1.5? How is it compared with the usual way of computing
LOOCV? Note that the complexity of inverting a k × k matrix is O(k
3
).
(XX
T
+ λ¯I)w = Xdiag(s)y (6)
2 Question 2 – Naive Bayes and Logisitic Regression (20 points)
2.1 Naive Bayes with both continuous and boolean variables (10 points)
Consider learning a function X → Y where Y is boolean, where X = (X1, X2), and where X1 is a boolean
variable and X2 a continuous variable. State the parameters that must be estimated to define a Naive Bayes
classifier in this case. Give the formula for computing P(Y |X), in terms of these parameters and the feature
values X1 and X2.
2.2 Naive Bayes and Logistic Regression with Boolean variables (10 points)
In class, we showed that when Y is Boolean and X = (X1, · · · , Xd) is a vector of continuous variables, the
the assumptions of the Gaussian Naive Bayes classifier imply that P(Y |X) is given by the logistic function
with appropriate parameters θ. In particular:
P(Y = 1|X) = 1
1 + exp(−(
Pd
i=1 θiXi + θd+1))
(7)
Consider instead the case where Y is Boolean and X = (X1, · · · , Xd) is a vector of Boolean variables.
Prove for this case also that P(Y |X) follows this same form (and hence that Logistic Regression is also the
discriminative counterpart to a Naive Bayes generative classifier over Boolean features).
3 Question 3 – Implementation of SVMs (40 points + 10 bonus)
In this problem, you will implement SVMs using two different optimization techniques:(1) quadratic programming and (2) stochastic gradient descent.
3.1 Implement Kernel SVM using Quadratic Programming (15 points)
Quadratic programs refer to optimization problems in which the objective function is quadratic and the
constraints are linear. Quadratic programs are well studied in optimization literature, and there are efficient
solvers. Many Machine Learning algorithms are reduced to solving quadratic programs. In this question, we
will use the quadratic program solver of Matlab to optimize the dual objective of a kernel SVM.
The dual objective of kernel SVM can be written as:
maximize
α
Xn
j=1
αj −
1
2
Xn
i=1
Xn
j=1
yiαiyjαjk(xi
, xj ) (8)
s.t. Xn
j=1
yjαj = 0 (9)
0 ≤ αj ≤ C ∀j. (10)
1. (5 points) Write the SVM dual objective as a quadratic program. Look at the quadprog function of
Matlab, and write down what H,f, A, b, Aeq, beq, lb, ub are.
2. Use quadratic programming to optimize the dual SVM objective. In Matlab, you can use the function
quadprog.
2
3. Write a program to compute w and b of the primal from α of the dual. You only need to do this for
linear kernel.
4. (5 points) Set C = 0.1, train an SVM with linear kernel using trD, trLb in q3 1 data.mat (in
Matlab, load the data using load q3 1 data.mat). Test the obtained SVM on valD, valLb,
and report the accuracy, the objective value of SVM, the number of support vectors, and the confusion
matrix.
5. (5 points) Repeat the above question with C = 10.
3.2 Implement Multiclass SVM using Stochastic Gradient Descent (25 points + 10 bonus
points)
In this question, you will implement mutliclass SVM with Stochastic Gradient Descent. We will consider the
Crammer and Singer’s SVM formulation [1]. Note that there are several different formulations for multiclass
SVM (see https://www.csie.ntu.edu.tw/˜cjlin/papers/multisvm.pdf).
Consider a multiclass classification problems with k classes. We want to train k weight vectors w1, · · · , wk,
one for each class. wi ∈ <d
, where d is the dimension of the data. Let W denote the matrix of all weight
vectors [w1, · · · , wk]. Suppose we have n training data instances {(xi
, yi)|xi ∈ <d
, yi ∈ {1, · · · , k}}.
Crammer and Singer’s formulation seeks to optimize:
minimize
W
1
2
X
k
j=1
||wj ||2 + C
Xn
i=1
L(W, xi
, yi) (11)
Here L(W, xi
, yi) is the Multiclass Hinge loss of the i-th instance:
L(W, xi
, yi) = max{wT
yˆi
xi − wT
yi
xi + 1, 0} where yˆi = argmax
j6=yi
wT
j xi
. (12)
By distributing the regularization term to each training instance, we obtain the following equivalent
objective:
minimize
W
Xn
i=1


1
2n
X
k
j=1
||wj ||2 + CL(W, xi
, yi)

 (13)
Let Li =
1
2n
Pk
j=1 ||wj ||2 + CL(W, xi
, yi). We can use stochastic gradient descent to optimize this
objective. The update rule for wj with the i
th training instance will be:
wnew
j ← wcur
j − η∂wjLi ∀j (14)
where ∂wjLi denote the sub-gradient of Li w.r.t. wj .
Algorithm 1 Stochastic gradient descent for multiclass SVM
for epoch = 1, 2, · · · , max epoch do
η ← η0/(η1 + epoch) . Update the learning rate
(i1, · · · , in) = permute(1, · · · , n). . Shuffle the indexes of training data
for i ∈ {1, 2, · · · , n} do
Update W using Eq. (14)
end for
end for
3
1. (2 points) What is the subgradient of Li wrt to wyi
?
2. (2 points) What is the subgradient of Li wrt to wyˆi
?
3. (1 point) What is the subgradient of Li wrt to wj for j 6= yi and j 6= ˆyi
.
4. Implement SGD for multiclass SVM given in Algorithm 1. η0, η1 are tunable parameters. Initially
start all the weights at 0.
5. (5 points) Using trD, trLb in q3 1 data.mat as your training set, run 2000 epochs over the
dataset using η0 = 1, η1 = 100, C = 0.1 and C = 10. Plot the loss in Eq. (13) after each epoch.
Compare with the objective value obtained in 3.1.4.
6. (5 points) Using the W learned after 2000 epochs, report:
(a) The prediction error on valD, valLb in q3 1 data.mat (test error)
(b) The prediction error on trD, trLb (training error)
(c) Pk
j=1 ||wj ||2
7. (10 points + 10 Bonus) For this question, you will use the previous multiclass implementation (Q 3.2)
or multiple binary kernel SVMs (Q 3.1) to do activity recognition on the UCF101 data (see http://
crcv.ucf.edu/data/UCF101.php). Originally, this data has 101 classes but for this homework
you will be using just 10 classes of data to train your multiclass SVM classifier and compete in an inclass Kaggle competition:
https://www.kaggle.com/c/hw2-activity-recognition-cse512-spr18/.
Training data are provided in q3 2 data.mat. Use trD, trLb for training your SVM classifier.
Validate your obtained SVM on valD, valLb, then provide the prediction for tstD in a .csv file.
You can download the data from:
https://www.kaggle.com/c/hw2-activity-recognition-cse512-spr18/data
We have already computed feature vectors for you. Each feature vector has 4096 features. For reference, we also provide the jpeg images from which the feature vectors were extracted, but you are not
required to use them. The training and validation labels are correspondence to trLb and valLb in
q3 2 data.mat. Play around with parameters, epochs to achieve a good score. You’re not allowed
to use any other classifiers for this submission. Report the best accuracy and the parameters you used
to achieve that. Also report a plot of the training loss after each epoch.
We will maintain a leader board, and the top three entries at the end of the competition (assignment
due date) will receive 10 bonus points. Any submission that rises to top three after the assignment
deadline is not eligible for bonus points. The ranking will be based on the Categorization accuracy
(percentage of correct label).
To prevent exploiting test data, you are allowed to make a maximum of 2 submissions per 24 hours.
Your submission will be evaluated immediately and the leader board will be updated.
4 Question 4 – SVM for object detection (20 points + 10 bonus points)
In this question, you will train a SVM and use it for detecting human upper bodies in your favorite TV series
The Big Bang Theory. You must use your SVM implementation in either Question 3.1 or 3.2.
To detect human upper bodies in images, we need a classifier that can distinguish between upper-body
image patches from non-upper-body patches. To train such a classifier, we can use SVMs. The training
4
data is typically a set of images with bounding boxes of the upper bodies. Positive training examples are
image patches extracted at the annotated locations. A negative training example can be any image patch that
does not significantly overlap with the annotated upper bodies. Thus there potentially many more negative
training examples than positive training examples. Due to memory limitation, it will not be possible to use
all negative training examples at the same time. In this question, you will implement hard-negative mining
to find hardest negative examples and iteratively train an SVM.
4.1 Data
Training images are provided in the subdirectory trainIms. The annotated locations of the upper bodies
are given in trainAnno.mat. This file contains a cell structure ubAnno; ubAnno{i} is the annotated
locations of the upper bodies in the i
th image. ubAnno{i} is 4×k matrix, where each column corresponds
to an upper body. The rows encode the left, top, right, bottom coordinates of the upper bodies (the origin of
the image coordinate is at the top left corner).
Images for validation and test are given in valIms, testIms respectively. The annotation file for
test images is not released. We have also extracted some image regions of test images, and the regions are
saved as 64×64 jpeg images in testRegs. Only small portion of these images correspond to upper bodies.
4.2 External library
Raw image intensity values are not robust features for classification. In this question, we will use Histogram
of Oriented Gradient (HOG) as image features. HOG uses the gradient information instead of intensities,
and this is more robust to changes in color and illumination conditions. See [2] for more information about
HOG, but it is not required for this assignment.
To use HOG, you will need to install an VL FEAT: http://www.vlfeat.org. This is an excellent
cross-platform library for computer vision and machine learning. However, in this homework, you are only
allowed to use the HOG calculation and visualization function vl hog. In fact, you should not call vl hog
directly. Use the supplied helper functions instead; they will call vl hog.
4.3 Helper functions
To help you, a number of utility functions and classes are provided. The most important functions are in
HW2 Utils.m.
1. Run HW2 Utils.demo1 to see how to read and display upper body annotation
2. Run HW2 Utils.demo2 to display image patches and HOG feature images. Compare HOG features
for positive and negative examples, can you see why HOG would be useful for detect upper bodies?
3. Use HW2 Utils.getPosAndRandomNeg() to get initial training and validation data. Positive
instances are HOG features extracted at the locations of upper bodies. Negative instances are HOG
features at random locations of the images. The data used in Question 3 is actually generated using
this function.
4. Use HW2 Utils.detect to run the sliding window detector. This returns a list of locations and
SVM scores. This function can be used for detecting upper bodies in an image. It can also be used to
find hardest negative examples in an image.
5. Use HW2 Utils.cmpFeat to compute HOG feature vector for an image patch.
6. Use HW2 Utils.genRsltFile to generate result file.
7. Use HW2 Utils.cmpAP to compute the Average Precision for the result file.
5
Algorithm 2 Hard negative mining algorithm
P osD ← all annotated upper bodies
NegD ← random image patches
(w, b) ← trainSVM(P osD, NegD)
for iter = 1, 2, · · · do
A ← All non support vectors in NegD.
B ← Hardest negative examples . Run UB detection and find negative patches that
. violate the SVM margin constraint the most
NegD ← (NegD \ A) ∪ B.
(w, b) ← trainSVM(P osD, NegD)
end for
8. Use HW2 Utils.rectOverlap to compute the overlap between two rectangular regions. The
overlap is defined as the area of the intersection over the area of the union. A returned detection region
is considered correct (true positive) if there is an annotated upper body such that the overlap between
the two boxes is more than 0.5.
9. Some useful Matlab functions to work with images are: imread, imwrite, imshow, rgb2gray, imresize.
4.4 What to implement?
1. (5 points) Use the training data in HW2 Utils.getPosAndRandomNeg() to train an SVM classifier. Use this classifier to generate a result file (use HW2 Utils.genRsltFile) for validation data.
Use HW2 Utils.cmpAP to compute the AP and plot the precision recall curve. Submit your AP and
precision recall curve (on validation data).
2. Implement hard negative mining algorithm given in Algorithm 2. Positive training data and random
negative training data can be generated using HW2 Utils.getPosAndRandomNeg(). At each
iteration, you should remove negative examples that do not correspond to support vectors from the
negative set. Use the function HW2 Utils.detect on train images to identify hardest negative
examples and include them in the negative training set. Use HW2 Utils.cmpFeat to compute
HOG feature vectors.
Hints: (1) a negative example should not have significant overlap with any annotated upper body. You
can experiment with different threshold but 0.3 is a good starting point. (2) make sure you normalize
the feature vectors for new negative examples. (3) you should compute the objective value at each
iteration; the objective values should not decrease.
3. (10 points) Run the negative mining for 10 iterations. Assume your computer is not so powerful and so
you cannot add more than 1000 new negative training examples at each iteration. Record the objective
values (on train data) and the APs (on validation data) through the iterations. Plot the objective values.
Plot the APs.
4. (5 points) For this question, you will need to generate a result file for test data using the function
HW2 Utils.genRsltFile. You will need to submit this file to our evaluation sever (https:
//goo.gl/forms/RFAJpiUxJvCdtjTw2) to receive the AP on test data. Report the AP in your
answer file. Important Note: You MUST use your Stony Brook ID to name your submission file,
i.e., your SBU ID.mat (e.g., 012345679.mat). Your submission will not be evaluated if you don’t
use your SBU ID.
6
5. (10 bonus points) Your submitted result file for test data will be automatically entered a competition
for fame. We will maintain a leader board at (http://goo.gl/L1cSxB) and the top three entries
at the end of the competition (due date) will receive 10 bonus points. The ranking is based on AP.
You can submit the result as frequent as you want. However, the evaluation server will only evaluate
all submissions three times a day, at 11:00am, 5:00pm, and 11:00pm. The system only keeps the recent
submission file, and your new submission will override the previous ones. Therefore, you have three
chances a day to evaluate your method. The leader board will be updated in 30 minutes after every
evaluation.
You are allowed to use any feature types for this part of the homework. For example, you can use
different parameter settings for HOG feature computation. You can even combine multiple HOG
features. You can also append HOG features with geometric features (e.g., think about the locations
of the upper body). You are allowed to perform different types of feature normalization (e.g, L1,
L2). You can use both training and validation data to train your classifier. You are allowed to use
SVMs, Ridge Regression, Lasso Regression, or any technique that we have covered. You can run hard
negative mining algorithm for as many iterations as you want, and the number of negative examples
added at each iteration is not limited by 1000.
5 What to submit?
5.1 Blackboard submission
You will need to submit both your code and your answers to questions on Blackboard. Do not submit the
provided data. Put the answer file and your code in a folder named: SUBID FirstName LastName (e.g.,
10947XXXX lionel messi). Zip this folder and submit the zip file on Blackboard. Your submission must be a
zip file, i.e, SUBID FirstName LastName.zip. The answer file should be named: answers.pdf, and it should
contain:
1. Answers to Question 1 and 2
2. Answers to Question 3.1 and 3.2, including the requested plots.
3. Answers to Question 4.3, including the requested plots.
5.2 Prediction submission
For Question 3.2.7, you must submit a .csv file to get the accuracy through Kaggle (https://www.
kaggle.com/c/hw2-activity-recognition-cse512-spr18/). A submission file should contain two columns: Id and Class. The file should contain a header and have the following format.
Id, Class
1, 1
2, 10
… …
A sample submission file is available from the competition site and our handout.
For Questions 4.4.4, 4.4.5, you must submit a mat file to get the AP through https://goo.gl/
forms/RFAJpiUxJvCdtjTw2. You MUST use your Stony Brook CS email account to submit. A submission file can be automatically generated by HW2 Utils.genRsltFile.
6 Cheating warnings
Don’t cheat. You must do the homework yourself, otherwise you won’t learn. You must use your SBU ID
as your file name for the competition. Do not fake your Stony Brook ID to bypass the submission limitation
per 24 hours. Doing so will be considered cheating.
7
References Cited
[1] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2005.
8

CSE512 – Machine Learning Homework 2

Description

Related products

CSE 512: Homework 3 solved

CSE 512 Assignment 2

CSE512 – Machine Learning Homework 5