Description
1 Kernel SVM [15 pts]
Consider the following kernel function defined over z, z′ ∈ Z:

    k(z, z′) = 1 if z = z′, and 0 otherwise.
1. (5 pts) Prove that for any integer m > 0 and any z1, . . . , zm ∈ Z, the m × m kernel matrix K = [Kij] is
positive semi-definite, where Kij = k(zi, zj) for i, j ∈ {1, . . . , m}. (Let us assume that for i ̸= j, we have
zi ̸= zj.)
2. (5 pts) Given a training set (z1, y1), . . . , (zn, yn) with binary labels, the dual SVM problem with the above
kernel k will have parameters α1, . . . , αn, b ∈ R. (Assume that for i ̸= j, we have zi ̸= zj.) The predictor
for input z takes the form

    f(z) = Σ_{i=1}^{n} αi yi k(zi, z) + b.
Recall that the label prediction is sgn(f(z)). Prove that there exist α1, . . . , αn, b such that f correctly separates
the training set. In other words, k induces a feature space rich enough that any training set can be
linearly separated in it.
3. (5 pts) How does this f predict an input z that is not in the training set?
Comment: One useful property of kernel functions is that the input space Z does not need to be a vector space; in
other words, z does not need to be a feature vector. For all we know, Z could be the set of turkeys in the world. As long as we
can compute k(z, z′), kernel SVM works on turkeys.
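To make the comment concrete, here is a minimal sketch (an illustration, not part of the assignment) of the kernel from Question 1 applied to arbitrary objects, together with a numerical sanity check, not a proof, that the resulting kernel matrix on distinct points is positive semi-definite. The function and variable names are ours.

```python
import numpy as np

def k(z, z_prime):
    """Kernel from Question 1: 1 if the inputs are equal, 0 otherwise.
    Works for any objects that support ==, not just feature vectors."""
    return 1.0 if z == z_prime else 0.0

# Z does not need to be a vector space: here the inputs are strings ("turkeys").
zs = ["turkey_a", "turkey_b", "turkey_c", "turkey_d"]

# Kernel (Gram) matrix K with K[i, j] = k(z_i, z_j).
K = np.array([[k(zi, zj) for zj in zs] for zi in zs])

# Numerical sanity check: all eigenvalues should be >= 0, i.e. K is positive
# semi-definite. With distinct points, K is simply the identity matrix.
print(K)
print(np.linalg.eigvalsh(K))  # expect all ones here
```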
2 Game of Classifiers [50 pts]
2.1 Implementation
Implement the following models in a programming language of your choice. Include slack variables in your SVM implementation if needed. You may use the autograd features of PyTorch, TensorFlow, etc., or derive the gradients on your own.
(But do not use built-in SVM, Kernel SVM, or Logistic Regression models from libraries.)
• Implement Linear SVM (without kernels).
• Implement Kernel SVM, with options for linear, RBF, and polynomial kernels. Keep the kernel
parameters tunable (e.g., do not fix the degree of the polynomial kernel; keep it as a variable and experiment with
different values of it). Is Linear SVM a special case of Kernel SVM? (A sketch follows this list.)
• Implement Logistic Regression with and without kernels (use the same kernel as in Question 1).
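As a starting point, the sketch below shows the three kernel functions with tunable parameters and a simple primal linear SVM trained by subgradient descent on the hinge loss with a slack penalty C. This is only an illustrative outline under assumed interfaces (the function names and hyperparameters such as gamma, degree, coef0, lr, and epochs are ours), not a required implementation.

```python
import numpy as np

# --- Tunable kernel functions (parameter defaults are illustrative) ---
def linear_kernel(x, z):
    return x @ z

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def poly_kernel(x, z, degree=3, coef0=1.0):
    return (x @ z + coef0) ** degree

def gram_matrix(X1, X2, kernel, **params):
    """Kernel matrix K[i, j] = kernel(X1[i], X2[j])."""
    return np.array([[kernel(x, z, **params) for z in X2] for x in X1])

# --- Primal linear SVM via subgradient descent on the hinge loss ---
class LinearSVM:
    def __init__(self, C=1.0, lr=1e-3, epochs=500):
        self.C, self.lr, self.epochs = C, lr, epochs

    def fit(self, X, y):
        """X: (n, d) features, y: (n,) labels in {-1, +1}."""
        n, d = X.shape
        self.w, self.b = np.zeros(d), 0.0
        for _ in range(self.epochs):
            margins = y * (X @ self.w + self.b)
            viol = margins < 1                      # points with nonzero slack
            grad_w = self.w - self.C * (y[viol, None] * X[viol]).sum(axis=0)
            grad_b = -self.C * y[viol].sum()
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b
        return self

    def decision_function(self, X):
        return X @ self.w + self.b

    def predict(self, X):
        return np.sign(self.decision_function(X))
```

A kernel SVM or kernel logistic regression can then operate entirely through calls like gram_matrix(X_train, X_train, rbf_kernel, gamma=0.5), without ever touching raw feature vectors directly.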
2.2 Evaluation on Synthetic Dataset
2.2.1 Synthetic Dataset-1 (20 pts)
Generate a 2-D dataset as follows.
Let µ = 2.5 and let I2 be the 2 × 2 identity matrix. Generate points for the positive and negative classes respectively
from N([µ, 0], I2) and N([−µ, 0], I2). Generate 750 points per class (1500 in total).
Randomly create train, validation, and test splits of 1000, 250, and 250 points respectively (a data-generation
sketch follows the list below). Do the following with this dataset:
• (5 pts) Train your Linear SVM and Logistic Regression models; report their decision boundaries and test accuracies.
• (5 pts) Show decision boundaries with K-NN and Naive Bayes classifiers. (You can use library implementations or implement them from scratch. Tune the hyper-parameters using the validation set.)
• (5 pts) Repeat the process, varying µ from 1.0 to 2.4 with a step size of 0.2. For each value of µ, obtain the test
accuracies of the models and plot them (µ on the x-axis, test accuracy on the y-axis). (You will have one curve for
each of the four classifiers mentioned above.)
• (5 pts) What are your conclusions from this exercise?
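A minimal data-generation sketch for this dataset, assuming NumPy; the variable names, seed, and the shuffling approach to splitting are illustrative, not required.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility (optional)
mu = 2.5
I2 = np.eye(2)

# 750 points per class: positive class centered at [mu, 0], negative at [-mu, 0].
X_pos = rng.multivariate_normal([mu, 0.0], I2, size=750)
X_neg = rng.multivariate_normal([-mu, 0.0], I2, size=750)
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(750), -np.ones(750)])

# Random train / validation / test split of sizes 1000 / 250 / 250.
perm = rng.permutation(len(X))
train, val, test = perm[:1000], perm[1000:1250], perm[1250:]
X_train, y_train = X[train], y[train]
X_val, y_val = X[val], y[val]
X_test, y_test = X[test], y[test]
```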
2.2.2 Synthetic Dataset-2 (20 pts)
Generate 1500 data points from the 2-D circles dataset in sklearn (sklearn.datasets.make_circles).
Randomly create train, validation, and test splits of 1000, 250, and 250 points respectively, and evaluate the above
classifiers in this setting (a plotting sketch follows the list below).
• (5 pts) Show decision boundaries for Linear SVM and Logistic Regression classifiers.
• (5 pts) Show decision boundaries for Kernel SVM and Kernel Logistic Regression (use RBF and polynomial
kernels). Try different hyperparameter values and report results for whichever works best.
• (5 pts) Train the Neural Network from HW4 and K-NN classifiers on this dataset and show decision boundaries.
(You can use library implementations for these classifiers.)
• (5 pts) What are your conclusions from this exercise?
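A sketch of generating the circles data and drawing a decision boundary on a mesh grid, assuming scikit-learn and matplotlib; clf stands in for any fitted classifier with a predict method, and the noise/factor values are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# 1500 points from the 2-D circles dataset (noise and factor are illustrative).
X, y = make_circles(n_samples=1500, noise=0.1, factor=0.5, random_state=0)

def plot_decision_boundary(clf, X, y, title=""):
    """Color the plane by the classifier's prediction and overlay the data."""
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                         np.linspace(y_min, y_max, 300))
    grid = np.c_[xx.ravel(), yy.ravel()]
    Z = clf.predict(grid).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=8)
    plt.title(title)
    plt.show()
```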
2.3 Evaluation on Real Dataset (10 pts)
Let's put all this to some real use. For this problem, use the Wisconsin Breast Cancer dataset. You can download
it from the sklearn library (sklearn.datasets.load_breast_cancer).
• (5 pts) Do all the points of Section 2.2.2 on this dataset. Since this is high-dimensional data, you don't
have to show the decision boundaries; just report test accuracies for these classifiers and discuss your
findings.
• (5 pts) In addition, you also want to figure out the important features that determine the class. Which
regularization will you use for this? Upgrade your SVM and Kernel SVM implementations to include this
regularization. Discuss the important features that you obtain by running your regularized SVM
on this dataset. (You might have to normalize this dataset before training any classifier; a loading-and-normalization sketch follows this list.)
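A minimal sketch of loading the dataset and normalizing the features, assuming scikit-learn; mapping the labels to {-1, +1} and the choice of StandardScaler are our assumptions, made only to match the SVM convention used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target          # 569 samples, 30 features; labels in {0, 1}
y = np.where(y == 1, 1, -1)            # map to {-1, +1} for the SVM formulation
feature_names = data.feature_names     # useful when inspecting important features

# Standardize each feature to zero mean and unit variance (in a real pipeline,
# fit the scaler on the training split only; shown on the full set for brevity).
X = StandardScaler().fit_transform(X)
```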
3 VC dimension [20 pts]
1. (7 pts) Let the input x ∈ X = R. Consider F = {f(x) = sgn(ax² + bx + c) : a, b, c ∈ R}, where
sgn(z) = 1 if z ≥ 0, and 0 otherwise. What is VC(F)? Prove it.
Shatter Coefficient and VC dimension
The main intuition behind VC theory is that, although a collection of classifiers may be infinite, using a finite
set of training data to select a good rule effectively reduces the number of different classifiers we need to
consider.
We can measure the effective size of the class F using the shatter coefficient. Suppose we have a training
set Dn = {(xi, yi)}_{i=1}^{n} for a binary classification problem with labels yi ∈ {−1, +1}. Each classifier in F
produces a binary label sequence

    (f(x1), · · · , f(xn)) ∈ {−1, +1}^n.

There are at most 2^n distinct sequences, but often not all sequences can be generated by functions in F.
The shatter coefficient S(F, n) is defined as the maximum number of labeling sequences the class F induces
over n training points in the feature space X. More formally,

    S(F, n) = max_{x1, · · · , xn ∈ X} |{ (f(x1), · · · , f(xn)) ∈ {−1, +1}^n : f ∈ F }|,

where | · | denotes the number of elements in the set. The Vapnik-Chervonenkis (VC) dimension V(F) of a
class F is defined as the largest integer k such that S(F, k) = 2^k.
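As an illustration of these definitions (our own example, not part of the assignment), consider the class of constant classifiers, worked out below.

```latex
% Illustrative example: F = {f_{+}, f_{-}} with f_{+}(x) = +1 and f_{-}(x) = -1 for all x.
% Only the all-(+1) and all-(-1) label sequences are achievable, so
\[
  S(\mathcal{F}, n) = 2 \quad \text{for every } n \ge 1.
\]
% Hence
\[
  S(\mathcal{F}, 1) = 2 = 2^1, \qquad S(\mathcal{F}, 2) = 2 < 2^2,
  \quad\Longrightarrow\quad V(\mathcal{F}) = 1.
\]
```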
2. (7 pts) Suppose there are n points x1, · · · , xn ∈ R. Let F be the collection of 1-d linear classifiers: for
each t ∈ [0, 1], ft ∈ F labels x ≤ t as −1 and x > t as +1, or vice versa. What is the shatter coefficient
S(F, n)? What is the VC-dimension V(F)? How can you obtain it from the shatter coefficient?
3. (6 pts) Show the following monotonicity property of VC-dimension: if F and F′ are hypothesis classes
with F ⊂ F′, then the VC-dimension of F′ is greater than or equal to that of F.
4 Q-learning [15 pts]
Consider the following Markov Decision Process. It has two states s (A and B) and two actions a: move and stay. The
state transition is deterministic: “move” moves to the other state, while “stay” stays at the current state.
The reward r is 0 for move and 1 for stay. There is a discount factor γ = 0.8.
[Figure: two-state MDP with states A and B; “stay” self-loops at either state with reward +1, “move” transitions to the other state with reward 0.]
The reinforcement learning agent performs Q-learning. Recall that the Q table has entries Q(s, a). The Q table is
initialized with all zeros.
The agent starts in state s1 = A. In any state st, the agent chooses the action at
according to a behavior policy at = πB(st). Upon experiencing the next state st+1 and reward rt, the update is:
    Q(st, at) ← (1 − α) Q(st, at) + α [ rt + γ max_{a′} Q(st+1, a′) ].
Let the step size parameter α = 0.5.
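A minimal simulation sketch of the tabular update above for this MDP, reusable for both parts below; the policy argument, the tie-breaking choice, the random seed, and the function names are ours, not prescribed by the assignment.

```python
import random

GAMMA, ALPHA = 0.8, 0.5
STATES, ACTIONS = ["A", "B"], ["move", "stay"]

def step(s, a):
    """Deterministic dynamics: 'move' switches state (reward 0), 'stay' keeps it (reward 1)."""
    if a == "move":
        return ("B" if s == "A" else "A"), 0.0
    return s, 1.0

def policy(Q, s, eps=0.0):
    """Greedy w.r.t. Q, preferring 'move' on ties; with probability eps act uniformly at random."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return "move" if Q[(s, "move")] >= Q[(s, "stay")] else "stay"

def q_learning(n_steps=200, eps=0.0, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = "A"                                            # agent starts in state A
    for _ in range(n_steps):
        a = policy(Q, s, eps)                          # behavior policy
        s_next, r = step(s, a)
        target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target
        s = s_next
    return Q

# Part 1: deterministic greedy policy; Part 2: eps-greedy with eps = 0.5.
print(q_learning(n_steps=200, eps=0.0))
print(q_learning(n_steps=200, eps=0.5))
```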
1. (5 pts) Run Q-learning for 200 steps with a deterministic greedy behavior policy: at each state st, use the
best action at ∈ arg max_a Q(st, a) indicated by the current Q table. If there is a tie, prefer move. Show the
Q table at the end.
2. (5 pts) Reset and repeat the above, but with an ϵ-greedy behavior policy: at each state st, with probability
1 − ϵ choose what the current Q table says is the best action, arg max_a Q(st, a), breaking ties arbitrarily.
Otherwise (with probability ϵ), choose uniformly between move and stay (each with probability 1/2).
Use ϵ = 0.5.
3. (5 pts) Without running a simulation, use the Bellman equation to derive the true Q table induced by the MDP.