Description
1. [5pts] Representer Theorem. In this question, you’ll prove and apply a simplified version
of the Representer Theorem, which is the basis for a lot of kernelized algorithms. Consider a
linear model:
    z = w^⊤ ψ(x),
    y = g(z),
where ψ is a feature map and g is some function (e.g. identity, logistic, etc.). We are given a training set {(x^(i), t^(i))}_{i=1}^N. We are interested in minimizing the average loss over the training set plus an L2 regularization term:
    J(w) = (1/N) ∑_{i=1}^N L(y^(i), t^(i)) + (λ/2) ‖w‖^2,
where L is some loss function. Let Ψ denote the feature matrix

    Ψ = [ψ(x^(1))^⊤; … ; ψ(x^(N))^⊤],

whose i-th row is ψ(x^(i))^⊤.
Observe that this formulation captures a lot of the models we’ve covered in this course,
including linear regression, logistic regression, and SVMs.
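
To make the setup concrete, here is a small NumPy sketch (nothing you need to hand in) of the special case where g is the identity and L is the squared-error loss; the feature map psi below is an arbitrary choice for illustration:

    import numpy as np

    def psi(x):
        # Hypothetical feature map for illustration: psi(x) = (1, x, x^2) for scalar x.
        return np.array([1.0, x, x ** 2])

    def J(w, xs, t, lam):
        # Regularized cost with g = identity and L(y, t) = (1/2)(y - t)^2,
        # i.e. J(w) = (1/N) * sum_i L(y^(i), t^(i)) + (lam/2) * ||w||^2.
        Psi = np.stack([psi(x) for x in xs])   # feature matrix: i-th row is psi(x^(i))^T
        y = Psi @ w                            # y^(i) = g(z^(i)) = w^T psi(x^(i))
        return 0.5 * np.mean((y - t) ** 2) + 0.5 * lam * (w @ w)

    # Example: evaluate the cost on a tiny synthetic training set.
    xs = np.array([0.0, 1.0, 2.0])
    t = np.array([1.0, 2.0, 5.0])
    print(J(np.zeros(3), xs, t, lam=0.1))

Swapping in a different g and L (e.g. the logistic function and cross-entropy) gives the other models mentioned above.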
(a) [2pts] Show that the optimal weights must lie in the row space of Ψ.
Hint: Given a subspace S, a vector v can be decomposed as v = v_S + v_⊥, where v_S is the projection of v onto S and v_⊥ is orthogonal to S. (You may assume this fact without proof, but you can review it at https://metacademy.org/graphs/concepts/projection_onto_a_subspace.) Apply this decomposition to w and see if you can show something about one of the two components.
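
If you want to see the decomposition in the hint numerically (again, not required for the proof), the projection of w onto the row space of Ψ can be computed with a pseudoinverse; Psi and w below are arbitrary examples:

    import numpy as np

    rng = np.random.default_rng(0)
    Psi = rng.standard_normal((5, 8))   # arbitrary feature matrix (N = 5, feature dimension 8)
    w = rng.standard_normal(8)          # arbitrary weight vector

    # Projection of w onto the row space of Psi: w_S = Psi^T (Psi Psi^T)^+ Psi w.
    w_S = Psi.T @ np.linalg.pinv(Psi @ Psi.T) @ (Psi @ w)
    w_perp = w - w_S

    print(np.allclose(Psi @ w_perp, 0))   # w_perp is orthogonal to every row of Psi
    print(np.allclose(w_S + w_perp, w))   # the two pieces add back up to w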
(b) [3pts] Another way of stating the result from part (a) is that w = Ψ^⊤α for some vector α. Hence, instead of solving for w, we can solve for α. Consider the vectorized form of the L2 regularized linear regression cost function:

    J(w) = 1/(2N) ‖t − Ψw‖^2 + (λ/2) ‖w‖^2.
Substitute w = Ψ^⊤α to write the cost function as a function of α. Determine the optimal value of α. Your answer should be an expression involving λ, t, and the Gram matrix K = ΨΨ^⊤. For simplicity, you may assume that K is positive definite. (The algorithm still works if K is merely PSD; it's just a bit more work to derive.)
Hint: the cost function J(α) is a quadratic function of α. Simplify the formula into the following form:

    (1/2) α^⊤Aα + b^⊤α + c,

for some positive definite matrix A, vector b, and constant c (which can be ignored). You may assume without proof that the minimum of such a quadratic function is given by α = −A^{−1}b.
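
The quadratic-minimum fact at the end of the hint is easy to check numerically; the A and b below are arbitrary examples, not the ones you are asked to derive:

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.standard_normal((4, 4))
    A = M @ M.T + 4.0 * np.eye(4)   # an arbitrary positive definite matrix
    b = rng.standard_normal(4)      # an arbitrary vector

    def f(alpha):
        # f(alpha) = (1/2) alpha^T A alpha + b^T alpha  (the constant c is dropped).
        return 0.5 * alpha @ A @ alpha + b @ alpha

    alpha_star = -np.linalg.solve(A, b)         # alpha = -A^{-1} b
    print(np.allclose(A @ alpha_star + b, 0))   # the gradient A alpha + b vanishes here
    print(all(f(alpha_star) <= f(alpha_star + d)
              for d in rng.standard_normal((200, 4))))   # no random perturbation does better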
2. [4pts] Compositional Kernels. One of the most useful facts about kernels is that they can be composed using addition and multiplication. That is, the sum of two kernels is a kernel, and the product of two kernels is a kernel. We'll show this in the case of kernels that represent dot products between finite-dimensional feature vectors.
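
As a quick numerical sanity check of this claim (it does not replace the constructions asked for below), you can verify that the sum and the elementwise product of two Gram matrices built from explicit feature maps are still positive semidefinite; the two feature maps here are arbitrary:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((6, 3))   # six arbitrary input points in R^3

    def psi1(x):
        return np.concatenate([x, x ** 2])             # arbitrary feature map psi_1

    def psi2(x):
        return np.array([1.0, x.sum(), x[0] * x[1]])   # arbitrary feature map psi_2

    def gram(psi):
        F = np.stack([psi(x) for x in X])
        return F @ F.T                                 # K[i, j] = psi(x_i)^T psi(x_j)

    K1, K2 = gram(psi1), gram(psi2)
    for K in (K1 + K2, K1 * K2):                       # sum kernel, product kernel (elementwise)
        print(np.linalg.eigvalsh(K).min() >= -1e-9)    # eigenvalues are (numerically) nonnegative

Note that the elementwise product K1 * K2 is exactly the Gram matrix of the product kernel kP(x, x′) = k1(x, x′) k2(x, x′).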
(a) [1pt] Suppose k1(x, x′) = ψ1(x)^⊤ψ1(x′) and k2(x, x′) = ψ2(x)^⊤ψ2(x′). Let kS be the sum kernel kS(x, x′) = k1(x, x′) + k2(x, x′). Find a feature map ψS such that kS(x, x′) = ψS(x)^⊤ψS(x′).
(b) [3pts] Suppose k1(x, x′) = ψ1(x)^⊤ψ1(x′) and k2(x, x′) = ψ2(x)^⊤ψ2(x′). Let kP be the product kernel kP(x, x′) = k1(x, x′) k2(x, x′). Find a feature map ψP such that kP(x, x′) = ψP(x)^⊤ψP(x′).
Hint: For inspiration, consider the quadratic kernel from Lecture 20, Slide 11.