Description
Problem 1. In this problem, we will consider the limitations of the linear classifier.
a) Consider the set of binary classification problems, where the observation variable X = (X1, X2)
is binary, i.e. X ∈ {0, 1} × {0, 1}, and non-degenerate, i.e. all possible configurations (X1, X2) have
non-zero probability. Consider the decision function
h(x) = 1 if w^T x + b ≥ 0, and h(x) = 0 if w^T x + b < 0,
and let the class label Y ∈ {0, 1} be a deterministic function of X, Y = f(X). Show that there is at
least one binary mapping
f : {0, 1} × {0, 1} → {0, 1}
such that the observation X cannot be classified with zero probability of error with the decision function
h(x). Since Y = f(X) is deterministic, zero probability of error is in principle achievable on this problem,
so this limitation of the linear classifier is clearly problematic. Compute the smallest probability of
error achievable with your function f(·), when all configurations of X are equally likely.
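As a sanity check, the sketch below brute-forces a grid of candidate (w, b) pairs; the grid and the choice of XOR as the candidate mapping f are assumptions, so this is an illustration, not a proof:

```python
import itertools

import numpy as np

# The four binary inputs and two candidate mappings f.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # the classic non-separable candidate

def separable(y, grid=np.linspace(-2, 2, 21)):
    """Brute-force search for (w1, w2, b) with h(x) = 1[w^T x + b >= 0] matching y."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        h = (X @ np.array([w1, w2]) + b >= 0).astype(int)
        if np.array_equal(h, y):
            return True
    return False
```

The search finds a separator for AND but none for XOR, which is consistent with the claim that some f cannot be realized by h.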
b) The extension of the linear classifier to problems with M classes is to define a set of functions
gi(x) = wi^T x + bi,  i = 1, . . . , M,
and assign x to class i if x ∈ Ri, where
Ri = {x | gi(x) ≥ gj(x), ∀ j ≠ i}.
Show that this classifier can only implement convex decision regions by showing that, if x1 ∈ Ri and
x2 ∈ Ri, then λx1 + (1 − λ)x2 ∈ Ri, ∀ 0 ≤ λ ≤ 1.
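As a hint, note that each gi is affine, so it distributes over convex combinations; in LaTeX,

```latex
g_i\bigl(\lambda x_1 + (1-\lambda) x_2\bigr)
  = w_i^T\bigl(\lambda x_1 + (1-\lambda) x_2\bigr) + b_i
  = \lambda g_i(x_1) + (1-\lambda) g_i(x_2).
```

Combining this identity with the inequalities defining Ri at x1 and x2 gives the result.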
Problem 2. Let g(x) = wT x + b and consider the hyperplane g(x) = 0.
a) Show that the Euclidean distance from a point xa to the hyperplane is |g(xa)|/||w|| by minimizing
||x − xa||² subject to g(x) = 0.
b) Show that the projection of xa onto the hyperplane is
xp = xa − (g(xa)/||w||²) w.
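Both formulas admit a quick numerical check; the hyperplane and point below are randomly drawn, hypothetical examples:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)      # hypothetical hyperplane normal (non-zero w assumed)
b = rng.normal()
xa = rng.normal(size=3)     # hypothetical point

g = lambda x: w @ x + b

xp = xa - (g(xa) / (w @ w)) * w          # claimed projection
dist = abs(g(xa)) / np.linalg.norm(w)    # claimed distance

on_plane = abs(g(xp)) < 1e-9             # xp satisfies g(x) = 0
dist_matches = abs(np.linalg.norm(xa - xp) - dist) < 1e-9
```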
Problem 3. Given a training set D = {(x1, y1), . . . ,(xn, yn)}, the linear regression problem consists of
finding the parameters a = (w, b) of the function
f(x) = wT x + b
which minimize the empirical risk
L(w, b) = Σ_{i=1}^{n} (yi − w^T xi − b)².
a) Show that the optimal solution is of the form
a = (X^T X)^{−1} X^T y
and determine how the matrix X and vector y depend on the training patterns.
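For intuition, here is a minimal numpy sketch of the closed-form solution, assuming the design matrix X stacks rows (xi^T, 1) so that a = (w, b); the synthetic, noiseless data is an assumption that makes recovery exact:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
Xraw = rng.normal(size=(n, d))           # synthetic inputs (an assumption)
w_true, b_true = rng.normal(size=d), 0.5
y = Xraw @ w_true + b_true               # noiseless targets, so recovery is exact

# Design matrix: row i is (x_i^T, 1), so the parameter vector is a = (w, b)
X = np.hstack([Xraw, np.ones((n, 1))])
a = np.linalg.solve(X.T @ X, X.T @ y)    # a = (X^T X)^{-1} X^T y

recovered = np.allclose(a[:d], w_true) and np.isclose(a[d], b_true)
```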
b) The matrix inversion above can be too expensive for some applications. Derive a stochastic gradient
descent algorithm for this problem, i.e. an algorithm similar to the one we have studied in class for the
Perceptron.
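One possible SGD sketch for this problem (the learning rate, epoch count, and synthetic data are assumptions; each update follows the negative gradient of the per-sample squared error):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 2
X = rng.normal(size=(n, d))                  # synthetic data (an assumption)
w_true, b_true = np.array([1.0, -2.0]), 0.5
y = X @ w_true + b_true

w, b, eta = np.zeros(d), 0.0, 0.01           # eta is an assumed learning rate
for epoch in range(50):
    for i in rng.permutation(n):
        err = y[i] - (w @ X[i] + b)          # residual on the sampled pattern
        w = w + eta * err * X[i]             # step along the negative gradient
        b = b + eta * err

converged = np.allclose(w, w_true, atol=0.05) and abs(b - b_true) < 0.05
```

On noiseless data the iterates approach the exact least-squares parameters without any matrix inversion.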
c) We have discussed in class that it is usually a good idea to penalize “complicated solutions.” In the
regression problem, this can be accomplished by minimizing the alternative empirical risk
L(w, b) = Σ_{i=1}^{n} (yi − w^T xi − b)² + λ w^T w,
resulting in what is usually called ridge regression. The second term favors solutions that make w close
to zero, therefore penalizing those that have many degrees of freedom. Show that the optimal solution
can be written as
w∗ = Σ_i αi xi
and use this to re-write L(w, b) in a form that only depends on the training patterns xi through their
dot-products. Solve the resulting problem to obtain the function
y = f(x)
in a form that only depends on the training patterns through their dot-products with x.
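The dot-product (dual) form can be verified numerically; the sketch below ignores the bias b for simplicity and checks the matrix identity behind w∗ = Σ_i αi xi:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 30, 5, 0.1
X = rng.normal(size=(n, d))   # rows are x_i^T; the bias is ignored for simplicity
y = rng.normal(size=n)

# Primal ridge solution: w = (X^T X + lam I)^{-1} X^T y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual form w = sum_i alpha_i x_i with alpha = (X X^T + lam I)^{-1} y;
# note X X^T is the Gram matrix of dot-products x_i^T x_j
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ alpha

forms_agree = np.allclose(w_primal, w_dual)
```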
Problem 4. We have seen in class that, for a training set D = {(x1, y1), . . . ,(xn, yn)}, the Perceptron
algorithm can be implemented as follows.
• set w0 = 0, b0 = 0, k = 0, R = max_i ||xi||
• repeat
– for i = 1, . . . , n
∗ if yi(wk^T xi + bk) ≤ 0 then
· set wk+1 = wk + η yi xi
· set bk+1 = bk + η yi R²
· set k = k + 1
• until there are no classification errors
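A direct Python transcription of the pseudocode above, run on a hypothetical separable toy set (the data and the iteration cap are assumptions; the course materials themselves use MATLAB):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical linearly separable toy set, labels in {-1, +1}
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

eta = 1.0
w, b = np.zeros(2), 0.0
R = max(np.linalg.norm(x) for x in X)

converged = False
for _ in range(100):                       # cap on the "repeat" loop
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:         # misclassified (or on the boundary)
            w = w + eta * yi * xi
            b = b + eta * yi * R ** 2
            errors += 1
    if errors == 0:
        converged = True
        break
```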
a) Is the learning rate η relevant? Why?
b) Show that an equivalent algorithm is
• set α = 0, b = 0, R = max_i ||xi||
• repeat
– for i = 1, . . . , n
∗ if yi(Σ_{j=1}^{n} αj yj xj^T xi + b) ≤ 0 then
· set αi = αi + 1
· set b = b + yi R²
• until there are no classification errors
c) Can you give an interpretation to the parameters αi? Which among the samples xi are the hardest
to classify?
d) One of the interesting properties of this implementation is that it only depends on the dot-products
xi^T xj. Can you re-write the decision function
h(x) = sgn(wT x + b)
in this form?
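One way to see the dot-product dependence concretely: the dual-form sketch below precomputes the Gram matrix and evaluates the decision function through dot-products only (the toy data is assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
R = max(np.linalg.norm(x) for x in X)

alpha, b = np.zeros(len(X)), 0.0
G = X @ X.T                                # Gram matrix of dot-products x_i^T x_j

converged = False
for _ in range(100):
    errors = 0
    for i in range(len(X)):
        if y[i] * ((alpha * y) @ G[:, i] + b) <= 0:
            alpha[i] += 1
            b += y[i] * R ** 2
            errors += 1
    if errors == 0:
        converged = True
        break

# Decision function through dot-products only: h(x) = sgn(sum_j alpha_j y_j x_j^T x + b)
h = lambda x: np.sign((alpha * y) @ (X @ x) + b)
```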
Problem 5. In this problem, we will design a neural network for image classification. In all questions,
you will use the training set contained in the directory MNISTtrain (60,000 examples) and the test set
in the directory MNISTtest (10,000 examples). To read the data, you should use the script readMNIST.m (use
readDigits=60000 or readDigits=10000, respectively, and offset=0). This returns two matrices. The
first (imgs) is a matrix with n ∈ {10000, 60000} rows, where each row is a 28 × 28 image of a digit,
vectorized into a 784-dimensional row vector. The second (labels) is a matrix with n ∈ {10000, 60000}
rows, where each row contains the class label for the corresponding image in imgs. In all problems, you
should initialize the weights as samples from independent Gaussian random variables of zero mean and
unit variance. Note that the network should have a bias term. In what follows, we assume homogeneous
coordinates.
a) We start by designing a single layer neural network. For input x^n, the activation of unit k is given
by
a_k^n = w_k^T x^n.
This is then fed to the softmax non-linearity
y_k^n = exp(a_k^n) / Σ_j exp(a_j^n),
which produces an estimate y_k^n = P_{Z|X}(k|x^n) of the probability that x^n belongs to class Z = k.
Given a training set {(x^n, t^n)}, where t^n is the class label for example x^n, the network is trained
to minimize the cross-entropy cost
E(w) = − Σ_n Σ_{k=1}^{c} t_k^n ln y_k^n,
where w is the vector of all weights. As usual, we use backpropagation, which reduces to the weight
updates
w_jk^{(t+1)} = w_jk^{(t)} − η ∂E^n(w)/∂w_jk,
where E^n(w) is the term of the cost due to example x^n.
i) Show that the gradient is
−∂E^n(w)/∂w_jk = (t_k^n − y_k^n) x_j^n.
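The claimed gradient can be checked against central finite differences; in the sketch below, the stabilized softmax and random single example are implementation choices, not part of the assignment:

```python
import numpy as np

rng = np.random.default_rng(6)
d, c = 5, 3                     # input dimension and number of classes (assumed)
W = rng.normal(size=(d, c))     # column k holds w_k
x = rng.normal(size=d)
t = np.zeros(c); t[1] = 1.0     # one-hot target t^n

def loss(W):
    a = W.T @ x                             # activations a_k = w_k^T x
    z = np.exp(a - a.max())                 # numerically stabilized softmax
    y = z / z.sum()
    return -np.sum(t * np.log(y))

# Analytic gradient from the claimed formula: dE^n/dw_jk = -(t_k - y_k) x_j
a = W.T @ x
z = np.exp(a - a.max())
y = z / z.sum()
grad_analytic = -np.outer(x, t - y)

# Central finite differences over every entry of W
eps = 1e-6
grad_numeric = np.zeros_like(W)
for j in range(d):
    for k in range(c):
        Wp = W.copy(); Wp[j, k] += eps
        Wm = W.copy(); Wm[j, k] -= eps
        grad_numeric[j, k] = (loss(Wp) - loss(Wm)) / (2 * eps)

gradients_match = np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```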
ii) To actually learn the network, we use what is called batch learning. This consists of using all data
to update the weight of each unit, i.e.
w^{(t+1)} = w^{(t)} − η Σ_{n=1}^{N} ∇E^n(w).
Implement this learning algorithm using η = 10^{−5}. Make a plot of the train and test set error, as a
function of the iteration t. Report the final training and test errors.
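A toy batch-learning sketch of this update rule; the synthetic 3-class data is a stand-in for MNIST, and η is taken larger than 10^{−5} so the small example converges quickly (both are assumptions, not the assignment's setting):

```python
import numpy as np

rng = np.random.default_rng(7)
# Small synthetic 3-class stand-in for MNIST (an assumption, not the real data)
n_per, d, c = 30, 4, 3
X = np.vstack([rng.normal(k, 0.5, (n_per, d)) for k in range(c)])
X = np.hstack([X, np.ones((len(X), 1))])       # homogeneous coordinates (bias)
labels = np.repeat(np.arange(c), n_per)
T = np.eye(c)[labels]                           # one-hot targets t^n

W = rng.normal(size=(d + 1, c))                 # zero-mean, unit-variance init
eta = 1e-3                                      # assumed rate for this toy problem

def softmax_outputs(W):
    A = X @ W
    A = A - A.max(axis=1, keepdims=True)
    Y = np.exp(A)
    return Y / Y.sum(axis=1, keepdims=True)

for it in range(2000):                          # batch update: gradient summed over all n
    Y = softmax_outputs(W)
    W = W - eta * X.T @ (Y - T)

train_error = np.mean(softmax_outputs(W).argmax(axis=1) != labels)
```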
b) We next consider a network of two layers. Assume that the first layer has H units, all with
non-linearity σ(·). Again, the output layer uses the softmax non-linearity and the cost function is the
cross-entropy.
i) Derive the equations for backpropagation by stochastic gradient descent for this network. Note that
this involves a different set of derivatives for the first and second layer.
ii) Assuming a sigmoid non-linearity
σ(u) = 1/(1 + e^{−u}),
implement the batch training procedure to train the two-layer network. Again, use η = 10^{−5}, make a
plot of the train and test set error, as a function of the iteration t, and report the final training and test
errors. Repeat the procedure for H ∈ {10, 20, 50}. Compare the performance of the three networks.
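The two sets of derivatives can be validated with a finite-difference spot check; the sketch below uses an assumed single-sample network with sigmoid hidden units and biases omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(8)
d, H, c = 4, 6, 3                        # assumed sizes
x = rng.normal(size=d)                   # a single training pattern
t = np.eye(c)[0]                         # one-hot target
W1 = rng.normal(size=(H, d))             # first-layer weights
W2 = rng.normal(size=(c, H))             # output-layer weights

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

def loss(W1, W2):
    h = sigmoid(W1 @ x)
    a = W2 @ h
    z = np.exp(a - a.max())
    y = z / z.sum()
    return -np.sum(t * np.log(y))

# Backprop: output delta is (y - t); the hidden delta picks up sigma'(u) = h(1 - h)
h = sigmoid(W1 @ x)
a = W2 @ h
z = np.exp(a - a.max())
y = z / z.sum()
delta2 = y - t
gW2 = np.outer(delta2, h)
delta1 = (W2.T @ delta2) * h * (1 - h)
gW1 = np.outer(delta1, x)

# Spot-check one entry of each gradient against central finite differences
eps = 1e-6
W1p = W1.copy(); W1p[2, 1] += eps
W1m = W1.copy(); W1m[2, 1] -= eps
num1 = (loss(W1p, W2) - loss(W1m, W2)) / (2 * eps)
W2p = W2.copy(); W2p[0, 3] += eps
W2m = W2.copy(); W2m[0, 3] -= eps
num2 = (loss(W1, W2p) - loss(W1, W2m)) / (2 * eps)

checks_pass = np.isclose(gW1[2, 1], num1, atol=1e-6) and np.isclose(gW2[0, 3], num2, atol=1e-6)
```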
iii) We next compare the sigmoid and ReLU non-linearities. However, if you attempt to repeat the
procedure of the previous question with the ReLU, you will see that the weights will explode. To avoid
this, we introduce weight decay, i.e. a cost of the form
E = − Σ_n Σ_{k=1}^{c} t_k^n ln y_k^n + (λ/η) ||w||².
Train a two-layer network with sigmoid hidden units and another with ReLU hidden units. Again,
use the softmax on the output layer. In these experiments, set λ = 0.001 and η = 2 × 10^{−6}. Make a
plot of the training and test set error as a function of the iteration t, and report the final training and
test errors. Repeat the procedure for H ∈ {10, 20, 50}. Compare the performance of the six networks.
Which non-linearity achieves better performance and faster network convergence?
iv) We next consider the importance of the regularization weight. Repeat iii) with λ = 0.0001. What
can you say about the importance of regularization?
c) Repeat a) ii) and b) iii) using stochastic gradient descent instead of the batch approach. Comment
on the differences. (Note: in this case, use learning rate η = 10^{−2} for the sigmoid network and
η = 2 × 10^{−3} for the ReLU network. Do not use weight decay.)