1. (300) Neural Networks and Backpropagation
Write a program to implement gradient descent for a 2 input (d
(0) = 2), m-hidden unit (d
(1) = m),
1 output sigmoidal neural network (L = 2). For the output node transformation, allow for
both S(s) = s and θ(s) = tanh(s). Implement gradient decent on the squared error Ein(w) =
n=1 (h(xn, w) − yn)
, and check your gradient calculation as follows:
(a) Use a network with m = 2. Set all the weights to 0.25 and consider a data set with 1 point:
; y = 1. For both the identity and tanh(·) output node transformation functions,
obtain the gradient of Ein(w) using the backpropagation algorithm. Report this result –
there should be as many numbers as parameters in this network.
(b) Now, obtain the gradient numerically by peturbing each weight in turn by 0.0001. Report
this result. (Hint: If this result is not similar to your previous result then there is something
wrong with your backpropagtion gradient calculation.)
2. (600) Neural Network for Digits
Use your neural network implementation with 10 hidden units to build a classifier for separating
the digit ‘1’ from ’not 1’ . Use the two features and data you developed in a previous assignment
and the 300 training data points you selected in assignment #9.
Randomly initialize the weights to small values. For this problem, train with the linear output
transformation function but use the sign threshold on the output node for actually classifying
(a) Plot Ein(w) versus iteration for the variable learning rate gradient descent heuristic and
2 × 106
iterations, with S(x) = x. Show the decision boundary for the resulting classifier.
(b) Now use weight decay with λ = 0.01/N and use variable learning rate gradient descent to
minimize the augmented error. Show the decision boundary for the resulting classifier.
(c) Now use early stopping with a validation set of size 50 and training set of size 250, and show
the decision boundary for the resulting classifier that had minimum validation error.
3. (300) Support Vector Machines
For this problem, we will use the data set consisting of the 2 points x1 = (1, 0), y1 = +1 and
x2 = (−1, 0), y2 = −1.
(a) Show that for these 2 data points, the optimal separating hyperplane (with maximum cushion) is just the ‘plane’ that is the perpendicular bisector of the line segment joining the two
points. In our case, what is the equation of the optimal hyperplane?
(b) Now consider a transformation to a more “complicated” Z–space. The transformation is
1 − x2
i. What are the data points in this space?
ii. Construct the optimal hyperplane in this space (i.e., what is the equation for the hyperplane in terms of the Z–coordinates).
To classify a point in X-space, one first transforms the point into Z-space. One then classifies
the point in Z-space using the optimal hyperplane in Z-space.
(c) Plot (in X–space) the decision boundary for the optimal hyperplane constructed using the
data in X–space (from part (a)). On the same plot, plot the decision boundary you would
observe in X-space if you classified X-space points by first transforming to Z-space, and
then classifying according to the optimal hyperplane constructed using the data in Z–space
(this decision boundary will not be a line!).
(d) A kernel function, K(x, y), is a function of two vectors in X-space defined by K(x, y) =
z(x)· z(y), where z(x) and z(y) are the transformed x and y into Z-space. In other words,
the kernel function computes the dot product of the transformed vectors. Give an expression
for the kernel function in terms of the components of x and y.
(e) Using this kernel function, give an explicit functional form for the classifier in the X-space.
4. (600) SVM with digits data
Implement the non-separable SVM using your two features for the 300 training points you selected
in assignment #9. Use the 8th order polynomial kernel,
) = (1 + x
(a) In the non-separable case, you need to choose the value for C > 0 (the ’regularization’
parameter). Show the decision boundary for a small and large choice of C. Use your own
judgement to determine small and large.
(b) Explain the ‘complexity’ of the decision boundaries you observe in part (a) with respect to
the choice of C.
(c) Using a grid of values for C between your small and large values, use cross validation to
pick a good value of C, one that achieves minimum cross validation error. Show the decision
boundary for the resulting classifier and give its test error.
5. (200) Compare Methods: Linear, k-NN, RBF-network, Neural Network, SVM
Compare the final test error from your attempts to solve the digits problem:
(i) Regularized linear model with 8th order polynomial transform and λ selected by CV.
(ii) k-NN rule with k selected by CV.
(iii) RBF-network with number of centers selected by CV.
(iv) Neural network with early stopping.
(v) SVM with 8th order polynomial kernel and C selected by CV.
Make some intelligent comments.