## Description

1. (300) Neural Networks and Backpropagation

Write a program to implement gradient descent for a 2-input ($d^{(0)} = 2$), $m$-hidden-unit ($d^{(1)} = m$), 1-output sigmoidal neural network ($L = 2$). For the output node transformation, allow for both $\theta(s) = s$ and $\theta(s) = \tanh(s)$. Implement gradient descent on the squared error

$$E_{\text{in}}(w) = \frac{1}{4N} \sum_{n=1}^{N} \left( h(x_n, w) - y_n \right)^2,$$

and check your gradient calculation as follows:

(a) Use a network with $m = 2$. Set all the weights to 0.25 and consider a data set with 1 point: $x_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $y_1 = 1$. For both the identity and $\tanh(\cdot)$ output node transformation functions, obtain the gradient of $E_{\text{in}}(w)$ using the backpropagation algorithm. Report this result – there should be as many numbers as parameters in this network.

(b) Now, obtain the gradient numerically by perturbing each weight in turn by 0.0001. Report this result. (Hint: if this result is not similar to your previous result, then there is something wrong with your backpropagation gradient calculation.)
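As a concrete illustration of the check in part (b), here is a minimal sketch of the numerical side of the comparison: a forward pass for the 2-input, $m$-hidden-unit network and a symmetric-difference gradient. The weight layout (bias row first) and tanh hidden units are assumptions about your implementation; the backpropagation gradient from part (a) should match these numbers closely.

```python
import numpy as np

def forward(W1, W2, x, out_transform=np.tanh):
    """Forward pass through a 2-input, m-hidden-unit, 1-output network.
    W1: (3, m) weights into the hidden layer (row 0 holds the bias weights).
    W2: (m+1, 1) weights into the output node (row 0 is the bias weight).
    Hidden units use tanh; the output transform is a parameter."""
    x1 = np.concatenate(([1.0], x))            # add bias coordinate
    s1 = W1.T @ x1                             # hidden-layer signals
    x2 = np.concatenate(([1.0], np.tanh(s1)))  # hidden outputs, plus bias
    s2 = W2.T @ x2                             # output signal
    return out_transform(s2)[0]

def Ein(W1, W2, X, y, out_transform=np.tanh):
    """Squared error E_in(w) = (1/4N) * sum_n (h(x_n, w) - y_n)^2."""
    N = len(y)
    h = np.array([forward(W1, W2, x, out_transform) for x in X])
    return np.sum((h - y) ** 2) / (4 * N)

def numerical_gradient(W1, W2, X, y, out_transform=np.tanh, eps=1e-4):
    """Perturb each weight in turn by eps and form the symmetric difference."""
    grads = []
    for W in (W1, W2):
        g = np.zeros_like(W)
        for idx in np.ndindex(W.shape):
            old = W[idx]
            W[idx] = old + eps
            e_plus = Ein(W1, W2, X, y, out_transform)
            W[idx] = old - eps
            e_minus = Ein(W1, W2, X, y, out_transform)
            W[idx] = old                       # restore the weight
            g[idx] = (e_plus - e_minus) / (2 * eps)
        grads.append(g)
    return grads

# Part (a) setup: m = 2, all weights 0.25, one data point x1 = (1, 1), y1 = 1.
W1 = np.full((3, 2), 0.25)
W2 = np.full((3, 1), 0.25)
X, y = np.array([[1.0, 1.0]]), np.array([1.0])
g1, g2 = numerical_gradient(W1, W2, X, y, out_transform=lambda s: s)
```

There are 9 parameters here (6 into the hidden layer, 3 into the output node), so the gradient report should contain 9 numbers.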

2. (600) Neural Network for Digits

Use your neural network implementation with 10 hidden units to build a classifier for separating the digit '1' from 'not 1'. Use the two features and data you developed in a previous assignment and the 300 training data points you selected in assignment #9.

Randomly initialize the weights to small values. For this problem, train with the linear output transformation function, but use the sign threshold on the output node for actually classifying the data.

(a) Plot $E_{\text{in}}(w)$ versus iteration for the variable learning rate gradient descent heuristic and $2 \times 10^6$ iterations, with $\theta(s) = s$. Show the decision boundary for the resulting classifier.

(b) Now use weight decay with $\lambda = 0.01/N$, and use variable learning rate gradient descent to minimize the augmented error. Show the decision boundary for the resulting classifier.

(c) Now use early stopping with a validation set of size 50 and a training set of size 250, and show the decision boundary for the resulting classifier that had minimum validation error.
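The variable learning rate heuristic referenced in parts (a) and (b) can be sketched as follows. The accept/reject rule and the growth/shrink factors ($\alpha = 1.05$, $\beta = 0.7$) are an assumed, commonly used choice rather than something fixed by the assignment, and the toy quadratic error here is only for demonstration:

```python
import numpy as np

def variable_lr_gd(E, grad, w0, eta=0.1, alpha=1.05, beta=0.7, iters=1000):
    """Variable learning rate gradient descent:
    accept a step and grow eta when the error drops;
    otherwise undo the step and shrink eta.
    (alpha and beta are assumed, commonly used constants.)"""
    w, e = w0.copy(), E(w0)
    history = [e]
    for _ in range(iters):
        w_new = w - eta * grad(w)
        e_new = E(w_new)
        if e_new < e:
            w, e = w_new, e_new
            eta *= alpha          # reward: slightly larger step next time
        else:
            eta *= beta           # penalty: step was too big, shrink it
        history.append(e)
    return w, history

# Usage on a toy quadratic error surface with minimum at w = (3, 3):
E = lambda w: np.sum((w - 3.0) ** 2)
grad = lambda w: 2.0 * (w - 3.0)
w_star, hist = variable_lr_gd(E, grad, np.zeros(2))
```

Plotting `hist` against the iteration number gives exactly the $E_{\text{in}}$-versus-iteration curve part (a) asks for, once `E` and `grad` are replaced by your network's error and backpropagation gradient.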

3. (300) Support Vector Machines

For this problem, we will use the data set consisting of the 2 points $x_1 = (1, 0)$, $y_1 = +1$ and $x_2 = (-1, 0)$, $y_2 = -1$.

(a) Show that for these 2 data points, the optimal separating hyperplane (with maximum cushion) is just the 'plane' that is the perpendicular bisector of the line segment joining the two points. In our case, what is the equation of the optimal hyperplane?

(b) Now consider a transformation to a more "complicated" Z-space. The transformation is given by

$$z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} x_1^3 - x_2 \\ x_1 x_2 \end{bmatrix}. \qquad (1)$$

i. What are the data points in this space?


ii. Construct the optimal hyperplane in this space (i.e., what is the equation for the hyperplane in terms of the Z-coordinates)?

To classify a point in X-space, one first transforms the point into Z-space. One then classifies the point in Z-space using the optimal hyperplane in Z-space.

(c) Plot (in X-space) the decision boundary for the optimal hyperplane constructed using the data in X-space (from part (a)). On the same plot, plot the decision boundary you would observe in X-space if you classified X-space points by first transforming to Z-space and then classifying according to the optimal hyperplane constructed using the data in Z-space (this decision boundary will not be a line!).

(d) A kernel function, $K(x, y)$, is a function of two vectors in X-space defined by $K(x, y) = z(x) \cdot z(y)$, where $z(x)$ and $z(y)$ are the transforms of $x$ and $y$ into Z-space. In other words, the kernel function computes the dot product of the transformed vectors. Give an expression for the kernel function in terms of the components of $x$ and $y$.
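A quick way to sanity-check your answer to (d) is to evaluate $z(x) \cdot z(y)$ directly and compare it against your closed-form expression at a few points. The sketch below implements only the definitional side, with the transform from equation (1) hard-coded:

```python
import numpy as np

def z(x):
    """The Z-space transform of equation (1): z = (x1^3 - x2, x1*x2)."""
    x1, x2 = x
    return np.array([x1 ** 3 - x2, x1 * x2])

def kernel(x, y):
    """K(x, y) = z(x) . z(y), evaluated from the definition
    rather than from an expanded closed form."""
    return z(x) @ z(y)

# Spot-check at the data point x1 = (1, 0): z((1, 0)) = (1, 0),
# so the kernel of the point with itself is |z|^2 = 1.
x = np.array([1.0, 0.0])
print(kernel(x, x))   # 1.0
```

Any closed-form expression you derive for (d) should agree with `kernel(x, y)` on every pair of inputs you try.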

(e) Using this kernel function, give an explicit functional form for the classifier in the X-space.

4. (600) SVM with Digits Data

Implement the non-separable SVM using your two features for the 300 training points you selected in assignment #9. Use the 8th-order polynomial kernel

$$K(x, x') = \left( 1 + x^{\mathsf{T}} x' \right)^{8}.$$
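The kernel itself is one line; the sketch below evaluates it for all pairs of points at once, which is the form you will want when building the Gram matrix for the dual quadratic program. The vectorized all-pairs interface is an assumption about how your solver consumes the kernel:

```python
import numpy as np

def poly_kernel(X1, X2, degree=8):
    """8th-order polynomial kernel K(x, x') = (1 + x^T x')^degree,
    evaluated for every pair of rows of X1 (n1 x d) and X2 (n2 x d).
    Returns an (n1, n2) matrix of kernel values."""
    return (1.0 + X1 @ X2.T) ** degree

# The training Gram matrix for the dual QP is then simply:
# G = poly_kernel(X_train, X_train)
```

The same function classifies new points: with dual variables `alpha`, labels `y`, and bias `b`, the hypothesis is the sign of `poly_kernel(X_test, X_train) @ (alpha * y) + b`.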

(a) In the non-separable case, you need to choose the value of the 'regularization' parameter $C > 0$. Show the decision boundary for a small and a large choice of $C$. Use your own judgement to determine small and large.

(b) Explain the 'complexity' of the decision boundaries you observe in part (a) with respect to the choice of $C$.

(c) Using a grid of values for $C$ between your small and large values, use cross validation to pick a good value of $C$, one that achieves minimum cross validation error. Show the decision boundary for the resulting classifier and give its test error.
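For part (c), the cross validation loop is independent of the SVM itself, so it can be written once against any trainer. Below is a minimal $k$-fold sketch; the `train(X, y)` callable returning a `predict` function is an assumed interface for your own SVM code, and the `train_svm` named in the comment is hypothetical, not provided here:

```python
import numpy as np

def cv_error(train, X, y, folds=10, rng=None):
    """Estimate out-of-sample error by k-fold cross validation.
    `train(X, y)` must return a function predict(X) -> labels."""
    rng = np.random.default_rng(0) if rng is None else rng
    idx = rng.permutation(len(y))            # shuffle before splitting
    errors = []
    for chunk in np.array_split(idx, folds):
        mask = np.ones(len(y), dtype=bool)
        mask[chunk] = False                  # hold out this fold
        predict = train(X[mask], y[mask])
        errors.append(np.mean(predict(X[chunk]) != y[chunk]))
    return np.mean(errors)

# Hypothetical usage with an SVM trainer you implement:
# errs = {C: cv_error(lambda Xt, yt: train_svm(Xt, yt, C), X, y)
#         for C in C_grid}
# C_star = min(errs, key=errs.get)
```

Training the final classifier on all 300 points at `C_star` then gives the decision boundary and test error the problem asks for.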

5. (200) Compare Methods: Linear, k-NN, RBF-network, Neural Network, SVM

Compare the final test errors from your attempts to solve the digits problem:

(i) Regularized linear model with 8th order polynomial transform and λ selected by CV.

(ii) k-NN rule with k selected by CV.

(iii) RBF-network with number of centers selected by CV.

(iv) Neural network with early stopping.

(v) SVM with 8th order polynomial kernel and C selected by CV.

Make some intelligent comments.
