## Description

End to End Learning with Regularization and Validation

In this assignment, you will address the learning problem for predicting ‘Digit 1’ from ‘not Digit 1’.

Raw Data −→ Features −→ Learn −→ g −→ Estimate Eout(g)

You will use all the digits available on the class web site and your two features that you constructed in an

earlier assignment. yn = +1 if the data point is ‘Digit 1’ and yn = −1 otherwise.

1. Combine the training and test data into a single data set.

2. Use your feature construction algorithm to create two features for each data point.

3. Normalize your features as follows. For each data point, shift the first feature by a shift and then

rescale the first feature by a scale. You must determine the shift and scale so that the minimum feature

value is −1 and the maximum feature value is +1. Repeat this for your second feature. So now you

have a new pair of features where each feature (for every data point) is in the range [−1, 1]. This

process is an example of input normalization.

4. Randomly select 300 data points for your data set D. Put the remaining data into a test set Dtest.

5. Do not look at Dtest until you have created your final hypothesis g and are ready to estimate Eout(g).

We will treat this problem as a regression problem with real valued targets ±1 until it is time to output our

final classification hypothesis. At that point our final classification function will be sign(g).

The polynomial feature transform is

(x1, x2) → (1, x1, x2, x2

1

, x1x2, x2

2

, x3

1

, x2

1×2, x1x

2

2

, x3

2

, x4

1

, x3

1×2, . . .).

As discussed in class, this would create ‘non-orthogonal’ features, which can cause a problem for algorithms

like the pseudo-inverse regression algorithm if the columns in the data matrix become too dependent. An

‘orthogonal’ polynomial transform is given by

(x1, x2) → (1, L1(x1), L1(x2), L2(x1), L1(x1)L1(x2), L2(x2), L3(x1), L2(x1)L1(x2), . . .),

where in each feature we replace x

k

i with Lk(xi) and Lk(·) is the kth order Legendre transform. See Problem

4.3 in LFD for a recursive expression that defines the Legendre polynomials from which you can develop a

quick algorithm to compute the Lk(x) for any input x.

We will use the pseudo-inverse linear regression algorithm with weight decay regularization for learning,

which corresponds to minimizing Eaug(w) = Ein(w) + λwtw, and Ein(w) is the sum of squared errors.

Recall that the learned linear regression weights are

wreg(λ) = (ZtZ + λI)−1Z

ty,

where Z ∈ R

N×d˜

is the matrix of transformed features and wreg ∈ R

d˜

is the regularized weight vector.

1

Do the following problems to end up with your final hypothesis g.

1. (100) 8th order Feature Transform.

Use the 8th order Legendre polynomial feature transform to compute Z. What are the dimensions of Z?

2. (100) Overfitting.

Give a plot of the decision boundary for the resulting weights without any regularization (λ = 0). Do you

think there is overfitting or underfitting?

3. (100) Regularization.

Give a plot of the decision boundary for the resulting weights with λ = 2. Do you think there is overfitting

or underfitting?

4. (200) Cross Validation.

Use leave-one-out cross validation to estimate Ecv(λ) for

λ ∈ {0, 0.01, 0.02, . . . , 2}.

The upper limit 2 is arbitrary – use your judgement to pick one that is reasonable (you may need to increase

λmax, depending on your features). Ecv(λ) estimates Eout for regularization parameter λ, when learning

with N − 1 data points. You can use the formula in Equation (4.13) of LFD to efficiently compute Ecv(λ).

Plot Ecv(λ) versus λ, and on the same plot, Etest(wreg(λ)) (the regression error on the test set).

Comment on the behavior of Ecv in relation to Etest.

5. (100) Pick λ.

Use the cross validation error to pick the best value of λ, call it λ

∗

. Give a plot of the decision boundary for

the weights wreg(λ

∗

).

6. (100) Estimate Eout.

Use the classification error Etest(wreg(λ

∗

)) to estimate Eout for your final hypothesis g. What is your estimate

of Eout for separating ‘Digit 1’ from ‘Not Digit 1’?

7. (100) Is Ecv biased?

Is Ecv(λ

∗

) and unbiased estimate of Etest(wreg(λ

∗

))? Explain.

8. (200) Data snooping.

Is Etest(wreg(λ

∗

)) an unbiased estimate of Eout(wreg(λ

∗

))? Explain.

If it is not an unbiased estimate, how can what you did be fixed so that it is?

[Hint: Where has there been data snooping?. Data snooping occurs if the data used to estimate the performance of g in any way affected the selection of g as the final hypothesis. When you have figured this one

out, you will have understood one of the most subtle forms of data snooping. ]

2