## Description

1. You’re a Seahawks fan, and the team is six weeks into its season. The number touchdowns scored in each

game so far are given below:

[1, 3, 3, 0, 1, 5].

Let’s call these scores x1, . . . , x6. Based on your (assumed iid) data, you’d like to build a model to understand

how many touchdowns the Seahaws are likely to score in their next game. You decide to model the number of

touchdowns scored per game using a Poisson distribution. The Poisson distribution with parameter λ assigns

every non-negative integer x = 0, 1, 2, . . . a probability given by

Poi(x|λ) = e

−λ λ

x

x!

.

So, for example, if λ = 1.5, then the probability that the Seahawks score 2 touchdowns in their next game is

e

−1.5 ×

1.5

2

2! ≈ 0.25. To check your understanding of the Poisson, make sure you have a sense of whether raising

λ will mean more touchdowns in general, or fewer.

a. [5 points] Derive an expression for the maximum-likelihood estimate of the parameter λ governing the

Poisson distribution, in terms of your touchdown counts x1, . . . , x6. (Hint: remember that the log of the

likelihood has the same maximum as the likelihood function itself.)

b. [5 points] Given the touchdown counts, what is your numerical estimate of λ?

2. [10 points] In World War 2 the Allies attempted to estimate the total number of tanks the Germans had

manufacturered by looking at the serial numbers of the German tanks they had destroyed. The idea was that if

there were n total tanks with serial numbers {1, . . . , n} then its resonable to expect the observed serial numbers

of the destroyed tanks consitututed a uniform random sample (without replacement) from this set. The exact

maximum likelihood estimator for this so-called German tank problem is non-trivial and quite challenging to

work out (try it!). For our homework, we will consider a much easier problem with a similar flavor.

Let x1, . . . , xn be independent, uniformly distributed on the continuous domain [0, θ] for some θ. What is the

Maximum likelihood estimate for θ?

1

Overfitting

3. Suppose we obtain N labeled samples {(xi

, yi)}

N

i=1 from our underlying distribution D. Suppose we break

this into Ntrain and Ntest samples for our training and test set. Recall our definition of the true least squares

error

(f) = E(x,y)∼D[(f(x) − y)

2

]

(the subscript (x, y) ∼ D makes clear that our input-output pairs are sampled according to D). Our training

and test losses are defined as:

btrain(f) = 1

Ntrain

X

(x,y)∈Training Set

(f(x) − y)

2

btest(f) = 1

Ntest

X

(x,y)∈Test Set

(f(x) − y)

2

We then train our algorithm (say linear least squares regression) using the training set to obtain an fb.

a. [3 points] (bias: the test error) For all fixed f (before we’ve seen any data) show that

Etrain[btrain(f)] = Etest[btest(f)] = (f).

Conclude by showing that the test error is an unbiased estimate of our true error. Specifically, show that:

Etest[btest(fb)] = (fb)

b. [4 points] (bias: the train/dev error) Is the above equation true (in general) with regards to the training

loss? Specifically, does Etrain[btrain(fb)] equal Etrain[(fb)]? If so, why? If not, give a clear argument as to

where your previous argument breaks down.

c. [8 points] Let F = (f1, f2, . . .) be a collection of functions and let fbtrain minimize the training error such

that btrain(fbtrain) ≤ btrain(f) for all f ∈ F. Show that

Etrain[btrain(fbtrain)] ≤ Etrain,test[btest(fbtrain)].

(Hint: note that

Etrain,test[btest(fbtrain)] = X

f∈F

Etrain,test[btest(f)1{fbtrain = f}]

=

X

f∈F

Etest[btest(f)]Etrain[1{fbtrain = f}] = X

f∈F

Etest[btest(f)]Ptrain(fbtrain = f)

where the second equality follows from the independence between the train and test set.)

Bias Variance tradeoff

4. For i = 1, . . . , n let xi = i/n and yi = f(xi) + i where i ∼ N (0, σ2

) for some unknown f we wish to

approximate at values {xi}

n

i=1. We will approximate f with a step function estimator. For some m ≤ n such

that n/m is an integer define the estimator

fbm(x) =

n/m

X

j=1

cj1{x ∈

(j − 1)m

n

,

jm

n

} where cj =

1

m

X

jm

i=(j−1)m+1

yi

.

Note that this estimator just partitions {1, . . . , n} into intervals {1, . . . , m}, {m + 1, . . . , 2m}, . . . , {n − m +

1, . . . , n} and predicts the average of the observations within each interval (see Figure 1).

By the bias-variance decomposition at some xi we have

E

h

(fbm(xi) − f(xi))2

i

= (E[fbm(xi)] − f(xi))2

| {z }

Bias2

(xi)

+ E

h

(fbm(xi) − E[fbm(xi)])2

i

| {z }

Variance(xi)

2

Figure 1: Step function estimator with n = 256, m = 16, and σ

2 = 1.

a. [5 points] Intuitively, how do you expect the bias and variance to behave for small values of m? What

about large values of m?

b. [5 points] If we define ¯f

(j) =

1

m

Pjm

i=(j−1)m+1 f(xi) and the average bias-squared as 1

n

Pn

i=1(E[fbm(xi)] −

f(xi))2

, show that

1

n

Xn

i=1

(E[fbm(xi)] − f(xi))2 =

1

n

n/m

X

j=1

X

jm

i=(j−1)m+1

(

¯f

(j) − f(xi))2

c. [5 points] If we define the average variance as E

h

1

n

Pn

i=1(fbm(xi) − E[fbm(xi)])2

i

, show (both equalities)

E

”

1

n

Xn

i=1

(fbm(xi) − E[fbm(xi)])2

#

=

1

n

n/m

X

j=1

mE[(cj − ¯f

(j)

)

2

] = σ

2

m

d. [15 points] Let n = 256, σ

2 = 1, and f(x) = 4 sin(πx) cos(6πx2

). For values of m = 1, 2, 4, 8, 16, 32 plot

the average empirical error 1

n

Pn

i=1(fbm(xi) − f(xi))2 using randomly drawn data as a function of m on

the x-axis. On the same plot, using parts b and c of above, plot the average bias-squared, the average

variance, and their sum (the average error). Thus, there should be 4 lines on your plot, each described in

a legend.

e. (Extra credit [5 points] ) By the Mean-Value theorem we have that mini=(j−1)m+1,…,jm f(xi) ≤ ¯f

(j) ≤

maxi=(j−1)m+1,…,jm f(xi). Suppose f is L-Lipschitz so that |f(xi)−f(xj )| ≤ L

n

|i−j| for all i, j ∈ {1, . . . , n}

for some L < 0. Show that the average bias-squared is O( L 2m2 n2 ). Using the expression for average variance above, the total error behaves like O( L 2m2 n2 + 1 m ). Minimize this expression with respect to m. Does this value of m, and the total error when you plug this value of m back in, behave in an intuitive way with respect to n, L, σ 2 ? It turns out that this simple estimator (with the optimized choice of m) obtains the best achievable error rate up to a universal constant in this setup for this class of L-Lipschitz functions (see Tsybakov’s Introduction to Nonparametric Estimation for details). This setup of each xi deterministically placed at i/n is a good approximation for the more natural setting where each xi is drawn uniformly at random from [0, 1]. In fact, one can redo this problem and obtain nearly identical conclusions, but the calculations are messier. Ridge Regression

5. [10 points] In this problem we will study the behavior of ridge regression when the number of training examples is about the same number of dimensions. Given training data {(xi , yi)} n i=1 for xi ∈ R d and yi ∈ R, 3 recall that the ridge regression solution with parameter λ is equal to wb = arg min w Xn i=1 (x T i w − yi) 2 + λkwk 2 2 . In matrix notation this has the closed form solution wb = (XT X + λI) −1XT y. Usually we don’t like to use inverses but for this problem you may use this solution because we will choose the sample sizes relatively small.

First, generate some data: import numpy as np train_n = 100 test_n = 1000 d = 100 X_train = np.random.normal(0,1, size=(train_n,d)) a_true = np.random.normal(0,1, size=(d,1)) y_train = X_train.dot(a_true) + np.random.normal(0,0.5,size=(train_n,1)) X_test = np.random.normal(0,1, size=(test_n,d)) y_test = X_test.dot(a_true) + np.random.normal(0,0.5,size=(test_n,1)) For interpretability, consider the normalized training error kXtrainw−ytraink kytraink and test error kXtestw−ytestk kytestk . Note that for a trivial solution of w = 0, the all zeros vector, these normalized errors would equal 1. Using the closed-form solution of above, plot the normalized training error and normalized test error on the y-axis for λ = {0.0001, 0.001, 0.01, 0.1, 1, 10, 100} on the x-axis.

Because each draw of data will give you a different curve, repeat the experiment 30 times and average the curves from the trials to produce a plot. As λ grows, provide an explanation for why the training and test error behaves as it does. Ridge Regression on MNIST

6. In this problem we will implement a least squares classifier for the MNIST data set. The task is to classify handwritten images of numbers between 0 to 9. You are NOT allowed to use any of the prebuilt classifiers in sklearn. Feel free to use any method from numpy or scipy. Remember: if you are inverting a matrix in your code, you are probably doing something wrong (Hint: look at scipy.linalg.solve). Get the data from https://pypi.python.org/pypi/python-mnist. Load the data as follows: from mnist import MNIST def load_dataset(): mndata = MNIST(’./data/’) X_train, labels_train = map(np.array, mndata.load_training()) X_test, labels_test = map(np.array, mndata.load_testing()) X_train = X_train/255.0 X_test = X_test/255.0 Each example has features xi ∈ R d (with d = 28 ∗ 28 = 784) and label zj ∈ {0, . . . , 9}.

You can visualize a single example xi with imshow after reshaping it to its original 28 × 28 image shape (and noting that the label zj is accurate). We wish to learn a predictor fb that takes as input a vector in R d and outputs an index in {0, . . . , 9}. We define our training and testing classification error on a predictor f as btrain(f) = 1 Ntrain X (x,z)∈Training Set 1{f(x) 6= z} btest(f) = 1 Ntest X (x,z)∈Test Set 1{f(x) 6= z} 4 We will use one-hot encoding of the labels, i.e. of (x, z) the original label z ∈ {0, . . . , 9} is mapped to the standard basis vector ez where ez is a vector of all zeros except for a 1 in the zth position. We adopt the notation where we have n data points in our training objective with features xi ∈ R d and label one-hot encoded as yi ∈ {0, 1} k where in this case k = 10 since there are 10 digits.

a. [5 points] In this problem we will choose a linear classifier to minimize the regularized least squares objective: Wc = argminW∈Rd×k Xn i=0 kWT xi − yik 2 2 + λkWk 2 F Note that kWkF corresponds to the Frobenius norm of W, i.e. kWk 2 F = Pd i=1 Pk j=1 W2 i,j . To classify a point xi we will use the rule arg maxj=0,…,9 e T j WcT xi . Note that if W = w1 . . . wk then Xn i=0 kWT xi − yik 2 2 + λkWk 2 F = X k j=0 “Xn i=1 (e T j WT xi − e T j yi) 2 + λkW ejk 2 # = X k j=0 “Xn i=1 (w T j xi − e T j yi) 2 + λkwjk 2 # = X k j=0 kXwj − Y ejk 2 + λkwjk 2 where X = x1 . . . xn >

∈ R

n×d and Y =

y1 . . . yn

>

∈ R

n×k

. Show that

Wc = (XT X + λI)

−1XT Y

b. [8 points] Code up a function called train that takes as input X ∈ R

n×d

, Y ∈ {0, 1}

n×k

, λ > 0, and

returns Wc. Code up a function called predict that takes as input W ∈ R

d×k

, X0 ∈ R

m×d and returns

an m-length vector with the ith entry equal to arg maxj=0,…,9 e

T

j WT x

0

i where x

0

i

is a column vector representing the ith example from X0

.

Train Wc on the MNIST training data with λ = 10−4 and make label predictions on the test data. What

is the training and testing error (they should both be about 15%)?

c. [10 points] We just fit a classifier that was linear in the pixel intensities to the MNIST data. For classification of digits the raw pixel values are very, very bad features: it’s pretty hard to separate digits with linear

functions in pixel space. The standard solution to the this is to come up with some transform h : R

d → R

p

of the original pixel values such that the transformed points are (more easily) linearly separable. In this

problem, you’ll use the feature transform:

h(x) = cos(Gx + b).

where G ∈ R

p×d

, b ∈ R

p

, and the cosine function is applied elementwise. We’ll choose G to be a random

matrix, with each entry sampled i.i.d. from a Gaussian with mean µ = 0 and variance σ

2 = 0.1, and b to

be a random vector sampled i.i.d. from the uniform distribution on [0, 2π]. The big question is: how do

we choose p? Cross-validation, of course!

Randomly partition your training set into proportions 80/20 to use as a new training set and validation

set, respectively. Using the train function you wrote above, train a Wcp

for different values of p and plot

the classification training error and validation error on a single plot with p on the x-axis. Be careful, your

computer may run out of memory and slow to a crawl if p is too large (p ≤ 6000 should fit into 4 GB of

memory that is a minimum for most computers, but if you’re having trouble you can set p in the several

hundreds). You can use the same value of λ as above but feel free to study the effect of using different

values of λ and σ

2

for fun.

5

d. [2 points] Instead of reporting just the test error, which is an unbiased estimate of the true error, we

would like to report a confidence interval around the test error that contains the true error.

Lemma 1. (Hoeffding’s inequality) Fix δ ∈ (0, 1). If for all i = 1, . . . , m we have that Xi are i.i.d.

random variables with Xi ∈ [a, b] and E[Xi

] = µ then

P

1

m

Xm

i=1

Xi

!

− µ

≥

r

(b − a)

2 log(2/δ)

2m

!

≤ δ

We will use the above equation to construct a confidence interval around the true classification error

(fb) = Etest[btest(fb)] since the test error btest(fb) is just the average of indicator variables taking values in

{0, 1} corresponding to the ith test example being classified correctly or not, respectively, where an error

happens with probability µ = (fb) = Etest[btest(fb)], the true classification error.

Let pb be the value of p that approximately minimizes the validation error on the plot you just made and

use fb(x) = arg maxj x

T Wcpb

ej to compute the classification test error btest(fb). Use Hoeffding’s inequality,

of above, to compute a confidence interval that contains Etest[btest(fb)] (i.e., the true error) with probability

at least 0.95 (i.e., δ = 0.05). Report btest(fb) and the confidence interval.

6