## Description

1. In class we have derived an explicit formula for least squares regression when the hypothesis class is

the class of linear functions h(x) = θ0 + θ1×1 + θ2×2 + . . . θdxd. However, not all data can be well

modeled by just a linear equation. An alternative, generally richer hypothesis class can be defined by

defining a set of n basis functions {ϕ1(x), . . . , ϕn(x)}, and considering regressors of the form

h(x) = ∑n

i=1

θi ϕi(x).

Now each θi parameter is the coefficient of the corresponding ϕi

in the linear combination of basis

functions. The {ϕi} can really be any collection of functions, but as an illustration in the d = 1 case

you might consider something like a family of Gaussian bumps

ϕi = e

(x−i)

2/(2σ

2

)

i = 1, 2, . . . , n.

In principle one could also consider a infinite sequence of basis functions, but for simplicity here we

only consider a finite set. As before, let the loss function be the squared error loss

J(θ) = 1

2

∑m

j=1

(h(xj ) − yj )

2

.

Show that similarly to the linear case, the optimal solution can be found in the form

θ = (A

⊤A)

−1A

⊤⃗y,

where ⃗y = (y1, . . . , ym)

⊤, and derive the form of the matrix A. This problem illustrates that the

least squares technique (including its SGD version) has much broader applicability than just linear

regression.

2. In class we have derived that given data {(x1, y1),(x2, y2), . . . ,(xm, ym)}, the log-likelihood for logistic

regression is

ℓ(θ) = ∑m

i=1

[

ui

log(h(xi)) + (1 − ui) log(1 − h(xi))]

, (1)

where h(x) is the logistic function

h(x) = 1

1 + e−θ·x

= g(θ · x) g(z) = 1

1 + e−z

,

and the ui

’s are just the 0/1 analogs of the yi

’s, i.e., ui = (1 + yi)/2. There is no closed form solution

for the MLE of logistic regression.

(a) For simplicity consider (1) for a single data point (x, u). Derive the form of the gradient ∇ℓ(θ).

The formula g

′

(z) = g(z) (1 − g(z)) that we found in class might come in handy.

(b) Conclude that the SGD step based on a single datapoint (xi

, ui) in the dataset is

θ ← θ − α

[

(h(xi) − ui) xi

]

.

3. An online algorithm is said to be conservative if it changes its hypothesis only when it makes a mistake.

Let C be a concept class and A be a (not necessarily conservative) online algorithm which has a finite

mistake bound M on C. Prove that there is a conservative algorithm A′

for C which also has mistake

bound M.

1

4. Recall that the k–class perceptron maintains k separate weight vectors w1, w2, . . . , wk, and predicts

yb = arg max

i∈{1,2,…,k}

(wi

· x).

If this prediction is incorrect, and the correct label should have been y, it updates the weights by

setting

wy ← wy + x/2

wyb ← wyb − x/2.

Let {(x1, y1),(x2, y2), . . .} be the training data. Assume that ∥ xt ∥ = 1 for all t, and that this dataset

is separable with a margin δ, which in this case means that there exist unit vectors v1, v2, . . . , vk such

that for each example (xt, yt)

vyt

· xt − vy · xt ≥ 2δ y ∈ {1, 2, . . . , k} \ {yt} .

(a) Show that in the k = 2 case this notion of margin is equivalent to the margin that we saw in class.

(b) In the k = 2 case we saw that the number of mistakes that the perceptron can make is upper

bounded by 1/δ2

. Derive a similar bound for the k = 3 case. Hint: Two quantities that you may

wish to consider are a = v1 · w1 + v2 · w2 + v3 · w3 and b = ∥ w1 ∥

2 + ∥ w2 ∥

2 + ∥ w3 ∥

2

. Part of

your derivation might involve showing that a ≤ 3

√

b.

5. The file train35.digits contains 2000 images of 3’s and 5’s from the famous MNIST database of

handwritten digits in text format. The size of each image is 28 × 28 pixels. Each row of the file is a

representation one image, with the 28 × 28 pixels flattened into a vector of size 784. A value of 1 for a

pixel represents black, and value of 0 represents white. The corresponding row of train35.labels is

the class label: +1 for the digit 3, or −1 for the digit 5. The file test35.digits contains 200 testing

images in the same format as train35.digits.

Implement the perceptron algorithm and use it to label each test image in test35.digits. Submit the

predicted labels in a file named test35.predictions. In the lectures, the perceptron was presented

as an online algorithm. To use the perceptron as a batch algorithm, train it by simply feeding it

the training set M times. The value of M can be expected to be less than 10, and should be set by

cross validation. Naturally, in this context, the “mistakes” made during training are not really errors.

Nonetheless, it is intructive to see how the frequency of mistakes decreases as the hypothesis improves.

Include in your write-up a plot of the cumulative number of “mistakes” as a function of the number of

examples seen.

Since the data is fairly large, for debugging purposes it might be helpful to run your code on just

subsets of the 2000 training test images. Also, it may be helpful to normalize each example to unit

norm.

2