## Description

Problem 2.1 (Bayesian analysis of the exponential distribution)

The lifetime $X$ of a machine is modeled by an exponential distribution with unknown rate parameter $\theta$.

The likelihood is $p(x \mid \theta) = \theta e^{-\theta x}$ for $x \ge 0$, $\theta > 0$.

(a) Show that the MLE is $\hat{\theta} = 1/\bar{x}$, where $\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$.

(b) Suppose we observe $X_1 = 5$, $X_2 = 6$, $X_3 = 4$ (the lifetimes, in years, of 3 different iid machines). What is the MLE given this data?
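As a numerical sanity check for parts (a) and (b), the closed-form MLE $1/\bar{x}$ can be compared against a crude grid search over the log-likelihood. This is only a sketch; the data values are the ones given in part (b):

```python
import math

# Observed lifetimes (in years) from part (b).
data = [5.0, 6.0, 4.0]

def log_likelihood(theta, xs):
    """Exponential log-likelihood: sum over i of log(theta) - theta * x_i."""
    return sum(math.log(theta) - theta * x for x in xs)

# Closed-form MLE from part (a): theta_hat = 1 / (sample mean).
theta_mle = 1.0 / (sum(data) / len(data))
print(theta_mle)  # 0.2  (the mean lifetime is 5 years)

# A crude grid search confirms the closed form maximizes the likelihood.
grid = [0.001 * k for k in range(1, 2000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, data))
print(round(theta_grid, 3))  # 0.2
```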

(c) Assume that an expert believes $\theta$ should have a prior

$$p(\theta) = \text{Expon}(\theta \mid \lambda)$$

Choose the prior parameter, call it $\hat{\lambda}$, such that $E[\theta] = 1/3$. Hint: recall that the Gamma distribution has the form

$$\text{Ga}(\theta \mid a, b) \propto \theta^{a-1} e^{-\theta b}$$

and its mean is $a/b$.

(d) What is the posterior, $p(\theta \mid \mathcal{D}, \hat{\lambda})$?

(e) Is the exponential prior conjugate to the exponential likelihood?

(f) What is the posterior mean, $E[\theta \mid \mathcal{D}, \hat{\lambda}]$?

(g) Explain why the MLE and posterior mean differ. Which is more reasonable in this example?
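Once parts (c)–(f) have been worked out, a grid-based numerical posterior is a useful check on the closed-form answers. The sketch below assumes $\hat{\lambda} = 3$ (an $\text{Expon}(\lambda)$ prior has mean $1/\lambda$, so $E[\theta] = 1/3$ forces this); it computes the posterior mean purely numerically, without assuming any conjugacy result:

```python
import math

data = [5.0, 6.0, 4.0]   # observations from part (b)
lam = 3.0                # prior rate from part (c): Expon mean is 1/lam = 1/3

def unnorm_posterior(theta):
    """Expon(theta | lam) prior times the exponential likelihood (unnormalized)."""
    prior = lam * math.exp(-lam * theta)
    lik = 1.0
    for x in data:
        lik *= theta * math.exp(-theta * x)
    return prior * lik

# Riemann sum on a fine grid; the posterior decays fast, so theta in (0, 5] suffices.
h = 1e-4
grid = [h * k for k in range(1, 50000)]
weights = [unnorm_posterior(t) for t in grid]
Z = sum(weights) * h
post_mean = sum(t * w for t, w in zip(grid, weights)) * h / Z
print(round(post_mean, 4))
```

Comparing this number with the analytical posterior mean from part (f), and with the MLE from part (b), makes the shrinkage effect asked about in part (g) concrete.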

Problem 2.2 (Sufficient statistics for online linear regression)

Consider fitting the model $\hat{y} = w_0 + w_1 x$ using least squares. Unfortunately we did not keep the original data, $x_i, y_i$, but we do have the following functions (statistics) of the data:

$$\bar{x}^{(n)} = \frac{1}{n}\sum_{i=1}^n x_i, \quad \bar{y}^{(n)} = \frac{1}{n}\sum_{i=1}^n y_i$$

$$C_{xx}^{(n)} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2, \quad C_{xy}^{(n)} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}), \quad C_{yy}^{(n)} = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2 \tag{1}$$

(a) What is the minimal set of statistics that we need to estimate $w_1$?


(b) What is the minimal set of statistics that we need to estimate $w_0$?

(c) Suppose a new data point $(x_{n+1}, y_{n+1})$ arrives, and we want to update our sufficient statistics without looking at the old data, which we have not stored. (This is useful for online learning.) Show that we can do this for $\bar{x}$ as follows:

$$\bar{x}^{(n+1)} \triangleq \frac{1}{n+1}\sum_{i=1}^{n+1} x_i = \frac{1}{n+1}\left(n\bar{x}^{(n)} + x_{n+1}\right) = \bar{x}^{(n)} + \frac{1}{n+1}\left(x_{n+1} - \bar{x}^{(n)}\right)$$

This has the form: new estimate is old estimate plus a correction. We see that the size of the correction diminishes over time (i.e., as we get more samples). Derive a similar expression to update $\bar{y}$.
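The recursive mean update in (c) is easy to verify numerically against the batch mean; a minimal sketch on arbitrary made-up data:

```python
def update_mean(mean_n, n, x_new):
    """Online mean update: new estimate = old estimate + diminishing correction."""
    return mean_n + (x_new - mean_n) / (n + 1)

# Check against the batch mean on some arbitrary data.
xs = [2.0, -1.0, 4.0, 4.0, 0.5]
mean, n = 0.0, 0
for x in xs:
    mean = update_mean(mean, n, x)
    n += 1
print(mean)  # matches sum(xs) / len(xs)
assert abs(mean - sum(xs) / len(xs)) < 1e-12
```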

(d) Show that one can update $C_{xy}^{(n+1)}$ recursively using

$$C_{xy}^{(n+1)} = \frac{1}{n+1}\left[x_{n+1}y_{n+1} + nC_{xy}^{(n)} + n\bar{x}^{(n)}\bar{y}^{(n)} - (n+1)\bar{x}^{(n+1)}\bar{y}^{(n+1)}\right]$$

Derive a similar expression to update $C_{xx}$.
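Likewise, the $C_{xy}$ recursion in (d) can be checked against the batch definition in Equation (1) (with the $1/n$ normalization); a sketch on made-up data:

```python
def batch_cxy(xs, ys):
    """C_xy as defined in Equation (1), computed from the full data."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 4.0, 7.0]
ys = [1.5, 0.0, 3.0, 2.0]

# Online state: count, running means, and running C_xy.
n, xbar, ybar, cxy = 0, 0.0, 0.0, 0.0
for x, y in zip(xs, ys):
    xbar_new = xbar + (x - xbar) / (n + 1)   # mean update from part (c)
    ybar_new = ybar + (y - ybar) / (n + 1)
    # Recursive C_xy update from part (d).
    cxy = (x * y + n * cxy + n * xbar * ybar
           - (n + 1) * xbar_new * ybar_new) / (n + 1)
    xbar, ybar, n = xbar_new, ybar_new, n + 1

print(cxy)
assert abs(cxy - batch_cxy(xs, ys)) < 1e-12
```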

Problem 2.3 (Symmetric version of $\ell_2$ regularized multinomial logistic regression)

Multiclass logistic regression has the form

$$p(y = c \mid x, W) = \frac{\exp\left(w_{c0} + w_c^T x\right)}{\sum_{k=1}^C \exp\left(w_{k0} + w_k^T x\right)} \tag{2}$$

where $W$ is a $(D+1) \times C$ weight matrix. We can arbitrarily define $w_c = 0$ for one of the classes, say $c = C$, since $p(y = C \mid x, W) = 1 - \sum_{c=1}^{C-1} p(y = c \mid x, W)$. In this case, the model has the form

$$p(y = c \mid x, W) = \frac{\exp\left(w_{c0} + w_c^T x\right)}{1 + \sum_{k=1}^{C-1} \exp\left(w_{k0} + w_k^T x\right)}$$

If we don't "clamp" one of the vectors to some constant value, the parameters will be unidentifiable. However, suppose we don't clamp $w_c = 0$, so we are using Equation (2), but we add $\ell_2$ regularization by optimizing

$$\sum_{i=1}^N \log p(y_i \mid x_i, W) - \lambda \sum_{c=1}^C \|w_c\|_2^2$$

Show that at the optimum we have $\sum_{c=1}^C \hat{w}_{cj} = 0$ for $j = 1 : D$. (For the unregularized $\hat{w}_{c0}$ terms, we still need to enforce that $\hat{w}_{C0} = 0$ to ensure identifiability of the offset.)
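The unidentifiability claim is easy to see numerically: shifting every $w_c$ by a common vector (and every offset $w_{c0}$ by a common scalar) leaves the class probabilities in Equation (2) unchanged, because the shift cancels in the ratio. A minimal sketch with made-up numbers:

```python
import math

def softmax_probs(W, b, x):
    """p(y = c | x, W) as in Equation (2); W[c] is w_c, b[c] is w_c0."""
    scores = [b[c] + sum(w * xd for w, xd in zip(W[c], x)) for c in range(len(W))]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

x = [0.5, -1.0]
W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
b = [0.1, -0.2, 0.0]

# Shift every w_c by the same vector v and every offset by the same scalar s.
v, s = [10.0, -4.0], 2.5
W_shift = [[w + dv for w, dv in zip(wc, v)] for wc in W]
b_shift = [bc + s for bc in b]

p1 = softmax_probs(W, b, x)
p2 = softmax_probs(W_shift, b_shift, x)
print(p1)
print(p2)  # same probabilities: W is unidentifiable without clamping or regularization
```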

Problem 2.4 (SVM with Gaussian kernel)

Consider the task of training a support vector machine using the Gaussian kernel K(x, z) =

exp

−kx − zk

2/τ 2

. We will show that as long as there are no two identical points in the

training set, we can always find a value for the bandwidth parameter τ such that the SVM

achieves zero training erro

(a) Recall from class that the decision function learned by the support vector machine can be written as

$$f(x) = \sum_{i=1}^m \alpha_i y^{(i)} K\left(x^{(i)}, x\right) + b.$$

Assume that the training data $\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(m)}, y^{(m)}\right)$ consists of points which are separated by at least a distance of $\epsilon$; that is, $\left\|x^{(j)} - x^{(i)}\right\| \ge \epsilon$ for any $i \ne j$. Find values for the set of parameters $\{\alpha_1, \ldots, \alpha_m, b\}$ and Gaussian kernel width $\tau$ such that $x^{(i)}$ is correctly classified, for all $i = 1, \ldots, m$. [Hint: Let $\alpha_i = 1$ for all $i$ and $b = 0$. Now notice that for $y \in \{-1, +1\}$ the prediction on $x^{(i)}$ will be correct if $\left|f\left(x^{(i)}\right) - y^{(i)}\right| < 1$, so find a value of $\tau$ that satisfies this inequality for all $i$.]
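Following the hint, one can check numerically that with $\alpha_i = 1$, $b = 0$, and a sufficiently small $\tau$, the decision function recovers every training label. This sketch uses hypothetical 1-D data whose labels are deliberately not linearly separable:

```python
import math

def f(x, X, y, tau):
    """SVM decision function with alpha_i = 1 and b = 0, per the hint."""
    return sum(yi * math.exp(-abs(x - xi) ** 2 / tau ** 2) for xi, yi in zip(X, y))

# Hypothetical 1-D training set; minimum pairwise distance eps = 1.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [+1, -1, +1, +1, -1]
eps = 1.0

tau = 0.1 * eps  # small bandwidth: cross terms exp(-eps^2/tau^2) become negligible
errors = sum(1 for xi, yi in zip(X, y)
             if math.copysign(1, f(xi, X, y, tau)) != yi)
print(errors)  # 0

# And indeed |f(x_i) - y_i| < 1 holds at every training point.
assert all(abs(f(xi, X, y, tau) - yi) < 1 for xi, yi in zip(X, y))
```

Shrinking $\tau$ further only tightens the bound, since each off-diagonal kernel term is at most $\exp(-\epsilon^2/\tau^2)$.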

(b) Suppose we run an SVM with slack variables using the parameter $\tau$ you found in part (a). Will the resulting classifier necessarily obtain zero training error? Why or why not? A short explanation (without proof) will suffice.