Description
Problem 2.1 (Bayesian analysis of the exponential distribution)
A lifetime X of a machine is modeled by an exponential distribution with unknown parameter θ.
The likelihood is p(x | θ) = θ e^{−θx} for x ≥ 0, θ > 0.
(a) Show that the MLE is θ̂ = 1/x̄, where x̄ = (1/N) Σ_{i=1}^{N} x_i.
(b) Suppose we observe X1 = 5, X2 = 6, X3 = 4 (the lifetimes (in years) of 3 different iid
machines). What is the MLE given this data?
(c) Assume that an expert believes θ should have a prior
p(θ) = Expon(θ | λ)
Choose the prior parameter, call it λ̂, such that E[θ] = 1/3. Hint: recall that the Gamma distribution has the form
Ga(θ | a, b) ∝ θ^{a−1} e^{−θb}
and its mean is a/b.
(d) What is the posterior, p(θ | D, λ̂)?
(e) Is the exponential prior conjugate to the exponential likelihood?
(f) What is the posterior mean, E[θ | D, λ̂]?
(g) Explain why the MLE and posterior mean differ. Which is more reasonable in this example?
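The short Python sketch below (not part of the original problem) can be used as a numerical sanity check on the answers. It assumes the data from part (b) and the prior rate implied by the hint (an Expon(θ | λ) prior is Ga(θ | 1, λ) with mean 1/λ, so E[θ] = 1/3 corresponds to λ̂ = 3); the posterior mean is computed by grid quadrature rather than from any closed-form result.

```python
import numpy as np

# Lifetimes (in years) from part (b).
x = np.array([5.0, 6.0, 4.0])
N, xbar = len(x), x.mean()

# (a)-(b): the MLE of the rate is 1 / xbar.
theta_mle = 1.0 / xbar
print("MLE:", theta_mle)                        # 0.2

# (c): an Expon(theta | lam) prior has mean 1/lam, so E[theta] = 1/3
# corresponds to lam_hat = 3 (value implied by the hint).
lam_hat = 3.0

# (d)-(f): posterior mean by grid quadrature over theta, without assuming
# any closed-form posterior (a sanity check on the derivation).
theta = np.linspace(1e-6, 5.0, 200_000)
log_post = N * np.log(theta) - theta * x.sum() - lam_hat * theta   # log(likelihood * prior), up to a constant
post = np.exp(log_post - log_post.max())
dt = theta[1] - theta[0]
post /= post.sum() * dt                          # normalize on the grid
print("Posterior mean (numerical):", (theta * post).sum() * dt)
```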
Problem 2.2 (Sufficient statistics for online linear regression)
Consider fitting the model ŷ = w_0 + w_1 x using least squares. Unfortunately we did not keep the original data, x_i, y_i, but we do have the following functions (statistics) of the data:
x̄^{(n)} = (1/n) Σ_{i=1}^{n} x_i ,   ȳ^{(n)} = (1/n) Σ_{i=1}^{n} y_i
C_xx^{(n)} = (1/n) Σ_{i=1}^{n} (x_i − x̄)² ,   C_xy^{(n)} = (1/n) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) ,   C_yy^{(n)} = (1/n) Σ_{i=1}^{n} (y_i − ȳ)²     (1)
(a) What is the minimal set of statistics we need to estimate w_1?
(b) What is the minimal set of statistics we need to estimate w_0?
(c) Suppose a new data point, (x_{n+1}, y_{n+1}), arrives, and we want to update our sufficient statistics without looking at the old data, which we have not stored. (This is useful for online learning.) Show that we can do this for x̄ as follows.
x̄^{(n+1)} ≜ (1/(n + 1)) Σ_{i=1}^{n+1} x_i = (1/(n + 1)) (n x̄^{(n)} + x_{n+1}) = x̄^{(n)} + (1/(n + 1)) (x_{n+1} − x̄^{(n)})
This has the form: new estimate is old estimate plus correction. We see that the size of the
correction diminishes over time (i.e., as we get more samples). Derive a similar expression
to update ȳ.
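A minimal Python sketch of this recursive mean update, checked against the batch mean on synthetic data (the data and variable names are illustrative, not from the problem):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=100)

# Online mean: new estimate = old estimate + correction / (n + 1).
xbar = 0.0
for n, x_new in enumerate(xs):          # n = number of points seen so far
    xbar = xbar + (x_new - xbar) / (n + 1)

print(np.allclose(xbar, xs.mean()))     # True: matches the batch mean
```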
(d) Show that one can update C_xy^{(n+1)} recursively using
C_xy^{(n+1)} = (1/(n + 1)) [ x_{n+1} y_{n+1} + n C_xy^{(n)} + n x̄^{(n)} ȳ^{(n)} − (n + 1) x̄^{(n+1)} ȳ^{(n+1)} ]
Derive a similar expression to update C_xx.
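As a numerical check, the sketch below (again on synthetic data, not from the problem) runs the C_xy recursion alongside the mean updates from part (c) and compares the result with the batch definition in (1):

```python
import numpy as np

rng = np.random.default_rng(1)
xs, ys = rng.normal(size=50), rng.normal(size=50)

# Start the recursions from the first point (n = 1), where Cxy^(1) = 0.
xbar, ybar, Cxy = xs[0], ys[0], 0.0
n = 1
for x_new, y_new in zip(xs[1:], ys[1:]):
    xbar_new = xbar + (x_new - xbar) / (n + 1)
    ybar_new = ybar + (y_new - ybar) / (n + 1)
    # Recursion from part (d).
    Cxy = (x_new * y_new + n * Cxy + n * xbar * ybar
           - (n + 1) * xbar_new * ybar_new) / (n + 1)
    xbar, ybar, n = xbar_new, ybar_new, n + 1

batch_Cxy = np.mean((xs - xs.mean()) * (ys - ys.mean()))
print(np.allclose(Cxy, batch_Cxy))      # True: matches the batch statistic
```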
Problem 2.3 (Symmetric version of ℓ2 regularized multinomial logistic regression)
Multiclass logistic regression has the form
p(y = c | x, W) = exp(w_{c0} + w_c^T x) / Σ_{k=1}^{C} exp(w_{k0} + w_k^T x)     (2)
where W is a (D + 1) × C weight matrix. We can arbitrarily define w_c = 0 for one of the classes, say c = C, since p(y = C | x, W) = 1 − Σ_{c=1}^{C−1} p(y = c | x, W). In this case, the model has the form
p(y = c | x, W) = exp(w_{c0} + w_c^T x) / (1 + Σ_{k=1}^{C−1} exp(w_{k0} + w_k^T x))
If we don’t “clamp” one of the vectors to some constant value, the parameters will be unidentifiable. However, suppose we don’t clamp w_c = 0, so we are using (2), but we add ℓ2 regularization by optimizing
Σ_{i=1}^{N} log p(y_i | x_i, W) − λ Σ_{c=1}^{C} ||w_c||₂²
Show that at the optimum we have Σ_{c=1}^{C} ŵ_{cj} = 0 for j = 1 : D. (For the unregularized ŵ_{c0} terms, we still need to enforce that w_{C0} = 0 to ensure identifiability of the offset.)
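The claim can also be checked numerically. The sketch below is only an illustration, not a prescribed solution method: it fits the unclamped, ℓ2-regularized softmax model by plain gradient ascent on a small synthetic dataset (the sizes, penalty strength lam, step size, and iteration count are all arbitrary choices) and then prints Σ_c ŵ_{cj} for each feature j.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, C, lam = 60, 3, 4, 0.1

# Tiny synthetic classification problem (illustrative only).
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)
Xa = np.hstack([np.ones((N, 1)), X])          # prepend a column of 1s for the offsets w_{c0}
Y = np.eye(C)[y]                              # one-hot labels

W = np.zeros((D + 1, C))                      # row 0 holds the offsets
for _ in range(20_000):                       # gradient ascent on the penalized log-likelihood
    S = Xa @ W
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)         # softmax probabilities
    grad = Xa.T @ (Y - P)                     # gradient of the log-likelihood
    grad[1:] -= 2 * lam * W[1:]               # l2 penalty on w_c; offsets left unregularized
    W += 1e-3 * grad

print(W[1:].sum(axis=1))                      # ~0 for each feature j = 1..D
```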
Problem 2.4 (SVM with Gaussian kernel)
Consider the task of training a support vector machine using the Gaussian kernel K(x, z) = exp(−||x − z||²/τ²). We will show that as long as there are no two identical points in the training set, we can always find a value for the bandwidth parameter τ such that the SVM achieves zero training error.
(a) Recall from class that the decision function learned by the support vector machine can be written as
f(x) = Σ_{i=1}^{m} α_i y^{(i)} K(x^{(i)}, x) + b.
Assume that the training data (x^{(1)}, y^{(1)}), . . . , (x^{(m)}, y^{(m)}) consists of points which are separated by at least a distance of ε; that is, ||x^{(j)} − x^{(i)}|| ≥ ε for any i ≠ j. Find values for the set of parameters {α_1, . . . , α_m, b} and Gaussian kernel width τ such that x^{(i)} is correctly classified, for all i = 1, . . . , m. [Hint: Let α_i = 1 for all i and b = 0. Now notice that for y ∈ {−1, +1} the prediction on x^{(i)} will be correct if |f(x^{(i)}) − y^{(i)}| < 1, so find a value of τ that satisfies this inequality for all i.]
(b) Suppose we run an SVM with slack variables using the parameter τ you found in part (a). Will the resulting classifier necessarily obtain zero training error? Why or why not? A short explanation (without proof) will suffice.
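For part (a), the construction in the hint can be tried out numerically. The sketch below is only an illustration on synthetic data; the particular bandwidth τ = ε/√(log 2m) is one choice that happens to satisfy the inequality, not necessarily the one you should derive.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 20
X = rng.normal(size=(m, 2))                   # distinct training points
y = rng.choice([-1.0, 1.0], size=m)

def f(z, tau):
    # Decision function with alpha_i = 1 and b = 0, as in the hint.
    K = np.exp(-np.sum((X - z) ** 2, axis=1) / tau ** 2)
    return np.sum(y * K)

# Minimum pairwise distance eps, then a bandwidth small enough that the
# off-diagonal kernel contributions sum to less than 1.
eps = min(np.linalg.norm(X[i] - X[j]) for i in range(m) for j in range(i + 1, m))
tau = eps / np.sqrt(np.log(2 * m))

print(all(abs(f(X[i], tau) - y[i]) < 1 for i in range(m)))   # True: zero training error
```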