Description
1. Maximum Likelihood Method: consider $n$ random samples from a multivariate normal distribution, $X_i \in \mathbb{R}^p$, $X_i \sim N(\mu, \Sigma)$, with $i = 1, \dots, n$.
(a) Show the log-likelihood function
$$ \ell_n(\mu, \Sigma) = -\frac{n}{2}\,\mathrm{trace}(\Sigma^{-1} S_n) - \frac{n}{2}\log\det(\Sigma) + C, $$
where $S_n = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)(X_i - \mu)^T$ and the constant $C$ does not depend on $\mu$ and $\Sigma$;
(b) Show that $f(X) = \mathrm{trace}(A X^{-1})$ with $A, X \succ 0$ has the first-order approximation
$$ f(X + \Delta) \approx f(X) - \mathrm{trace}(X^{-1} A X^{-1} \Delta), $$
hence formally $df(X)/dX = -X^{-1} A X^{-1}$ (note: $(I + X)^{-1} \approx I - X$);
(c) Show that $g(X) = \log\det(X)$ with $X \succ 0$ has the first-order approximation
$$ g(X + \Delta) \approx g(X) + \mathrm{trace}(X^{-1}\Delta), $$
hence $dg(X)/dX = X^{-1}$ (note: consider the eigenvalues of $X^{-1/2}\Delta X^{-1/2}$);
(d) Use these formal derivatives with respect to positive semi-definite matrix variables to show that the maximum likelihood estimator of $\Sigma$ is
$$ \hat{\Sigma}^{\mathrm{MLE}}_n = S_n. $$
A reference for (b) and (c) can be found in Convex Optimization, by Boyd and Vandenberghe, examples in Appendix A.4.1 and A.4.3:
https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
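As a sanity check (not a substitute for the derivations), the first-order approximations in (b) and (c) and the claim in (d) can be probed numerically. The NumPy sketch below is a minimal illustration; the helper rand_spd, the test matrices, and the sample size are arbitrary choices of mine, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_spd(p):
    """Random symmetric positive definite test matrix (illustrative helper)."""
    M = rng.standard_normal((p, p))
    return M @ M.T + p * np.eye(p)

p = 5
A, X = rand_spd(p), rand_spd(p)
D = 1e-4 * rand_spd(p)   # small symmetric perturbation Delta

# (b) f(X) = trace(A X^{-1}):  f(X + D) - f(X) should be close to -trace(X^{-1} A X^{-1} D)
f = lambda X: np.trace(A @ np.linalg.inv(X))
print("(b):", f(X + D) - f(X), -np.trace(np.linalg.inv(X) @ A @ np.linalg.inv(X) @ D))

# (c) g(X) = log det(X):  g(X + D) - g(X) should be close to trace(X^{-1} D)
g = lambda X: np.linalg.slogdet(X)[1]
print("(c):", g(X + D) - g(X), np.trace(np.linalg.inv(X) @ D))

# (d) the log-likelihood (dropping the constant C) should peak at Sigma = S_n
n, mu = 200, np.zeros(p)
Xs = rng.multivariate_normal(mu, A, size=n)
S_n = (Xs - mu).T @ (Xs - mu) / n
loglik = lambda Sig: -0.5 * n * (np.trace(np.linalg.inv(Sig) @ S_n)
                                 + np.linalg.slogdet(Sig)[1])
for eps in [0.0, 0.05, -0.05]:
    print("eps =", eps, "loglik =", loglik(S_n + eps * np.eye(p)))
# eps = 0.0 should give the largest value
```

The two printed pairs for (b) and (c) should agree to several digits for a small $\Delta$, and the log-likelihood evaluated at $S_n$ should dominate the perturbed values.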
2. Shrinkage: Suppose $y \sim N(\mu, I_p)$.
(a) Consider the Ridge regression
$$ \min_{\mu} \ \frac{1}{2}\|y - \mu\|_2^2 + \frac{\lambda}{2}\|\mu\|_2^2. $$
Show that the solution is given by
$$ \hat{\mu}^{\mathrm{ridge}}_i = \frac{1}{1 + \lambda}\, y_i. $$
Compute the risk (mean square error) of this estimator. The risk of the MLE is given by the case $C = I$ (in the notation $\hat{\mu}_C(y) = Cy$ of Problem 3).
(b) Consider the LASSO problem,
$$ \min_{\mu} \ \frac{1}{2}\|y - \mu\|_2^2 + \lambda\|\mu\|_1. $$
Show that the solution is given by Soft-Thresholding,
$$ \hat{\mu}^{\mathrm{soft}}_i = \mu^{\mathrm{soft}}(y_i; \lambda) := \mathrm{sign}(y_i)\,(|y_i| - \lambda)_+. $$
For the choice $\lambda = \sqrt{2 \log p}$, show that the risk is bounded by
$$ \mathbb{E}\|\hat{\mu}^{\mathrm{soft}}(y) - \mu\|^2 \le 1 + (2 \log p + 1) \sum_{i=1}^p \min(\mu_i^2, 1). $$
Under what conditions on $\mu$ is this risk smaller than that of the MLE? Note: see Gaussian Estimation by Iain Johnstone, Lemma 2.9 and the reasoning before it.
(c) Consider the $\ell_0$ regularization
$$ \min_{\mu} \ \|y - \mu\|_2^2 + \lambda^2 \|\mu\|_0, $$
where $\|\mu\|_0 := \sum_{i=1}^p I(\mu_i \neq 0)$. Show that the solution is given by Hard-Thresholding,
$$ \hat{\mu}^{\mathrm{hard}}_i = \mu^{\mathrm{hard}}(y_i; \lambda) := y_i\, I(|y_i| > \lambda). $$
Rewriting $\hat{\mu}^{\mathrm{hard}}(y) = (1 - g(y))\,y$, is $g(y)$ weakly differentiable? Why?
(d) Consider the James-Stein Estimator
$$ \hat{\mu}^{\mathrm{JS}}(y) = \left(1 - \frac{\alpha}{\|y\|^2}\right) y. $$
Show that the risk is
$$ \mathbb{E}\|\hat{\mu}^{\mathrm{JS}}(y) - \mu\|^2 = \mathbb{E}\, U_\alpha(y), $$
where $U_\alpha(y) = p - (2\alpha(p - 2) - \alpha^2)/\|y\|^2$. Find the optimal $\alpha^* = \arg\min_\alpha U_\alpha(y)$. Show that for $p > 2$, the risk of the James-Stein Estimator is smaller than that of the MLE for all $\mu \in \mathbb{R}^p$.
(e) In general, an odd monotone unbounded function $\Theta : \mathbb{R} \to \mathbb{R}$ defined by $\Theta_\lambda(t)$ with parameter $\lambda \ge 0$ is called a shrinkage rule if it satisfies
[shrinkage] $0 \le \Theta_\lambda(|t|) \le |t|$;
[odd] $\Theta_\lambda(-t) = -\Theta_\lambda(t)$;
[monotone] $\Theta_\lambda(t) \le \Theta_\lambda(t')$ for $t \le t'$;
[unbounded] $\lim_{t \to \infty} \Theta_\lambda(t) = \infty$.
Which of the rules above are shrinkage rules?
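For parts (a) through (d), the closed-form rules can be written in a few lines and the properties in (e) probed on a grid. The sketch below is a numerical illustration only, under assumptions of my own: plain NumPy, my own function names, and a grid-based check that is a heuristic (in particular for the [unbounded] limit), not a proof.

```python
import numpy as np

# Coordinatewise rules Theta_lambda(t) from parts (a)-(c)
ridge = lambda t, lam: t / (1.0 + lam)
soft  = lambda t, lam: np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)
hard  = lambda t, lam: t * (np.abs(t) > lam)

def james_stein(y, alpha):
    # Part (d): acts on the whole vector y, not coordinate by coordinate
    return (1.0 - alpha / np.sum(y**2)) * y

def check_rule(theta, lam, grid=np.linspace(-10.0, 10.0, 2001)):
    """Numerically probe the four properties listed in part (e) on a grid of t."""
    vals = theta(grid, lam)
    tv = theta(np.abs(grid), lam)
    return {
        "shrinkage": bool(np.all((tv >= 0) & (tv <= np.abs(grid) + 1e-12))),
        "odd":       bool(np.allclose(theta(-grid, lam), -vals)),
        "monotone":  bool(np.all(np.diff(vals) >= -1e-12)),
        "unbounded": bool(theta(np.array([1e6]), lam)[0] > 1e3),  # crude proxy for t -> infinity
    }

lam = 1.5
for name, rule in [("ridge", ridge), ("soft", soft), ("hard", hard)]:
    print(name, check_rule(rule, lam))

# James-Stein shrinks the whole vector toward 0 by a data-dependent factor
y = np.random.default_rng(0).normal(loc=2.0, size=10)
print("JS:", james_stein(y, alpha=len(y) - 2))
```

The grid check only suggests which properties might hold; the answer to (e) should still be argued analytically, and the James-Stein rule is included only to contrast a vector-level rule with the coordinatewise ones.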
3. Necessary Condition for Admissibility of Linear Estimators. Consider the linear estimator for $y \sim N(\mu, \sigma^2 I_p)$,
$$ \hat{\mu}_C(y) = Cy. $$
Show that $\hat{\mu}_C$ is admissible only if
(a) $C$ is symmetric;
(b) $0 \le \rho_i(C) \le 1$ (where $\rho_i(C)$ are the eigenvalues of $C$);
(c) $\rho_i(C) = 1$ for at most two $i$.
These conditions are satisfied by the MLE when $p = 1$ and $p = 2$.
Reference: Theorem 2.3 in Gaussian Estimation by Iain Johnstone,
http://statweb.stanford.edu/~imj/Book100611.pdf
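A small numerical illustration (not a proof) of condition (b): using the exact risk of a linear estimator, $\mathbb{E}\|Cy - \mu\|^2 = \|(C - I)\mu\|^2 + \sigma^2\,\mathrm{trace}(C C^T)$, one can compare a diagonal $C$ with an eigenvalue above 1 against the same matrix with that eigenvalue clipped to 1. The matrices and test means below are arbitrary choices of mine.

```python
import numpy as np

def linear_risk(C, mu, sigma2=1.0):
    """Exact risk of mu_hat = C y for y ~ N(mu, sigma2 * I):
       ||(C - I) mu||^2 + sigma2 * trace(C C^T)."""
    p = len(mu)
    bias = (C - np.eye(p)) @ mu
    return float(bias @ bias + sigma2 * np.trace(C @ C.T))

p = 3
C_bad  = np.diag([1.4, 0.5, 0.2])   # one eigenvalue above 1
C_clip = np.diag([1.0, 0.5, 0.2])   # same matrix with that eigenvalue clipped to 1

rng = np.random.default_rng(0)
for _ in range(5):
    mu = rng.normal(scale=3.0, size=p)   # a few arbitrary test means
    print(linear_risk(C_bad, mu), ">=", linear_risk(C_clip, mu))
```

For a diagonal $C$, both the bias term $\sum_i (c_i - 1)^2 \mu_i^2$ and the variance term $\sigma^2 \sum_i c_i^2$ decrease when an entry $c_i > 1$ is replaced by 1, so the clipped estimator dominates for every $\mu$. This only illustrates the intuition behind (b); the full argument is in Johnstone's Theorem 2.3.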
4. *James-Stein Estimator for $p = 1, 2$ and upper bound:
If we use SURE to calculate the risk of the James-Stein Estimator,
$$ R(\hat{\mu}^{\mathrm{JS}}, \mu) = \mathbb{E}\, U(Y) = p - \mathbb{E}_\mu \frac{(p - 2)^2}{\|Y\|^2} < p = R(\hat{\mu}^{\mathrm{MLE}}, \mu), $$
it seems that for $p = 1$ the James-Stein Estimator should still have lower risk than the MLE for any $\mu$. Can you find out what happens in the $p = 1$ and $p = 2$ cases?
Moreover, can you derive the following upper bound for the risk of the James-Stein Estimator?
$$ R(\hat{\mu}^{\mathrm{JS}}, \mu) \le p - \frac{(p - 2)^2}{p - 2 + \|\mu\|^2} = 2 + \frac{(p - 2)\|\mu\|^2}{p - 2 + \|\mu\|^2}. $$
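A Monte Carlo sketch of these comparisons is below, under assumptions of my own: $\alpha = p - 2$, $\mu$ placed along the first coordinate, plain NumPy, and helper names js_risk_mc / js_risk_bound that are not part of the assignment. It estimates the James-Stein risk for a few $p$ and $\|\mu\|$ and prints the stated upper bound for $p > 2$; it does not settle the $p = 1$ analysis the problem asks for.

```python
import numpy as np

def js_risk_mc(p, mu_norm, n_rep=200_000, rng=None):
    """Monte Carlo estimate of E||mu_hat_JS(y) - mu||^2 with alpha = p - 2,
       for mu = (mu_norm, 0, ..., 0) and y ~ N(mu, I_p)."""
    rng = rng or np.random.default_rng(0)
    mu = np.zeros(p); mu[0] = mu_norm
    y = rng.normal(size=(n_rep, p)) + mu
    alpha = p - 2
    mu_js = (1.0 - alpha / np.sum(y**2, axis=1, keepdims=True)) * y
    return float(np.mean(np.sum((mu_js - mu)**2, axis=1)))

def js_risk_bound(p, mu_norm):
    # Claimed upper bound: p - (p - 2)^2 / (p - 2 + ||mu||^2), for p > 2
    return p - (p - 2)**2 / (p - 2 + mu_norm**2)

# p = 2: alpha = p - 2 = 0, so the James-Stein estimator reduces to the MLE.
# p = 1 is omitted here: with alpha = -1 the Monte Carlo average is heavy-tailed
# and unstable, which is related to what the question asks about.
for p in [2, 5, 20]:
    for mu_norm in [0.0, 2.0, 10.0]:
        mc = js_risk_mc(p, mu_norm)
        bound = js_risk_bound(p, mu_norm) if p > 2 else float("nan")
        print(f"p={p:2d} ||mu||={mu_norm:4.1f}  MC risk={mc:6.3f}  "
              f"bound={bound:6.3f}  MLE risk={p}")
```

Note that at $\mu = 0$ the bound is attained exactly, so the Monte Carlo estimate may sit on either side of it within simulation error.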