Description
1. Consider the model
Yi = θi + εi (i = 1, · · · , n)
where {εi} is a sequence of random variables with mean 0 and finite variance representing
noise. As in Assignment #2, we will assume that θ1, · · · , θn are dependent or “smooth” in
the sense that the absolute differences {θi−θi−1} are small for most values of i. Rather than
penalizing a lack of smoothness by P
i
(θi − θi−1)
2
(as in Assignment #2), we will consider
estimating {θi} given data {yi} by minimizing
Xn
i=1
(yi − θi)
2 + λ
Xn
i=2
θi − θi−1 (1)
where λ > 0 is a tuning parameter. The resulting estimates bθ1, · · · ,
bθn, are called fusion
estimates.
The GaussSeidel algorithm used in Assignment #2 is effectively a coordinate descent algorithm. However, in computing the fusion estimates minimizing (1), application of the coordinate descent algorithm is frustrated to some extent by the fact that the nondifferentiable
penalty in (1) is not separable. However, this can be easily fixed by defining φi = θi − θi−1
for i = 2, · · · , n and then minimizing
Xn
i=1
(yi − θi)
2 + λ
Xn
i=2
φi
 (2)
where now each θi (for i = 2, · · · , n) will be a function of θ1, φ2, · · · , φi
.
(a) Show that θk = θ1 +
Pk
i=2 φi
for k ≥ 2.
(b) For fixed values of φ2, · · · , φn, show that the objective function is minimized at
θ1 =
1
n
Xn
i=1
yi −
X
i
j=2
φj
.
(c) If we fix the values of θ1 and {φi
: 2 ≤ i 6= j ≤ n}, show that the subgradient of the
objective function (2) with respect to φj
is
−2
Xn
i=j
yi − θ1 −
X
i
k=2
φk
!
+ λ∂φj
 = 2(n − j + 1)φj − 2
Xn
i=j
yi − θ1 −
X
i
k=2;k6=j
φk
+ λ∂φj

1
where
∂φj
 =
+1 if φj > 0
[−1, 1] if φj = 0
−1 if φj < 0
and hence the objective function is minimized over φj
for fixed values of θ1 and {φi
: 2 ≤
i 6= j ≤ n} at
φj = 0 if
Xn
i=j
yi − θ1 −
X
i
k=2;k6=j
φk
≤
λ
2
φj =
1
n − j + 1
Xn
i=j
yi − θ1 −
X
i
k=2;k6=j
φk
−
λ
2
if Xn
i=j
yi − θ1 −
X
i
k=2;k6=j
φk
>
λ
2
φj =
1
n − j + 1
Xn
i=j
yi − θ1 −
X
i
k=2;k6=j
φk
+
λ
2
if Xn
i=j
yi − θ1 −
X
i
k=2;k6=j
φk
< −
λ
2
(d) [BONUS – but highly recommended!] Write a function in R implementing the coordinate
descent algorithm suggested in parts (b) and (c) and test it on the data generated as in
Assignment #2. (Convergence of the coordinate descent algorithm here is very slow but you
should be able to produce good estimates of the underlying function.)
2. The Hidalgo stamp data is a (semi)famous dataset containing thicknesses of 482 postage
stamps from the 1872 Mexican “Hidalgo” issue. It is believed that these stamps were printed
on different types of papers so that the data can be modeled as a “mixture” of several
distributions. The data are available in a file stamp.txt on Blackboard.
There is some consensus that a good model for these data is a mixture of normals; that is,
the density of the data is
f(x) = Xm
k=1
λkfk(x)
where fk is the density of a normal distribution with unknown mean and variance µk and σ
2
k
respectively. {λk} are nonnegative with λ1 + · · · + λm = 1.
(a) The “complete” data likelihood here is
L(µ1, · · · , µm, σ2
1
, · · · , σ2
m, λ1, · · · , λm) = Yn
i=1
Ym
k=1
{fk(xi)
uik λ
uik
k
}
where uik = 0 or 1 with ui1 + · · · + uim = 1 for i = 1, · · · , n. Show that the complete data
MLEs are
µbk =
Xn
i=1
uik!−1
Xn
i=1
uikxi
2
σb
2
k =
Xn
i=1
uik!−1
Xn
i=1
uik(xi − µbk)
2
bλk =
1
n
Xn
i=1
uik.
For the EM algorithm, we need to estimate the unobserved {uik}; for given {λk} and {fk}
(which depend on the means and variances), these estimates are
ubik =
λkfk(xi)
λ1f1(xi) + · · · + λmfm(xi)
.
(You do not need to prove this.)
(b) Write an R function that uses the EM algorithm to compute the MLEs of λ1, · · · , λm,
µ1, · · · , µm and σ
2
1
, · · · , σ2
m. Your function should take as inputs initial estimates of these
unknown parameters. (A template function will be provided.)
(c) Using your function, estimate the parameters in the normal mixture model with m = 5, 6,
and 7 components. The success of the EM algorithm for mixture models depends on having
reasonable initial estimates of the parameters. One simple ad hoc approach is to use a kernel
density estimate where the bandwidth parameter is varied so that the estimate has the
appropriate number of modes (corresponding to the different components. This can be done
using the R function density — use the default kernel (which is a Gaussian kernel) and
change the bandwidth using the parameter bw; for example, if the data are in stamp then
plot(density(stamp,bw=0.002)) will give a plot of the density estimate for a bandwidth
of 0.002.
(d) Which of the 3 models considered in part (b) do you prefer and why?
3