Description
1. In lecture, we showed that we could estimate the trace of a matrix A (tr(A)) by
btr(A) = 1
m
Xm
i=1
V
T
i AV i
where V 1, · · · ,V m are independent random vectors with E[V iV
T
i
] = I.
(a) Suppose that the elements of each V i are independent, identically distribution random
variables with mean 0 and variance 1. Show that Var( btr(A) is minimized by taking the
elements of V i to be ±1 each with probability 1/2.
(Hint: This is easier than it looks – Var(V
TAV ) = E[(V
TAV )
2
] − tr(A)
2
so it suffices to
minimize
E[(V
TAV )
2
] = Xn
i=1
Xn
j=1
Xn
k=1
Xn
`=1
aijak`E(ViVjVkV`).
Given our conditions on the elements of V i
, V1, · · · , Vn, most of E(ViVjVkV`) are either 0 or
1. You should be able to show that
E[(V
TAV )
2
] = Xn
i=1
a
2
iiE(V
4
i
) + constant
and find Vi to minimize E(V
4
i
) subject to E(V
2
i
) = 1.)
(b) We also noted that if B is a symmetric n×n matrix whose eigenvalues are less than 1 in
absolute value then we have the following formula for the determinate of the matrix I − B:
det(I − B) = exp
−
X∞
k=1
1
k
tr(B
k
)
!
and so we can estimate this determinant by
det( d I − B) = exp
−
1
m
Xm
i=1
V
T
i BV i +
1
2
V
T
i B
2V i + · · · +
1
r
V
T
i B
rV i
!
where r is “large enough”.
1
In linear regression, one measure of the leverage of a subset of ` observations (x1, y1), · · · ,(x`
, y`)
is 1 − det(I − H11) where H11 is an ` × ` matrix defined in terms of the linear regression
“hat” matrix as follows:
H = X(X
T X)
−1X
T =
H11 H12
H21 H22
.
From above, we can estimate det(I − H11) by
det( d I − H11) = exp
−
1
m
Xm
i=1
V
T
i H11V i +
1
2
V
T
i H
2
11V i + · · · +
1
r
V
T
i H
r
11V i
!
for some r.
Show that we can compute H11V and Hk
11V for k ≥ 2 using the hat matrix H as follows:
H
V
0
=
H11V
H21V
and
H
H
k−1
11 V
0
=
Hk
11V
H21H
k−1
11 V
.
(c) On Quercus, there is a function leverage (in the file leverage.txt) that computes the
leverage for a given subset of observations for a design matrix X. (This function uses the
QR decomposition of X to compute HV i
; the functions are qr, which computes the QR
decomposition of X and qr.fitted. which computes Hy = QQT y.)
Suppose that yi = g(xi) + εi
for i = 1, · · · , n for some smooth function g and consider the
following two parametric models for g:
g1(x) = β0 +
X
5
k=1
{β2k−1 cos(2kπx) + β2k sin(2kπx)}
and
g2(x) = β0 + β1φ1(x) + · · · + β10φ10(x)
where φ1(x), · · · , φ10(x) are B-spline functions. Suppose that x1, · · · , x1000 are equally spaced
points on [0, 1] with xi = i/1000. The B-spline functions and the respective design matrices
can be constructed using the following R code:
> x <- c(1:1000)/1000 > X1 <- 1 > for (k in 1:5) X1 <- cbind(X1,cos(2*k*pi*x),sin(2*k*pi*x)) > library(splines) # loads the library of functions to compute B-splines
> X2 <- cbind(1,bs(x,df=10)) 2 Note that both X1 and X2 are 1000 × 11 matrices. You can see the B-spline functions φ1(x), · · · , φ10(x) as follows: > plot(x,X2[,2])
> for (i in 3:11) points(x,X2[,i])
Estimate the leverage of the points {xi
: (k − 1)/20 < xi ≤ k/20} for k = 1, · · · , 20 for both designs; for each design, you will obtain 20 leverages. Comment on the differences between the leverages estimated for the two designs. To estimate the leverages for the two designs, you may want to modify the function leverage to estimate the leverages for both designs using the same values of V 1, · · · ,V m (why?). 2. Suppose that X1, · · · , Xn are independent Gamma random variables with common density f(x; α, λ) = λ αx α−1 exp(−λx) Γ(α) for x > 0
where α > 0 and λ > 0 are unknown parameters.
(a) The mean and variance of the Gamma distribution are α/λ and α/λ2
, respectively. Use
these to define method of moments estimates of α and λ based on the sample mean and
variance of the data x1, · · · , xn
(b) Derive the likelihood equations for the MLEs of α and λ and derive a Newton-Raphson
algorithm for computing the MLEs based on x1, · · · , xn. Implement this algorithm in R and
test on data generated from a Gamma distribution (using the R function rgamma). Your
function should also output an estimate of the variance-covariance matrix of the MLEs –
this can be obtained from the Hessian of the log-likelihood function.
Important note: To implement the Newton-Raphson algorithm, you will need to compute
the first and second derivatives of ln Γ(α). These two derivatives are called (respectively)
the digamma and trigamma functions, and these functions are available in R as digamma
and trigamma; for example,
> gamma(2) # gamma function evaluated at 2
[1] 1
> digamma(2) # digamma function evaluated at 2
[1] 0.4227843
> trigamma(2) # trigamma function evaluated at 2
[1] 0.6449341
3
Supplemental problems:
3. Consider LASSO estimation in linear regression where we define bβλ
to minimize
Xn
i=1
(yi − y¯ − x
T
i β)
2 + λ
X
p
j=1
|βj
|
for some λ > 0. (We assume that the predictors are centred and scaled to have mean 0 and
variance 1, in which case ¯y is the estimate of the intercept.) Suppose that the least squares
estimate (i.e. for λ = 0) is non-unique — this may occur, for example, if there is some exact
linear dependence in the predictors or if p > n. Define
τ = min
β
Xn
i=1
(yi − y¯ − x
T
i β)
2
and the set
C =
(
β :
Xn
i=1
(yi − y¯ − x
T
i β)
2 = τ
)
.
We want to look at what happens to the LASSO estimate bβλ as λ ↓ 0.
(a) Show that bβλ minimizes
1
λ
(Xn
i=1
(yi − y¯ − x
T
i β)
2 − τ
)
+
X
p
j=1
|βj
|.
(b) Find the limit of
1
λ
(Xn
i=1
(yi − y¯ − x
T
i β)
2 − τ
)
as λ ↓ 0 as a function of β. (What happens when β 6∈ C?) Use this to deduce that as λ ↓ 0,
bβλ → bβ0 where bβ0 minimizes X
p
j=1
|βj
| on the set C.
(c) Show that bβ0
is the solution of a linear programming problem. (Hint: Note that C can
be expressed in terms of β satisfying p linear equations.)
4. Consider minimizing the function
g(x) = x
2 − 2αx + λ|x|
γ
where λ > 0 and 0 < γ < 1. (This problem arises, in a somewhat more complicated form, in shrinkage estimation in regression.) The function |x| γ has a “cusp” at 0, which mean that if λ is sufficient large then g is minimized at x = 0. 4 (a) g is minimized at x = 0 if, and only if, λ ≥ 2 2 − γ ” 2 − 2γ 2 − γ #1−γ |α| 2−γ . (1) Otherwise, g is minimized at x ∗ satisfying g 0 (x ∗ ) = 0. Using R, compare the following two iterative algorithms for computing x ∗ (when condition (1) does not hold): (i) Set x0 = α and define xk = α − λγ 2 |xk−1| γ xk−1 k = 1, 2, 3, · · · (ii) The Newton-Raphson algorithm with x0 = α. Use different values of α, γ, and λ to test these algorithms. Which algorithm is faster? (b) Functions like g arise in so-called bridge estimation in linear regression (which are generalizations of the LASSO) – such estimation combines the features of ridge regression (which shrinks least squares estimates towards 0) and model selection methods (which produce exact 0 estimates for some or all parameters). Bridge estimates bβ minimize (for some γ > 0
and λ > 0),
Xn
i=1
(yi − x
T
i β)
2 + λ
X
p
j=1
|βj
|
γ
. (2)
See the paper by Huang, Horowitz and Ma (2008) (“Asymptotic properties of bridge estimators in sparse high-dimensional regression models” Annals of Statistics. 36, 587–613) for
details. Describe how the algorithms in part (a) could be used to define a coordinate descent
algorithm to find bβ minimizing (2) iteratively one parameter at a time.
(c) Prove that g is minimized at 0 if, and only if, condition (1) in part (a) holds.
5. Suppose that A is a symmetric non-negative definite matrix with eigenvalues λ1 ≥ λ2 ≥
· · · ≥ λn ≥ 0. Consider the following algorithm for computing the maximum eigenvalue λ1:
Given x0, define for k = 0, 1, 2, · · ·, xk+1 =
Axk
kAxkk2
and µk+1 =
x
T
k+1Axk+1
x
T
k+1xk+1
.
Under certain conditions, µk → λ1, the maximum eigenvalue of A; this algorithm is known
as the power method and is particularly useful when A is sparse.
(a) Suppose that v1, · · · , vn are the eigenvectors of A corresponding to the eigenvalues
λ1, · · · , λn. Show that µk → λ1 if x
T
0 v1 6= 0 and λ1 > λ2.
(b) What happens to the algorithm if if the maximum eigenvalue is not unique, that is,
λ1 = λ2 = · · · = λk?
5
6. (a) Suppose that A is an invertible matrix that can be written as I − B where B has its
eigenvalues in the interval (−1, 1). Show that
tr(A
−1
) = X∞
k=0
tr(B
k
)
(where B0 = I).
(b) We can use Hutchinson’s method to estimate tr(A−1
) by exploiting the formula in part
(a), truncating the infinite series at some finite point r (where Bk = 0 for k > r). The key
lies in writing A = α(I − B) for some constant α and matrix B whose eigenvalues lie in
(−1, 1); then
tr(A
−1
) = α
−1 X∞
k=0
tr(B
k
).
Suppose that λ1, · · · , λn > 0 are the eigenvalues of A and define µ so that
µ ≥ max
1≤i≤n
λi
.
In terms of µ, how would you define α and B?
(c) Suppose that A is symmetric positive definite with elements {aij : 1 ≤ i, j ≤ n}. Show
that we can take µ in part (b) to be
µ = max
1≤i≤n
Xn
j=1
|aij |.
(Hint: Show that µ = kAk∞.)
6