Stat4DS / Homework 03: Who let the DAGs out?

Who let the DAGs out?
Remember DAGs? Good. It’s now time to learn-how-to-learn their topology (at least in the Gaussian case) and then put
them to work in a biological setup.
Our statistical recipe needs a lot of ingredients, namely:
1. The basics of the Likelihood Ratio Test (lrt) method.
2. The concept and ideas behind Universal Inference.
3. The notion of Gaussian DAGs and their (constrained) likelihood functions.
4. The incarnation of the lrt for testing directed connections in Gaussian DAGs.
. . . I guess we’d better start . . .
Ingredient (A): The Likelihood Ratio Test
The Wald test is useful for testing a scalar parameter. The Likelihood Ratio Test (lrt) is more general and can be used for
testing a vector–valued parameter. More specifically:
The Likelihood Ratio Test
Within a parametric model $\mathcal{F} = \{f_\theta(\cdot) : \theta \in \Theta \subseteq \mathbb{R}^p\}$, consider testing
$$H_0 : \theta \in \Theta_0 \quad \text{vs} \quad H_1 : \theta \notin \Theta_0.$$
Given an iid sample $\mathcal{D}_n = \{X_1, \ldots, X_n\} \sim f_\theta(\cdot)$ with the associated likelihood function $\mathcal{L}(\theta \mid \mathcal{D}_n) = \prod_i f_\theta(X_i)$, the likelihood ratio test rejects the null if $U_n > c$ for some suitable critical level $c$, with
$$U_n = \frac{\sup_{\theta \in \Theta} \mathcal{L}(\theta \mid \mathcal{D}_n)}{\sup_{\theta \in \Theta_0} \mathcal{L}(\theta \mid \mathcal{D}_n)} = \frac{\mathcal{L}(\widehat{\theta} \mid \mathcal{D}_n)}{\mathcal{L}(\widehat{\theta}_0 \mid \mathcal{D}_n)},$$
where $\widehat{\theta}_0$ denotes the MLE constrained to $\Theta_0$ (i.e., assuming $H_0$ is true), whereas $\widehat{\theta}$ denotes the unconstrained MLE.
Remarks:
• Replacing $\Theta_0^c$ with $\Theta$ in the numerator has little effect on the test statistic, and the unconstrained version simplifies the theoretical analysis of the test statistic.
• The likelihood ratio test is most useful when Θ0 consists of all parameter values θ such that some coordinates of θ are
fixed at particular values.
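Just to fix ideas, here is a minimal numerical sketch of how $U_n$ can be computed by maximizing the log-likelihood with and without the null constraint. The toy model (a Normal with unknown mean and log-variance), the simulated data and all the function names are our own illustrative choices, not part of the homework.

```python
import numpy as np
from scipy import optimize, stats

def lrt_Un(neg_loglik, theta_init, null_constraint):
    """Generic U_n = sup_{Theta} L / sup_{Theta_0} L, computed by numerical optimization.
    `neg_loglik(theta)` is -log L(theta | D_n); `null_constraint` restricts theta to Theta_0."""
    full = optimize.minimize(neg_loglik, theta_init)                                  # unconstrained MLE
    null = optimize.minimize(neg_loglik, theta_init, constraints=[null_constraint])   # MLE over Theta_0
    return np.exp(null.fun - full.fun)    # exp{loglik(theta_hat) - loglik(theta_hat_0)}

# toy example (ours): N(mu, sigma^2) data with theta = (mu, log sigma), testing H0: mu = 0
rng = np.random.default_rng(0)
x = rng.normal(loc=0.4, scale=1.0, size=50)
nll = lambda th: -np.sum(stats.norm.logpdf(x, loc=th[0], scale=np.exp(th[1])))
print(lrt_Un(nll, theta_init=np.array([0.0, 0.0]),
             null_constraint={"type": "eq", "fun": lambda th: th[0]}))
```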
Let’s now look at a famous example of testing for the mean of a Normal population: one of the few cases where we have exact, finite-sample results.
Example: Student’s t–test
Let $\mathcal{D}_n = \{X_1, \ldots, X_n\}$ be iid from a $N(\mu, \sigma^2)$ and suppose we want to test
$$H_0 : \mu = \mu_0 \quad \text{vs} \quad H_1 : \mu \neq \mu_0, \qquad U_n = \frac{\mathcal{L}(\widehat{\mu}, \widehat{\sigma} \mid \mathcal{D}_n)}{\mathcal{L}(\mu_0, \widehat{\sigma}_0 \mid \mathcal{D}_n)},$$
where $\widehat{\sigma}_0$ maximizes the likelihood under the null, that is, subject to $\mu = \mu_0$.
After some simple but tedious algebra, it can be shown that $U_n > c \iff |T_n| > c'$, where
$$T_n = \frac{\bar{X}_n - \mu_0}{\widehat{\sigma}/\sqrt{n}} \;\overset{H_0}{\sim}\; t_{n-1}, \qquad \text{with } \widehat{\sigma}^2 = \frac{1}{n}\sum_i (X_i - \bar{X}_n)^2.$$
So the final two-sided test on the mean $\mu$ of a Normal population is:
Reject $H_0$ if $|T_n| > t_{n-1,\,\alpha/2}$ (Student’s t-test).
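For concreteness, here is a minimal sketch of the resulting two-sided test on simulated data of our own choosing (the sketch uses the usual $1/(n-1)$ sample standard deviation, so that the exact $t_{n-1}$ calibration applies as is):

```python
import numpy as np
from scipy import stats

def t_test_mean(x, mu0, alpha=0.05):
    """Exact two-sided Student's t-test for H0: mu = mu0 under a Normal model."""
    n = len(x)
    t_n = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))   # observed T_n
    crit = stats.t.ppf(1 - alpha / 2, df=n - 1)             # t_{n-1, alpha/2}
    return t_n, crit, abs(t_n) > crit

rng = np.random.default_rng(42)
x = rng.normal(loc=0.3, scale=1.0, size=30)   # simulated data with true mu = 0.3
print(t_test_mean(x, mu0=0.0))                # should tend to reject H0: mu = 0
```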
Similarly to the Wald test, in more general situations where we are dealing with non-Gaussian (but still regular!) populations, all we can do to appropriately tune the critical value $c$, and thereby control the type-I error probability of our lrt, is to appeal to some suitable, broadly applicable, asymptotic result.
More specifically, here are two classics:
Asymptotic approximation / scalar
Consider testing $H_0 : \theta = \theta_0$ versus $H_1 : \theta \neq \theta_0$, where $\theta \in \mathbb{R}$.
Then, under $H_0$ (+ regularity conditions on the population model $\mathcal{F}$),
$$T_n = 2 \log U_n \;\overset{d}{\longrightarrow}\; \chi^2_1.$$
Hence an asymptotic level $\alpha$ test is: Reject $H_0$ when $T_n > \chi^2_{1,\alpha}$.
Asymptotic approximation / vector
Consider testing a null $H_0 : \theta \in \Theta_0 \subseteq \mathbb{R}^p$ where we are fixing some parameters.
Then, under $H_0$ (+ regularity conditions on the population model $\mathcal{F}$),
$$T_n = 2 \log U_n \;\overset{d}{\longrightarrow}\; \chi^2_\nu, \qquad \text{where } \nu = \dim(\Theta) - \dim(\Theta_0).$$
Hence an asymptotic level $\alpha$ test is: Reject $H_0$ when $T_n > \chi^2_{\nu,\alpha}$.
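As a quick illustration of the scalar recipe (our own toy example, not part of the homework), here is the $\chi^2_1$ calibration applied to testing the rate of a Poisson model, where $2 \log U_n$ is available in closed form:

```python
import numpy as np
from scipy import stats

def lrt_poisson_rate(x, lam0, alpha=0.05):
    """Asymptotic LRT for H0: lambda = lam0 in a Poisson(lambda) model."""
    n, lam_hat = len(x), x.mean()                        # unconstrained MLE
    # T_n = 2 log U_n = 2 * (loglik at lam_hat - loglik at lam0); the log(x_i!) terms cancel
    t_n = 2 * n * (lam_hat * np.log(lam_hat / lam0) - (lam_hat - lam0))
    crit = stats.chi2.ppf(1 - alpha, df=1)               # chi^2_{1, alpha} cutoff
    return t_n, crit, t_n > crit

rng = np.random.default_rng(0)
x = rng.poisson(lam=2.5, size=200)
print(lrt_poisson_rate(x, lam0=2.0))   # should tend to reject H0: lambda = 2
```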
Ingredient (B): Universal Inference
As we all should know at this point, in classical frequentist statistics, confidence sets and tests are often obtained by starting
from asymptotically Gaussian estimators or other large sample results.
As a consequence, their validity relies on large sample asymptotic theory and requires that the statistical model satisfy certain
regularity conditions. When these conditions do not hold, or the sample is not large “enough”, there is no general method
for statistical inference, and these settings are typically considered in an ad-hoc manner.
Recently, a new, universal method based simply on sample splitting has been introduced which yields tests and confidence sets for any statistical model (regular or not) and also comes with finite-sample guarantees.
Focusing on hypothesis testing, historically speaking sample splitting was first analysed computationally and theoretically in a 4-page 1975 paper by Sir David Cox, one of the greatest statisticians of all time, and then further discussed in his 1977 review (2, Section 3.2), where he describes the method as well known and refers to an American Statistician paper with a wide-ranging discussion of “snooping”, “fishing”, and “hunting” in data analysis.
Honestly? An easy read, way more relevant now than then!
Let’s now describe this idea in the context of lrt:
Universal Hypothesis Test
Let $\mathcal{D}_{2n} = \{X_1, \ldots, X_{2n}\}$ be an iid sample from a population model $\mathcal{F}$ having density $f_\theta(\cdot)$ and consider testing $H_0 : \theta \in \Theta_0$ vs $H_1 : \theta \notin \Theta_0$.
To this end, (randomly) split the data $\mathcal{D}_{2n}$ into two groups having the same size $n$ (just to simplify the notation), and build the two corresponding likelihood functions:
$$\mathcal{D}_{2n} \;\longrightarrow\; \left(\mathcal{D}^{\mathrm{Tr}}_n,\, \mathcal{D}^{\mathrm{Te}}_n\right), \qquad \left\{\; \mathcal{L}(\theta \mid \mathcal{D}^{\mathrm{Tr}}_n) = \prod_{i \in \mathcal{D}^{\mathrm{Tr}}_n} f_\theta(X_i), \quad \mathcal{L}(\theta \mid \mathcal{D}^{\mathrm{Te}}_n) = \prod_{i \in \mathcal{D}^{\mathrm{Te}}_n} f_\theta(X_i) \;\right\}$$
Consider now the following two MLEs for $\theta$:
$$\widehat{\theta}^{\mathrm{Tr}}_0 = \underset{\theta \in \Theta_0}{\operatorname{argmax}}\; \mathcal{L}(\theta \mid \mathcal{D}^{\mathrm{Tr}}_n) \quad \text{($H_0$-constrained estimator based on the Training Data)},$$
$$\widehat{\theta}^{\mathrm{Te}} = \underset{\text{all } \theta}{\operatorname{argmax}}\; \mathcal{L}(\theta \mid \mathcal{D}^{\mathrm{Te}}_n) \quad \text{(Unconstrained estimator based on the Test Data)}.$$
At this point we are ready to define the following two universal test statistics:
$$U_n = \frac{\mathcal{L}\!\left(\widehat{\theta}^{\mathrm{Te}} \mid \mathcal{D}^{\mathrm{Tr}}_n\right)}{\mathcal{L}\!\left(\widehat{\theta}^{\mathrm{Tr}}_0 \mid \mathcal{D}^{\mathrm{Tr}}_n\right)} \;\; \text{(Split Likelihood Ratio)} \qquad \text{and} \qquad W_n = \frac{U_n + U^{\mathrm{swap}}_n}{2} \;\; \text{(Cross-Fit Likelihood Ratio)} \tag{1}$$
where $U^{\mathrm{swap}}_n$ is the same as $U_n$ after swapping the roles of $\mathcal{D}^{\mathrm{Tr}}_n$ and $\mathcal{D}^{\mathrm{Te}}_n$.
Based on $U_n$ and $W_n$, we get the following two universal testing procedures:
$$\text{Reject } H_0 \text{ if } U_n > \frac{1}{\alpha} \;\; \text{(Split LRT)}, \qquad \text{and} \qquad \text{Reject } H_0 \text{ if } W_n > \frac{1}{\alpha} \;\; \text{(Cross-Fit LRT)}. \tag{2}$$
By simply applying Markov’s inequality, and under no assumptions on the population model $\mathcal{F}$, in Theorem 3 the Authors show that, in finite samples, the Split and Cross-Fit LRTs control the type-I error probability at level $\alpha$.
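To make the recipe concrete, here is a minimal sketch of the Split and Cross-Fit LRTs in (1)–(2) for the toy problem of testing the mean of a $N(\mu, 1)$ population with known unit variance; the model, the simulated data and the function names are our own illustrative choices (everything is done on the log scale to avoid overflow):

```python
import numpy as np

def gauss_loglik(x, mu):
    """Log-likelihood of N(mu, 1) data (known unit variance, chosen for simplicity)."""
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)

def universal_lrt(x, mu0, alpha=0.05, seed=0):
    """Split and Cross-Fit LRTs of Eq. (1)-(2) for H0: mu = mu0."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    tr, te = x[idx[: len(x) // 2]], x[idx[len(x) // 2:]]

    def log_U(d_tr, d_te):
        # unconstrained MLE from the "test" half evaluated on the "training" half,
        # against the H0-constrained MLE (here simply mu0) on that same half
        return gauss_loglik(d_tr, d_te.mean()) - gauss_loglik(d_tr, mu0)

    log_Un, log_Un_swap = log_U(tr, te), log_U(te, tr)
    log_Wn = np.logaddexp(log_Un, log_Un_swap) - np.log(2)
    thr = np.log(1 / alpha)
    return log_Un > thr, log_Wn > thr      # (Split LRT decision, Cross-Fit LRT decision)

rng = np.random.default_rng(1)
x = rng.normal(loc=0.5, scale=1.0, size=400)
print(universal_lrt(x, mu0=0.0))   # should tend to reject H0: mu = 0
```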
Ingredient (C): Gaussian DAGs and their (constrained) Likelihood
We (should) know that we use DAG models to encode the joint distribution $f(\mathbf{x})$ of a random vector $\mathbf{X} = [X_1, \ldots, X_p]^{\mathsf{T}} \in \mathbb{R}^p$: the nodes and the directed edges represent, respectively, the variables and the parent-child dependence relations between any two variables. We also know that the joint $f(\mathbf{x})$ is Markov w.r.t. a graph $\mathcal{G}$ if it admits the following factorization
$$f(\mathbf{x}) = \prod_{j=1}^{p} f\!\left(x_j \mid \mathrm{pa}(x_j)\right),$$
where $\mathrm{pa}(x_j)$ denotes the set of variables with an arrow towards $X_j$ in the (directed) graph $\mathcal{G}$.
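As a tiny illustration of this factorization, here is a sketch with a toy chain $X_1 \to X_2 \to X_3$; the specific Gaussian conditionals are ours, purely for exposition:

```python
from scipy import stats

# Toy chain X1 -> X2 -> X3: each node carries a conditional log-density
# log f(x_j | pa(x_j)); the particular Gaussian conditionals below are ours.
log_cond = {
    "X1": lambda x: stats.norm.logpdf(x["X1"], loc=0.0, scale=1.0),
    "X2": lambda x: stats.norm.logpdf(x["X2"], loc=0.8 * x["X1"], scale=1.0),
    "X3": lambda x: stats.norm.logpdf(x["X3"], loc=-0.5 * x["X2"], scale=1.0),
}

def log_joint(x):
    """log f(x) = sum_j log f(x_j | pa(x_j)): the Markov factorization w.r.t. G."""
    return sum(f(x) for f in log_cond.values())

print(log_joint({"X1": 0.2, "X2": 0.5, "X3": -0.1}))
```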
Our main goal here is to lay down a strategy to infer the pairwise relations imposed by the (local) Markov dependence. To get started, we first restrict the scope of our analysis by focusing on a Gaussian random vector $\mathbf{X} \sim N_p$.
Under this assumption, we can capture the directional effects induced by directed edges by using a linear structural equation model¹
$$X_j = \sum_{k \,:\, k \neq j} A[j,k] \cdot X_k + \epsilon_j, \quad \text{where } \epsilon_j \overset{\text{iid}}{\sim} N_1(0, \sigma^2), \tag{3}$$
and $A$ is a $(p \times p)$ adjacency matrix in which a nonzero entry $A[j,k]$ in position $(j,k)$ corresponds to a directed edge from parent node $k$ to child node $j$, with its value indicating the strength of the relation; $A[j,k] = 0$ when $k \notin \mathrm{pa}(X_j)$.
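A minimal sketch of how one can simulate from model (3), given a DAG adjacency matrix $A$ (the toy matrix and values below are ours): since $\mathbf{X} = A\mathbf{X} + \boldsymbol{\epsilon}$, a joint draw is obtained by solving $(I - A)\mathbf{X} = \boldsymbol{\epsilon}$.

```python
import numpy as np

def simulate_sem(A, sigma2, n, rng):
    """Draw n iid samples from X_j = sum_k A[j,k] X_k + eps_j, eps_j ~ N(0, sigma2).
    A must encode a DAG; each joint draw solves (I - A) X = eps."""
    p = A.shape[0]
    eps = rng.normal(scale=np.sqrt(sigma2), size=(n, p))
    return np.linalg.solve(np.eye(p) - A, eps.T).T     # shape (n, p)

# toy DAG X1 -> X2 -> X3; entry A[j, k] is the effect of parent k on child j (values are ours)
A = np.array([[0.0,  0.0, 0.0],
              [1.5,  0.0, 0.0],
              [0.0, -0.8, 0.0]])
X = simulate_sem(A, sigma2=1.0, n=500, rng=np.random.default_rng(7))
print(X.mean(axis=0).round(2), X.var(axis=0).round(2))
```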
Learning Goal
Given a random sample $\mathcal{D}_n = \{\mathbf{X}_1, \ldots, \mathbf{X}_n\} \overset{\text{iid}}{\sim} f(\cdot)$, where $\mathbf{X}_i = [X_{i,1}, \ldots, X_{i,p}]^{\mathsf{T}} \in \mathbb{R}^p$, infer the adjacency matrix $A$ subject to the requirement that $A$ defines a directed acyclic graph.
Denoting by $\theta = (A, \sigma^2)$ the parameters of our model, to approach the problem from a maximum likelihood perspective we need to write down:
1. The (log-)likelihood function $\ell_n(\theta) = \ln \mathcal{L}(\theta \mid \mathcal{D}_n)$ (see the coding sketch at the end of this section).
2. How to introduce acyclicity constraints to drive the learning process.
¹The homoscedasticity of the errors $\{\epsilon_j\}_j$ is not required, but it is useful to induce identifiability and avoid technicalities regarding equivalence classes. In addition, individual means $\mu_j$ could be incorporated by adding intercepts to Equation 3. For simplicity, in what follows we set the means to zero.
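As a head start on item 1 above, here is one possible way to code $\ell_n(\theta)$ under model (3) with zero means: for a DAG adjacency matrix, $\det(I - A) = 1$, so the joint density of each observation reduces to the product of the $N(0, \sigma^2)$ densities of the residuals $\boldsymbol{\epsilon}_i = (I - A)\mathbf{X}_i$. This is only a sketch under our assumptions (equal error variances, zero means), not necessarily the exact parametrization the homework expects.

```python
import numpy as np

def dag_loglik(X, A, sigma2):
    """l_n(theta) = ln L(theta | D_n) for theta = (A, sigma2) under model (3), zero means.
    For a DAG, det(I - A) = 1, so each observation's density is the product of the
    N(0, sigma2) densities of its residuals eps_i = (I - A) X_i."""
    n, p = X.shape
    resid = X - X @ A.T                 # row-wise residuals eps_i = X_i - A X_i
    rss = np.sum(resid ** 2)
    return -0.5 * n * p * np.log(2 * np.pi * sigma2) - 0.5 * rss / sigma2

# quick self-contained check on a 2-node toy DAG X1 -> X2 (values are ours)
rng = np.random.default_rng(7)
A = np.array([[0.0, 0.0],
              [1.0, 0.0]])
eps = rng.normal(size=(100, 2))
X = np.linalg.solve(np.eye(2) - A, eps.T).T
print(dag_loglik(X, A, sigma2=1.0))
```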