DDA 4010 – Bayesian Statistics Exercise Sheet 6

Assignment A6.1 (10.5 in Textbook):

Logistic regression variable selection: Consider a logistic regression model for predicting diabetes
as a function of x1 = number of pregnancies, x2 = blood pressure, x3 = body mass index, x4 =
diabetes pedigree and x5 = age. Using the data in azdiabetes.dat, center and scale each of the x
variables by subtracting the sample average and dividing by the sample standard deviation for each
variable.
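
The centering and scaling step can be sketched as follows. This is a minimal Python sketch and not part of the sheet: the helper name `standardize` is hypothetical, and the random matrix `X_raw` is only a stand-in for the five predictor columns of azdiabetes.dat, whose file layout is not shown here.

```python
import numpy as np

def standardize(X):
    """Center each column at its sample mean and divide by its sample standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)  # ddof=1 gives the sample (n - 1) standard deviation
    return (X - mean) / sd

# Hypothetical stand-in for the five predictors (pregnancies, blood pressure,
# BMI, pedigree, age); replace with the columns read from azdiabetes.dat.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[3.0, 70.0, 32.0, 0.5, 30.0],
                   scale=[3.0, 12.0, 6.0, 0.3, 10.0],
                   size=(200, 5))
X_std = standardize(X_raw)
```

After this step every column of `X_std` has sample mean 0 and sample standard deviation 1, so the normal(0, 4) priors on the slopes act on a common scale.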

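Consider a logistic regression model of the form

$$\Pr(Y_i = 1 \mid x_i, \beta, \gamma) = \frac{e^{\theta_i}}{1 + e^{\theta_i}},$$

where

$$\theta_i = \beta_0 + \beta_1\gamma_1 x_{i,1} + \beta_2\gamma_2 x_{i,2} + \beta_3\gamma_3 x_{i,3} + \beta_4\gamma_4 x_{i,4} + \beta_5\gamma_5 x_{i,5}.$$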

In this model, each γj is either 0 or 1, indicating whether or not variable j is a predictor of diabetes.
For example, if it were the case that γ = (1, 1, 0, 0, 0), then θi = β0 + β1xi,1 + β2xi,2. Obtain
posterior distributions for β and γ, using independent prior distributions for the parameters, such
that γj ∼ binary(1/2), β0 ∼ normal(0, 16) and βj ∼ normal(0, 4) for each j > 0.

• Implement a Metropolis-Hastings algorithm for approximating the posterior distribution of
β and γ. Examine the sequences $\beta_j^{(s)}$ and $\beta_j^{(s)} \times \gamma_j^{(s)}$ for each j
and discuss the mixing of the chain.

• Approximate the posterior probability of the top five most frequently occurring values of γ.
How good do you think the MCMC estimates of these posterior probabilities are?
• For each j, plot posterior densities and obtain posterior means for βjγj. Also obtain
Pr(γj = 1 | x, y).
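
One possible way to set up the sampler asked for in the bullets above is sketched below. It rests on several assumptions that the sheet does not fix: normal(0, 16) and normal(0, 4) are read as mean and variance, β is updated with a symmetric random-walk proposal whose step size 0.15 is hand-picked, each γj is updated by proposing a flip, and synthetic data replace azdiabetes.dat. The function names `log_posterior` and `mh_sampler` are hypothetical. Because both proposals are symmetric, the Metropolis-Hastings acceptance ratio reduces to a ratio of unnormalized posterior densities.

```python
from collections import Counter
import numpy as np

def log_posterior(beta, gamma, X, y):
    """Unnormalized log posterior of (beta, gamma); the binary(1/2) prior is constant."""
    theta = beta[0] + X @ (beta[1:] * gamma)
    log_lik = np.sum(y * theta - np.logaddexp(0.0, theta))
    # Assumed variance parameterization: beta_0 ~ N(0, 16), beta_j ~ N(0, 4).
    log_prior = -beta[0] ** 2 / (2 * 16.0) - np.sum(beta[1:] ** 2) / (2 * 4.0)
    return log_lik + log_prior

def mh_sampler(X, y, n_iter=2000, step=0.15, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    beta, gamma = np.zeros(p + 1), np.ones(p, dtype=int)
    lp = log_posterior(beta, gamma, X, y)
    betas = np.empty((n_iter, p + 1))
    gammas = np.empty((n_iter, p), dtype=int)
    for s in range(n_iter):
        # Symmetric random-walk proposal for all coefficients at once.
        beta_prop = beta + step * rng.standard_normal(p + 1)
        lp_prop = log_posterior(beta_prop, gamma, X, y)
        if np.log(rng.uniform()) < lp_prop - lp:
            beta, lp = beta_prop, lp_prop
        # Propose flipping each indicator in turn (also symmetric).
        for j in range(p):
            gamma_prop = gamma.copy()
            gamma_prop[j] = 1 - gamma_prop[j]
            lp_prop = log_posterior(beta, gamma_prop, X, y)
            if np.log(rng.uniform()) < lp_prop - lp:
                gamma, lp = gamma_prop, lp_prop
        betas[s], gammas[s] = beta, gamma
    return betas, gammas

# Synthetic stand-in for the standardized predictors (an assumption, not the real data).
rng = np.random.default_rng(1)
n, p = 300, 5
X = rng.standard_normal((n, p))
theta_true = -0.5 + 1.5 * X[:, 0]  # only the first variable is a true predictor
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-theta_true)))

betas, gammas = mh_sampler(X, y)
burn = 1000
incl_prob = gammas[burn:].mean(axis=0)                       # Pr(gamma_j = 1 | x, y)
post_mean = (betas[burn:, 1:] * gammas[burn:]).mean(axis=0)  # means of beta_j * gamma_j
top5 = Counter(map(tuple, gammas[burn:])).most_common(5)     # most frequent gamma values
```

Trace plots of `betas[:, j]` and `betas[:, j] * gammas[:, j - 1]` are a natural starting point for the mixing discussion. In this particular sampler, βj is driven only by its prior while γj = 0, which is one reason the two sequences can behave quite differently.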

Assignment A6.2 (11.2 in Textbook):

Randomized block design: Researchers interested in identifying the optimal planting density for a
type of perennial grass performed the following randomized experiment: Ten different plots of
land were each divided into eight subplots, and planting densities of 2, 4, 6 and 8 plants per square
meter were randomly assigned to the subplots, so that there are two subplots at each density in
each plot.

At the end of the growing season the amount of plant matter yield was recorded in
metric tons per hectare. These data appear in the file pdensity.dat. The researchers want to fit
a model like $y = \beta_1 + \beta_2 x + \beta_3 x^2 + \epsilon$, where y is yield and x is planting
density, but worry that since soil conditions vary across plots they should allow for some
across-plot heterogeneity in this relationship.

To accommodate this possibility we will analyze these data using the hierarchical
linear model described in Section 11.1.

• Before we do a Bayesian analysis we will get some ad hoc estimates of these parameters via
least squares regression. Fit the model $y = \beta_1 + \beta_2 x + \beta_3 x^2 + \epsilon$ using OLS
for each group, and make a plot showing the heterogeneity of the least squares regression lines.
From the least squares coefficients find ad hoc estimates of θ and Σ. Also obtain an estimate of
$\sigma^2$ by combining the information from the residuals across the groups.

• Now we will perform an analysis of the data using the following distributions as prior
distributions:

$$\Sigma^{-1} \sim \text{Wishart}(4, \hat\Sigma^{-1})$$
$$\theta \sim \text{multivariate normal}(\hat\theta, \hat\Sigma)$$
$$\sigma^2 \sim \text{inverse-gamma}(1, \hat\sigma^2)$$

where $\hat\theta$, $\hat\Sigma$ and $\hat\sigma^2$ are the estimates you obtained in the least
squares step above. Note that this analysis is not combining prior information with information
from the data, as the “prior” distribution is based on the observed data. However, such an
analysis can be roughly interpreted as the Bayesian analysis of an individual who has weak but
unbiased prior information.

• Use a Gibbs sampler to approximate posterior expectations of β for each group j, and plot
the resulting regression lines. Compare to the least squares regression lines from the first
part above and describe why you see any differences between the two sets of regression lines.

• From your posterior samples, plot marginal posterior and prior densities of θ and the
elements of Σ. Discuss the evidence that the slopes or intercepts vary across groups.

• Suppose we want to identify the planting density that maximizes average yield over a random
sample of plots. Find the value xmax of x that maximizes expected yield, and provide a
95% posterior predictive interval for the yield of a randomly sampled plot having planting
density xmax.
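
The least squares step in the first bullet of A6.2 can be sketched as follows. This is an illustration under stated assumptions, not the sheet's prescribed method: the function name `ols_by_group` is hypothetical, the synthetic arrays stand in for pdensity.dat, θ is estimated by the average of the group coefficients, Σ by their sample covariance, and σ² by the pooled residual sum of squares divided by the pooled residual degrees of freedom. Other ad hoc choices are possible.

```python
import numpy as np

def ols_by_group(x, y, group):
    """Fit y = b1 + b2*x + b3*x^2 separately in each group by least squares."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    betas, ssr, df = [], 0.0, 0
    for g in np.unique(group):
        idx = group == g
        b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        resid = y[idx] - X[idx] @ b
        betas.append(b)
        ssr += resid @ resid              # accumulate residual sum of squares
        df += idx.sum() - X.shape[1]      # residual degrees of freedom per group
    betas = np.array(betas)
    theta_hat = betas.mean(axis=0)            # average of the group coefficients
    Sigma_hat = np.cov(betas, rowvar=False)   # sample covariance across groups
    sigma2_hat = ssr / df                     # pooled residual variance
    return betas, theta_hat, Sigma_hat, sigma2_hat

# Synthetic stand-in for pdensity.dat (an assumption): 10 plots, 8 subplots each,
# densities 2, 4, 6, 8 with two subplots per density.
rng = np.random.default_rng(1)
m = 10
density = np.tile(np.repeat([2.0, 4.0, 6.0, 8.0], 2), m)
plot = np.repeat(np.arange(m), 8)
beta_true = np.array([2.0, 1.5, -0.12]) + rng.normal(scale=[0.5, 0.1, 0.01], size=(m, 3))
Xd = np.column_stack([np.ones_like(density), density, density ** 2])
yield_ = np.sum(Xd * beta_true[plot], axis=1) + rng.normal(scale=0.5, size=m * 8)

betas, theta_hat, Sigma_hat, sigma2_hat = ols_by_group(density, yield_, plot)
```

Each row of `betas` defines one group's fitted curve, so evaluating `b[0] + b[1] * x + b[2] * x ** 2` on a grid of x values for every row gives the lines to plot when showing heterogeneity across groups.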

Sheet 6 is due on Dec. 23rd. Submit your solutions before 5:00 pm that day.