CS698X Topics in Probabilistic Modeling and Inference Homework 1


Problem 1 (14 marks)
(When You Integrate Out..) Suppose x is a scalar random variable drawn from a univariate Gaussian p(x|η) = N(x|0, η). The variance η itself is drawn from an exponential distribution: p(η|γ) = Exp(η|γ^2/2), where γ > 0. Note that the exponential distribution is defined as Exp(x|λ) = λ exp(−λx). Derive the expression for the marginal distribution of x, i.e., p(x|γ) = ∫ p(x|η) p(η|γ) dη, after integrating out η. What does the marginal distribution p(x|γ) mean?
Plot both p(x|η) and p(x|γ) and include the plots in the writeup PDF itself. What difference do you see between the shapes of these two distributions? Note: You don't need to submit the code used to generate the plots. Just the plots (appropriately labeled) are fine.
Hint: You will notice that ∫ p(x|η) p(η|γ) dη is a hard-to-compute integral. However, the solution does have a closed-form expression. One way to get the result is to compute the moment generating function (MGF) of ∫ p(x|η) p(η|γ) dη (note that this is a p.d.f.); recall that the MGF of a p.d.f. p(x) is defined as M_X(t) = ∫_{−∞}^{∞} e^{tx} p(x) dx. Compare the obtained MGF expression with the MGFs of various p.d.f.s given in the table on the following Wikipedia page: https://en.wikipedia.org/wiki/Moment-generating_function, and identify which p.d.f.'s MGF it matches. That will give you the form of the distribution p(x|γ). Specifically, name this distribution and identify its parameters.
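If it helps with the plotting part, here is a minimal sketch (assuming γ = 1 and a fixed η = 1 purely for illustration) that plots p(x|η) directly and approximates p(x|γ) by numerical quadrature over η, so it does not presuppose the closed form you are asked to derive:

import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.stats import norm

gamma = 1.0        # hyperparameter of the exponential prior (assumed value)
eta_fixed = 1.0    # a fixed variance for plotting p(x|eta) (assumed value)
rate = gamma**2 / 2.0
xs = np.linspace(-5, 5, 401)

# p(x|eta): a zero-mean Gaussian with (fixed) variance eta
p_x_given_eta = norm.pdf(xs, loc=0.0, scale=np.sqrt(eta_fixed))

# p(x|gamma): integrate N(x|0, eta) * Exp(eta|gamma^2/2) over eta numerically
def marginal(x):
    integrand = lambda eta: norm.pdf(x, 0.0, np.sqrt(eta)) * rate * np.exp(-rate * eta)
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return value

p_x_given_gamma = np.array([marginal(x) for x in xs])

plt.plot(xs, p_x_given_eta, label="p(x|eta), eta = 1")
plt.plot(xs, p_x_given_gamma, label="p(x|gamma), gamma = 1 (numerical)")
plt.xlabel("x"); plt.ylabel("density"); plt.legend(); plt.show()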
Problem 2 (14 marks)
(It Gets Better..) Recall that, for a Bayesian linear regression model with likelihood p(y|x, w) = N(w^T x, β^{-1}) and prior p(w) = N(0, λ^{-1} I), the predictive posterior is p(y*|x*) = N(μ_N^T x*, β^{-1} + x*^T Σ_N x*) = N(μ_N^T x*, σ_N^2(x*)), where we have defined σ_N^2(x*) = β^{-1} + x*^T Σ_N x*, and μ_N and Σ_N are the mean and covariance matrix of the Gaussian posterior on w, s.t. μ_N = Σ_N (β ∑_{n=1}^{N} y_n x_n) and Σ_N = (β ∑_{n=1}^{N} x_n x_n^T + λI)^{-1}.
Here, we have used the subscript N to denote that the model is learned using N training examples. As the training set size N increases, what happens to the variance of the predictive posterior? Does it increase, decrease, or remain the same? You must also prove your answer formally. You may find the following matrix identity useful:
(M + vv^T)^{-1} = M^{-1} − (M^{-1} v)(v^T M^{-1}) / (1 + v^T M^{-1} v),
where M denotes a square matrix and v denotes a column vector.
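As a quick sanity check on this identity (often called the Sherman–Morrison formula), the following sketch verifies it numerically on a randomly generated M and v (both assumed here for illustration only):

import numpy as np

rng = np.random.default_rng(0)
D = 4
A = rng.standard_normal((D, D))
M = A @ A.T + D * np.eye(D)   # a well-conditioned square matrix
v = rng.standard_normal(D)

lhs = np.linalg.inv(M + np.outer(v, v))
Minv = np.linalg.inv(M)
rhs = Minv - np.outer(Minv @ v, v @ Minv) / (1.0 + v @ Minv @ v)

print(np.allclose(lhs, rhs))  # True: the identity holds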
Problem 3 (10 marks)
(Distribution of Empirical Mean of Gaussian Observations) Consider N scalar-valued observations x_1, . . . , x_N drawn i.i.d. from N(μ, σ^2). Consider their empirical mean x̄ = (1/N) ∑_{n=1}^{N} x_n. Representing the empirical mean as a linear transformation of a random variable, derive the probability distribution of x̄. Briefly explain why the result makes intuitive sense.
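A small simulation can be used to check whatever distribution you derive; the sketch below (with μ = 2, σ = 3, N = 10 assumed purely for illustration) draws many replicate datasets and reports the empirical mean and variance of x̄ for comparison against your answer:

import numpy as np

mu, sigma, N, trials = 2.0, 3.0, 10, 100_000  # all values assumed
rng = np.random.default_rng(1)
xbar = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)

print("mean of xbar:", xbar.mean())  # compare with your derived mean
print("var of xbar:", xbar.var())    # compare with your derived variance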
Problem 4 (20 marks)
(Benefits of Probabilistic Joint Modeling-1) Consider a dataset of test scores of students from M schools in a district: x = {x^{(m)}}_{m=1}^{M} = {x_1^{(m)}, . . . , x_{N_m}^{(m)}}_{m=1}^{M}, where N_m denotes the number of students in school m. Assume the scores of students in school m are drawn independently as x_n^{(m)} ∼ N(μ_m, σ^2), where the Gaussian's mean μ_m is unknown and the variance σ^2 is the same for all schools and known (for simplicity). Assume the means μ_1, . . . , μ_M of the M Gaussians to also be Gaussian distributed, μ_m ∼ N(μ_0, σ_0^2), where μ_0 and σ_0^2 are hyperparameters.
1. Assume the hyperparameters μ_0 and σ_0^2 to be known. Derive the posterior distribution of μ_m and write down the mean and variance of this posterior distribution. Note: While you can derive it the usual way, the derivation will be much more compact if you use the result of Problem 2 and think of each school's data as a single observation (the empirical mean of observations) having the distribution derived in Problem 3. A numerical sanity check for this part is sketched after this problem.
2. Assume the hyperparameter μ_0 to be unknown (but still keep σ_0^2 as fixed for simplicity). Derive the marginal likelihood p(x|μ_0, σ^2, σ_0^2) and use MLE-II to estimate μ_0 (note again that σ^2 and σ_0^2 are known here). Note: Looking at the form/expression of the marginal likelihood, if the MLE-II result looks obvious to you, you may skip the derivation and directly write the result.
3. Consider using this MLE-II estimate of μ_0 from part (2) in the posteriors of each μ_m you derived in part (1). Do you see any benefit in using the MLE-II estimate of μ_0 as opposed to using a known value of μ_0?
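For part (1), a grid-based sanity check like the one below (the hyperparameters μ_0 = 0, σ_0 = 2, σ = 1 and the simulated scores are all assumed) evaluates the unnormalized posterior of μ_m numerically, so you can compare its mean and variance against your closed-form expressions:

import numpy as np
from scipy.stats import norm

mu0, sigma0, sigma = 0.0, 2.0, 1.0      # assumed hyperparameters
rng = np.random.default_rng(2)
x_m = rng.normal(1.5, sigma, size=20)   # assumed scores for one school

# Unnormalized log posterior of mu_m on a fine grid: log prior + log likelihood
grid = np.linspace(-5, 5, 20_001)
log_post = norm.logpdf(grid, mu0, sigma0) \
         + norm.logpdf(x_m[:, None], grid[None, :], sigma).sum(axis=0)
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, grid)            # normalize numerically

mean = np.trapz(grid * post, grid)
var = np.trapz((grid - mean) ** 2 * post, grid)
print("posterior mean:", mean, " posterior variance:", var)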
Problem 5 (12 marks)
(Benefits of Probabilistic Joint Modeling-2) Suppose we have student data from M schools, where N_m denotes the number of students in school m. The data for each school m = 1, . . . , M is in the following form: for student n in school m, there is a response variable (e.g., score in some exam) y_n^{(m)} ∈ R and a feature vector x_n^{(m)} ∈ R^D.
Assume a linear regression model for these scores, i.e., p(y_n^{(m)}|x_n^{(m)}, w_m) = N(y_n^{(m)}|w_m^T x_n^{(m)}, β^{-1}), where w_m ∈ R^D denotes the regression weight vector for school m, and β is known. Note that this can also be denoted as p(y^{(m)}|X^{(m)}, w_m) = N(y^{(m)}|X^{(m)} w_m, β^{-1} I_{N_m}), where y^{(m)} is N_m × 1 and X^{(m)} is N_m × D. Assume a prior p(w_m) = N(w_m|w_0, λ^{-1} I_D), with λ known and w_0 unknown.
Derive the expression for the log of the MLE-II objective for estimating w_0. You do not need to optimize this objective w.r.t. w_0; just writing down the final expression of the objective function is fine. Also state the benefit of this approach, as opposed to fixing w_0 to some value, if our goal is to learn the school-specific weight vectors w_1, . . . , w_M. (Feel free to make direct use of properties of Gaussian distributions.)
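If you want to check the marginal you derive, the Monte Carlo sketch below (the sizes, precisions, X, and w_0 are all assumed for illustration) samples w_m from the prior and y from the likelihood for one school, then prints the empirical mean and covariance of y for comparison against your analytical Gaussian:

import numpy as np

rng = np.random.default_rng(3)
Nm, D, beta, lam = 5, 3, 4.0, 2.0   # assumed sizes and precisions
X = rng.standard_normal((Nm, D))    # one school's (assumed) feature matrix
w0 = rng.standard_normal(D)         # a fixed w0 for the check

S = 200_000
W = w0 + rng.standard_normal((S, D)) / np.sqrt(lam)         # w_m ~ N(w0, I/lam)
Y = W @ X.T + rng.standard_normal((S, Nm)) / np.sqrt(beta)  # y | w_m, X

print("empirical mean:", Y.mean(axis=0))   # compare with your derived mean
print("empirical cov:\n", np.cov(Y.T))     # compare with your derived covariance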
Problem 6 (30 marks): Programming Assignment
(Bayesian Linear Regression) Consider a toy dataset consisting of 10 training examples {x_n, y_n}_{n=1}^{10}, with each input x_n as well as the output y_n being a scalar. The data is given below.
x = [−2.23, −1.30, −0.42, 0.30, 0.33, 0.52, 0.87, 1.80, 2.74, 3.62];
y = [1.01, 0.69, −0.66, −1.34, −1.75, −0.98, 0.25, 1.57, 1.65, 1.51]
We would like to learn a Bayesian linear regression model using this data, assuming a Gaussian likelihood model for the outputs with fixed noise precision β = 4. However, instead of working with the original scalar-valued inputs, we will map each input x using a degree-k polynomial as φ_k(x) = [1, x, x^2, . . . , x^k]^T. Note that, when using the mapping φ_k, each original input becomes (k + 1)-dimensional. Denote the entire set of mapped inputs as φ_k(x), a 10 × (k + 1) matrix. Consider k = 1, 2, 3, and 4, and learn a Bayesian linear regression model for each case. Assume the following prior on the regression weights: p(w) = N(w|0, I), with w ∈ R^{k+1}.
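A minimal starter sketch for the data and the degree-k feature map is given below (the helper name phi is an assumption; the Bayesian linear regression itself is left for you to implement, e.g., from the formulas in Problem 2):

import numpy as np

x = np.array([-2.23, -1.30, -0.42, 0.30, 0.33, 0.52, 0.87, 1.80, 2.74, 3.62])
y = np.array([1.01, 0.69, -0.66, -1.34, -1.75, -0.98, 0.25, 1.57, 1.65, 1.51])
beta = 4.0  # fixed noise precision

def phi(x, k):
    """Map inputs to the 10 x (k+1) design matrix [1, x, x^2, ..., x^k]."""
    return np.vander(x, N=k + 1, increasing=True)

Phi = phi(x, k=3)   # e.g., the degree-3 design matrix
print(Phi.shape)    # (10, 4)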
1. For each k, compute the posterior of w and show a plot with 10 random functions drawn from the inferred posterior (show the functions for the input range x ∈ [−4, 4]). Also show the original training examples on the same plot to illustrate how well the functions fit the training data.
2. For each k, compute and plot the mean of the posterior predictive p(y*|φ_k(x*), φ_k(x), y, β) on the interval x* ∈ [−4, 4]. On the same plot, also show the predictive posterior mean plus and minus two times the predictive posterior standard deviation.
3. Compute the log marginal likelihood log p(y|φ_k(x), β) of the training data for each of the 4 mappings k = 1, 2, 3, 4. Which of these 4 "models" seems to explain the data the best?
4. Using the MAP estimate w_MAP, compute the log likelihood log p(y|w_MAP, φ_k(x), β) for each k. Which of these 4 models seems to have the highest log likelihood? Is your answer the same as that based on the log marginal likelihood (part 3)? Which of these two criteria (highest log likelihood or highest log marginal likelihood) do you think is more reasonable for selecting the best model, and why?
5. For your best model, suppose you could include an additional training input x′ (along with its output y′) to "improve" your learned model using this additional example. Where in the region x ∈ [−4, 4] would you like the chosen x′ to be? Explain your answer briefly.
Your implementation should be in a Python notebook (and should not use an existing implementation of Bayesian linear regression from any library).
Submit the plots as well as the code in a single zip file (named yourrollnumber.zip).