Homework Assignment 5: Probability Density Function


1. The probability density function of the normal distribution is defined as
$$f(x) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right),$$
where
$$Z = \int_{x \in \mathbb{R}^d} \exp\left( -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right) dx = (2\pi)^{d/2} |\Sigma|^{1/2},$$
where |Σ| is the determinant of the covariance matrix.
Let us assume that the covariance matrix Σ is a diagonal matrix, as below:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_d^2 \end{pmatrix}.$$
The probability density function simplifies to
$$f(x) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{1}{2} \frac{1}{\sigma_i^2} (x_i - \mu_i)^2 \right).$$
Show that this is indeed true.
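Before doing the algebra, it can help to sanity-check the claim numerically. The sketch below is not part of the assignment; it simply compares the full multivariate density with a diagonal Σ against the product of univariate Gaussian densities at one point. The means, variances, and test point are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Arbitrary illustrative parameters (d = 3).
mu = np.array([0.5, -1.0, 2.0])
sigma2 = np.array([0.3, 1.5, 0.8])       # diagonal entries sigma_i^2
Sigma = np.diag(sigma2)                  # diagonal covariance matrix
x = np.array([0.1, -0.4, 1.7])           # an arbitrary test point

# Full multivariate normal density f(x) with covariance Sigma.
full_pdf = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# Product of d univariate normal densities, one per coordinate.
factored_pdf = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))

print(full_pdf, factored_pdf)            # the two values should agree
```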
2.
(a) Show that the following equation, called Bayes’ rule, is true.
$$p(Y|X) = \frac{p(X|Y)\,p(Y)}{p(X)}.$$
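As a quick numerical check of part (a), one can tabulate a small discrete joint distribution and compare the two sides of the identity; the joint table below is made up purely for illustration.

```python
import numpy as np

# A made-up joint distribution p(X, Y) over two binary variables.
# Rows index X in {0, 1}, columns index Y in {0, 1}; entries sum to 1.
p_xy = np.array([[0.10, 0.25],
                 [0.30, 0.35]])

p_x = p_xy.sum(axis=1)                   # marginal p(X)
p_y = p_xy.sum(axis=0)                   # marginal p(Y)

p_y_given_x = p_xy / p_x[:, None]        # p(Y|X) = p(X, Y) / p(X)
p_x_given_y = p_xy / p_y[None, :]        # p(X|Y) = p(X, Y) / p(Y)

# Bayes' rule: p(Y|X) should equal p(X|Y) p(Y) / p(X).
rhs = p_x_given_y * p_y[None, :] / p_x[:, None]
print(np.allclose(p_y_given_x, rhs))     # expected: True
```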
(b) We learned the definition of expectation:
$$E[X] = \sum_{x \in \Omega} x\, p(x).$$
Assuming that X and Y are discrete random variables, show that
$$E[X + Y] = E[X] + E[Y].$$
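Before proving it, the identity in part (b) can be checked on an arbitrary small joint distribution; the outcome values and probabilities below are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Made-up outcomes and joint probabilities p(X = x_i, Y = y_j); entries sum to 1.
x_vals = np.array([0.0, 1.0, 3.0])
y_vals = np.array([-1.0, 2.0])
p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.1],
                 [0.2, 0.1]])

E_X = np.sum(x_vals * p_xy.sum(axis=1))  # E[X] from the marginal of X
E_Y = np.sum(y_vals * p_xy.sum(axis=0))  # E[Y] from the marginal of Y

# E[X + Y] computed directly from the joint distribution.
E_sum = np.sum((x_vals[:, None] + y_vals[None, :]) * p_xy)

print(np.isclose(E_sum, E_X + E_Y))      # expected: True
```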
(c) Further assume that $c \in \mathbb{R}$ is a scalar and not a random variable; show that
$$E[cX] = c\,E[X].$$
(d) We learned the definition of variance:
$$\mathrm{Var}(X) = \sum_{x \in \Omega} (x - E[X])^2\, p(x).$$
Assuming that X is a discrete random variable, show that
$$\mathrm{Var}(X) = E\left[X^2\right] - (E[X])^2.$$
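The identities in parts (c) and (d) can be verified numerically in the same spirit on an arbitrary discrete distribution before deriving them; the values below are illustrative only.

```python
import numpy as np

# Arbitrary discrete distribution for X (illustrative values only).
x_vals = np.array([-2.0, 0.5, 1.0, 4.0])
p_x = np.array([0.2, 0.3, 0.4, 0.1])       # probabilities sum to 1

E_X = np.sum(x_vals * p_x)                 # E[X]
E_X2 = np.sum(x_vals**2 * p_x)             # E[X^2]

var_def = np.sum((x_vals - E_X)**2 * p_x)  # definition: sum of (x - E[X])^2 p(x)
var_id = E_X2 - E_X**2                     # claimed identity: E[X^2] - (E[X])^2

c = 3.0                                    # part (c): E[cX] = c E[X]
E_cX = np.sum(c * x_vals * p_x)

print(np.isclose(var_def, var_id), np.isclose(E_cX, c * E_X))  # expected: True True
```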
3. An optimal linear regression machine (without any regularization term) that minimizes the empirical cost function given a training set
$$D_{\mathrm{tra}} = \{(x_1, y^*_1), \ldots, (x_N, y^*_N)\},$$
can be found directly without any gradient-based optimization algorithm. Assuming
that the distance function is defined as
$$D(M^*(x), M, x) = \frac{1}{2} \left\| M^*(x) - M(x) \right\|_2^2 = \frac{1}{2} \sum_{k=1}^{q} (y^*_k - y_k)^2,$$
derive the optimal weight matrix W. (Hint: Moore–Penrose pseudoinverse)
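A minimal numerical sketch of the closed-form solution the hint points at (this is not the required derivation): with synthetic data and an assumed linear model $M(x) = W^{\top} x$ mapping d inputs to q outputs, the least-squares W can be read off from the Moore–Penrose pseudoinverse and cross-checked against a least-squares solver. The data shapes, the noise level, and the absence of a bias term are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, q = 200, 5, 3                        # assumed sizes: N examples, d inputs, q outputs
X = rng.normal(size=(N, d))                # design matrix, one example per row
W_true = rng.normal(size=(d, q))           # ground-truth weights used to generate data
Y = X @ W_true + 0.01 * rng.normal(size=(N, q))   # targets y*_n with a little noise

# Closed-form least-squares solution via the Moore-Penrose pseudoinverse:
# W = X^+ Y minimizes (1/2) * sum_n ||y*_n - W^T x_n||_2^2 over W.
W_pinv = np.linalg.pinv(X) @ Y

# Cross-check against numpy's least-squares solver.
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(W_pinv, W_lstsq))        # expected: True
```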
4. Suppose that we have a data distribution $Y = f(X) + \varepsilon$, where $X$ is a random vector, $\varepsilon$ is an independent random variable with zero mean and fixed but unknown variance $\sigma^2$, and $f$ is an unknown deterministic function that maps a vector into a scalar.
Now, we wish to approximate $f(x)$ with our own model $\hat{f}(x; \Theta)$ that has some learnable parameters $\Theta$.
(a) Show that, considering all possible $\hat{f}$ and $\Theta$, the minimum of the L2 loss
$$E_X\!\left[(Y - \hat{f}(X; \Theta))^2\right]$$
is achieved when, for all $x$,
$$\hat{f}(x; \Theta) = f(x).$$
(Hint: find the minimum of the L2 loss for a single example first.)
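Following the hint, the single-example case can also be explored empirically: fix one input x, fix an assumed true value f(x) and noise level, and estimate the expected squared error for a grid of candidate predictions. The specific numbers below are arbitrary; the empirical minimizer should land near f(x).

```python
import numpy as np

rng = np.random.default_rng(0)

f_x = 2.0                                  # assumed true value f(x) at one fixed x
sigma = 0.5                                # assumed noise standard deviation
y = f_x + rng.normal(scale=sigma, size=1_000_000)   # samples of Y = f(x) + eps at this x

# Monte Carlo estimate of E[(Y - fhat)^2] for a grid of candidate predictions fhat.
fhat_grid = np.linspace(0.0, 4.0, 81)
losses = [np.mean((y - fhat)**2) for fhat in fhat_grid]

print(fhat_grid[np.argmin(losses)])        # should land near f_x = 2.0
```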
(b) If we train the same model with varying initializations and with examples drawn from the underlying data distribution, we may end up with different $\Theta$. So we can also consider $\Theta$ as a random variable if we fix $\hat{f}$.
Show that for a single unseen input vector $x_0$ and a fixed $\hat{f}$, the expected squared error between the ground truth $y_0 = f(x_0) + \varepsilon$ and the prediction $\hat{f}(x_0; \Theta)$ can be decomposed into
$$E\!\left[(y_0 - \hat{f}(x_0; \Theta))^2\right] = \left(E\!\left[f(x_0) - \hat{f}(x_0; \Theta)\right]\right)^2 + \mathrm{Var}\!\left[\hat{f}(x_0; \Theta)\right] + \sigma^2.$$
(Side note: this is usually known as the bias-variance decomposition, which is closely related to the bias-variance tradeoff and to other concepts such as underfitting and overfitting.)
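A simulation in the spirit of part (b) (illustrative only; the true function f, the polynomial model class, the noise level, and all sizes are assumptions): repeatedly draw training sets, refit the same model, and compare the empirical expected squared error at a held-out $x_0$ against bias² + variance + σ².

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                  # assumed (normally unknown) true function
    return np.sin(3.0 * x)

sigma = 0.3                                # assumed noise standard deviation
x0 = 0.4                                   # a single unseen input
degree, n_train, n_trials = 3, 30, 5000    # assumed model class and sampling sizes

preds = np.empty(n_trials)
for t in range(n_trials):
    x = rng.uniform(-1.0, 1.0, size=n_train)             # fresh training inputs
    y = f(x) + rng.normal(scale=sigma, size=n_train)      # noisy training targets
    coeffs = np.polyfit(x, y, deg=degree)                 # fit fhat (a cubic polynomial)
    preds[t] = np.polyval(coeffs, x0)                     # prediction at x0 for this model

# Left-hand side: empirical E[(y0 - fhat(x0; Theta))^2] over models and noise.
y0 = f(x0) + rng.normal(scale=sigma, size=n_trials)
lhs = np.mean((y0 - preds)**2)

# Right-hand side: bias^2 + variance + sigma^2.
bias2 = (f(x0) - preds.mean())**2
variance = preds.var()
print(lhs, bias2 + variance + sigma**2)    # the two values should be close
```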