## Description

1. The probability density function of the normal distribution is defined as

$$
f(x) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right),
$$

where

$$
Z = \int_{x \in \mathbb{R}^d} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right) \mathrm{d}x = (2\pi)^{d/2} |\Sigma|^{1/2},
$$

and $|\Sigma|$ is the determinant of the covariance matrix.

Let us assume that the covariance matrix $\Sigma$ is a diagonal matrix, as below:

$$
\Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_d^2
\end{pmatrix}.
$$

The probability density function then simplifies to

$$
f(x) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{1}{2} \frac{1}{\sigma_i^2} (x_i - \mu_i)^2 \right).
$$

Show that this is indeed true.
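This factorization can also be sanity-checked numerically (a check, not a proof). The sketch below uses arbitrary toy values for $\mu$ and $\sigma_i$ and compares the full multivariate density against the product of univariate densities.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
sigma = rng.uniform(0.5, 2.0, size=d)      # standard deviations sigma_i (toy values)
Sigma = np.diag(sigma ** 2)                # diagonal covariance matrix
x = rng.normal(size=d)                     # an arbitrary evaluation point

# Full multivariate density: (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
f_full = np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

# Product of d univariate normal densities
f_prod = np.prod(np.exp(-0.5 * (diff / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma))

print(np.isclose(f_full, f_prod))  # True
```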


2. (a) Show that the following equation, called Bayes' rule, is true:

$$
p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}.
$$
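Bayes' rule can be sanity-checked on a small discrete joint distribution; the pmf below is a hypothetical toy example, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(1)
# A random joint pmf p(X, Y) over 4 x-values and 3 y-values (toy assumption)
joint = rng.random((4, 3))
joint /= joint.sum()

p_x = joint.sum(axis=1)             # marginal p(X)
p_y = joint.sum(axis=0)             # marginal p(Y)
p_y_given_x = joint / p_x[:, None]  # p(Y|X) = p(X, Y) / p(X)
p_x_given_y = joint / p_y[None, :]  # p(X|Y) = p(X, Y) / p(Y)

# Bayes' rule: p(Y|X) == p(X|Y) p(Y) / p(X)
bayes = p_x_given_y * p_y[None, :] / p_x[:, None]
print(np.allclose(p_y_given_x, bayes))  # True
```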

(b) We learned the definition of expectation:

$$
E[X] = \sum_{x \in \Omega} x\, p(x).
$$

Assuming that X and Y are discrete random variables, show that

$$
E[X + Y] = E[X] + E[Y].
$$

(c) Further assume that $c \in \mathbb{R}$ is a scalar and not a random variable. Show that

$$
E[cX] = c\,E[X].
$$
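Both identities in (b) and (c) can be checked numerically on a toy joint pmf; the support values and probabilities below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = np.array([0.0, 1.0, 2.0])   # support of X (toy values)
ys = np.array([-1.0, 3.0])       # support of Y (toy values)
joint = rng.random((3, 2))
joint /= joint.sum()             # a random joint pmf p(x, y)

E_X = np.sum(xs[:, None] * joint)
E_Y = np.sum(ys[None, :] * joint)

# Linearity: E[X + Y] = E[X] + E[Y]
E_sum = np.sum((xs[:, None] + ys[None, :]) * joint)
print(np.isclose(E_sum, E_X + E_Y))  # True

# Scaling: E[cX] = c E[X]
c = 5.0
E_cX = np.sum(c * xs[:, None] * joint)
print(np.isclose(E_cX, c * E_X))  # True
```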

(d) We learned the definition of variance:

$$
\mathrm{Var}(X) = \sum_{x \in \Omega} (x - E[X])^2\, p(x).
$$

Assuming that X is a discrete random variable, show that

$$
\mathrm{Var}(X) = E[X^2] - (E[X])^2.
$$
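The identity can likewise be checked numerically on an arbitrary toy pmf (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.array([1.0, 2.0, 5.0, 7.0])   # support of X (toy values)
p = rng.random(4)
p /= p.sum()                          # pmf p(x)

E_X = np.sum(xs * p)
var_def = np.sum((xs - E_X) ** 2 * p)        # definition: sum_x (x - E[X])^2 p(x)
var_id = np.sum(xs ** 2 * p) - E_X ** 2      # identity: E[X^2] - (E[X])^2
print(np.isclose(var_def, var_id))  # True
```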


3. An optimal linear regression machine (without any regularization term) that minimizes the empirical cost function given a training set

$$
D_{\text{tra}} = \{ (x_1, y_1^*), \ldots, (x_N, y_N^*) \}
$$

can be found directly, without any gradient-based optimization algorithm. Assuming that the distance function is defined as

$$
D(M^*(x), M, x) = \frac{1}{2} \left\| M^*(x) - M(x) \right\|_2^2 = \frac{1}{2} \sum_{k=1}^{q} (y_k^* - y_k)^2,
$$

derive the optimal weight matrix $W$. (Hint: Moore–Penrose pseudoinverse.)
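The closed-form solution hinted at can be sketched numerically, assuming a linear model $M(x) = Wx$ with the inputs stacked row-wise into a design matrix $X$; all sizes and data below are toy assumptions, and the derivation itself is left to the problem.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, q = 50, 4, 2                 # toy sizes: N examples, d input dims, q outputs
X = rng.normal(size=(N, d))        # rows are training inputs x_n
W_true = rng.normal(size=(q, d))
Y = X @ W_true.T                   # targets y*_n = W_true x_n (noiseless toy data)

# Least-squares solution via the Moore-Penrose pseudoinverse:
# W^T = X^+ Y, i.e. W = (X^+ Y)^T, with X^+ = pinv(X)
W_hat = (np.linalg.pinv(X) @ Y).T
print(np.allclose(W_hat, W_true))  # True (noiseless data, X has full column rank)
```

With noisy targets the recovered `W_hat` would no longer equal `W_true` exactly, but it would still minimize the summed squared error.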


4. Suppose that we have a data distribution $Y = f(X) + \varepsilon$, where $X$ is a random vector, $\varepsilon$ is an independent random variable with zero mean and fixed but unknown variance $\sigma^2$, and $f$ is an unknown deterministic function that maps a vector to a scalar. Now, we wish to approximate $f(x)$ with our own model $\hat{f}(x; \Theta)$ with some learnable parameters $\Theta$.

(a) Show that, considering all possible $\hat{f}$ and $\Theta$, the minimum of the L2 loss

$$
E_X\!\left[ \left( Y - \hat{f}(X; \Theta) \right)^2 \right]
$$

is achieved when, for all $x$,

$$
\hat{f}(x; \Theta) = f(x).
$$

(Hint: find the minimum of the L2 loss for a single example first.)

(b) If we train the same model with varying initializations and examples drawn from the underlying data distribution, we may end up with different $\Theta$. So, for a fixed $\hat{f}$, we can also treat $\Theta$ as a random variable.

Show that for a single unseen input vector $x_0$ and a fixed $\hat{f}$, the expected squared error between the ground truth $y_0 = f(x_0) + \varepsilon$ and the prediction $\hat{f}(x_0; \Theta)$ can be decomposed as

$$
E\!\left[ \left( y_0 - \hat{f}(x_0; \Theta) \right)^2 \right] = \left( E\!\left[ f(x_0) - \hat{f}(x_0; \Theta) \right] \right)^2 + \mathrm{Var}\!\left[ \hat{f}(x_0; \Theta) \right] + \sigma^2.
$$

(Side note: this is usually known as the bias-variance decomposition, closely related to the bias-variance tradeoff and to other concepts such as underfitting and overfitting.)
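The decomposition can be illustrated with a Monte Carlo sketch; the ground-truth value and the distribution of $\hat{f}(x_0; \Theta)$ below are arbitrary assumptions used only to check that the two sides agree.

```python
import numpy as np

rng = np.random.default_rng(5)
f_x0 = 2.0            # ground-truth value f(x0) (toy choice)
sigma = 0.5           # noise standard deviation of epsilon

n = 1_000_000
# Predictions fhat(x0; Theta) over many hypothetical retrained models
preds = rng.normal(loc=1.7, scale=0.3, size=n)
eps = rng.normal(scale=sigma, size=n)   # noise, independent of the predictions
y0 = f_x0 + eps

lhs = np.mean((y0 - preds) ** 2)                               # E[(y0 - fhat)^2]
rhs = np.mean(f_x0 - preds) ** 2 + np.var(preds) + sigma ** 2  # bias^2 + variance + sigma^2
print(lhs, rhs)  # the two sides agree up to Monte Carlo error
```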