## Description

Problem 1 (Variance and covariance, 6 points)

Let X and Y be two continuous independent random variables.

(a) Starting from the definition of independence, show that the independence of X and Y implies that their covariance is zero.

(b) For a scalar constant a, show the following two properties, starting from the definition of expectation:

$$E(X + aY) = E(X) + aE(Y)$$

$$\mathrm{var}(X + aY) = \mathrm{var}(X) + a^2\,\mathrm{var}(Y)$$
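A quick Monte Carlo check can help confirm the two identities before attempting the proof. This is only a sketch: the distributions of X and Y, the constant `a`, and the sample size are arbitrary assumptions, and agreement up to sampling noise is not a substitute for the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
a = 2.5  # arbitrary scalar constant (assumed value)

# Independent draws; the two distributions are arbitrary illustrative choices.
x = rng.normal(loc=1.0, scale=2.0, size=n)
y = rng.uniform(low=0.0, high=3.0, size=n)

# Sample covariance of X and Y; should be close to zero under independence.
cov_xy = np.cov(x, y)[0, 1]

# Both sides of the expectation identity, estimated from the samples.
lhs_mean = np.mean(x + a * y)
rhs_mean = np.mean(x) + a * np.mean(y)

# Both sides of the variance identity, estimated from the samples.
lhs_var = np.var(x + a * y)
rhs_var = np.var(x) + a**2 * np.var(y)

print("cov(X, Y)  ~", cov_xy)
print("mean sides ~", lhs_mean, rhs_mean)
print("var sides  ~", lhs_var, rhs_var)
```

The two variance estimates differ by exactly `2 * a` times the sample covariance, which is why they agree only approximately for a finite sample.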

Problem 2 (Densities, 5 points)

Answer the following questions:

(a) Can a probability density function (pdf) ever take values greater than 1?

(b) Let X be a univariate normally distributed random variable with mean 0 and variance 1/100. What is the pdf of X?

(c) What is the value of this pdf at 0?

(d) What is the probability that X = 0?
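To build intuition for parts (a) to (c), the density can be evaluated numerically. The sketch below uses the standard normal density formula; the helper name `normal_pdf`, the grid range, and the step size are illustrative assumptions.

```python
import math

def normal_pdf(x, mean=0.0, variance=1.0):
    """Density of a univariate normal with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )

# Density at 0 for mean 0 and variance 1/100.
density_at_zero = normal_pdf(0.0, mean=0.0, variance=1.0 / 100.0)
print("pdf at 0:", density_at_zero)

# Riemann-sum check that the density integrates to about 1 on [-1, 1],
# which covers ten standard deviations on each side of the mean.
step = 1e-4
grid = [-1.0 + step * i for i in range(20001)]
total = sum(normal_pdf(t, variance=0.01) for t in grid) * step
print("approximate integral:", total)
```

Comparing the printed density with the approximate integral shows the difference between the height of a density and the probability mass it assigns to an interval.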

Problem 3 (Calculus, 4 points)

Let $x, y \in \mathbb{R}^m$ and $A \in \mathbb{R}^{m \times m}$. Please answer the following questions, writing your answers in vector notation.

(a) What is the gradient with respect to $x$ of $x^T y$?

(b) What is the gradient with respect to $x$ of $x^T x$?

(c) What is the gradient with respect to $x$ of $x^T A$?

(d) What is the gradient with respect to $x$ of $x^T A x$?
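A derived gradient can be compared against a finite-difference approximation. The sketch below is one way to do this for the scalar-valued cases (a), (b), and (d); the helper `numerical_gradient`, the dimension `m = 4`, and the random test values are assumptions made for illustration. Part (c) is vector-valued, so this helper does not apply to it directly.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
m = 4  # illustrative dimension
x = rng.normal(size=m)
y = rng.normal(size=m)
A = rng.normal(size=(m, m))

# Numerical gradients with respect to x of the three scalar expressions.
grad_xty = numerical_gradient(lambda v: v @ y, x)
grad_xtx = numerical_gradient(lambda v: v @ v, x)
grad_xtAx = numerical_gradient(lambda v: v @ A @ v, x)

print("x^T y  :", grad_xty)
print("x^T x  :", grad_xtx)
print("x^T A x:", grad_xtAx)
```

Evaluating a candidate closed-form answer at the same `x`, `y`, and `A` and comparing with `np.allclose` gives a quick consistency check.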

Problem 4 (Linear Regression, 10pts)

Suppose that $X \in \mathbb{R}^{n \times m}$ with $n \geq m$ and $Y \in \mathbb{R}^n$, and that $Y \sim \mathcal{N}(X\beta, \sigma^2 I)$. In this question you will derive the result that the maximum likelihood estimate $\hat{\beta}$ of $\beta$ is given by

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

(a) What are the expectation and covariance matrix of $\hat{\beta}$, for a given true value of $\beta$?

(b) Show that maximizing the likelihood is equivalent to minimizing the squared error $\sum_{i=1}^{n} (y_i - x_i \beta)^2$. [Hint: Use $\sum_{i=1}^{n} a_i^2 = a^T a$]

(c) Write the squared error in vector notation (see the hint above), expand the expression, and collect like terms. [Hint: Use $\beta^T x^T y = y^T x \beta$ (why?) and the fact that $x^T x$ is symmetric]

(d) Take the derivative of this expanded expression with respect to $\beta$ to show the maximum likelihood estimate $\hat{\beta}$ as above. [Hint: Use results 3.c and 3.d for derivatives in vector notation.]
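The closed-form estimate can be compared numerically with a generic least-squares solver. This is a sketch on synthetic data; the sizes `n` and `m`, the noise level `sigma`, and `beta_true` are hypothetical values chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
sigma = 0.5                              # assumed noise standard deviation
beta_true = np.array([1.0, -2.0, 0.5])   # hypothetical true coefficients

# Synthetic data following Y ~ N(X beta, sigma^2 I).
X = rng.normal(size=(n, m))
Y = X @ beta_true + sigma * rng.normal(size=n)

# Closed form (X^T X)^{-1} X^T Y, computed by solving the normal equations
# rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution that minimizes the squared error directly.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("closed form  :", beta_hat)
print("least squares:", beta_lstsq)
```

Using `np.linalg.solve` instead of `np.linalg.inv` is a numerical convenience; both correspond to the same formula when $X^T X$ is invertible.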


Problem 5 (Ridge Regression, 10pts)

Suppose we place a normal prior on $\beta$. That is, we assume that $\beta \sim \mathcal{N}(0, \tau^2 I)$.

(a) Show that the MAP estimate of $\beta$ given $Y$ in this context is

$$\hat{\beta}_{\mathrm{MAP}} = (X^T X + \lambda I)^{-1} X^T Y$$

where $\lambda = \sigma^2 / \tau^2$.

Estimating $\beta$ in this way is called ridge regression because the matrix $\lambda I$ looks like a “ridge”. Ridge regression is a common form of regularization that is used to avoid the overfitting that happens in linear regression when the sample size $n$ is close to the number of columns $m$ of $X$.

(b) Show that ridge regression is equivalent to adding $m$ additional rows to $X$, where the $j$-th additional row has its $j$-th entry equal to $\sqrt{\lambda}$ and all other entries equal to zero, adding $m$ corresponding additional entries to $Y$ that are all 0, and then computing the maximum likelihood estimate of $\beta$ using the modified $X$ and $Y$.
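The equivalence in part (b) can be checked numerically before proving it. The sketch below builds the augmented data described above and compares the resulting least-squares estimate with the ridge formula; the data are random and `lam` is a hypothetical value of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 4
lam = 0.7  # hypothetical value of lambda = sigma^2 / tau^2

X = rng.normal(size=(n, m))
Y = rng.normal(size=n)

# Ridge estimate from the closed form (X^T X + lambda I)^{-1} X^T Y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

# Augmented data: m extra rows sqrt(lambda) * e_j appended to X,
# and m extra zeros appended to Y.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(m)])
Y_aug = np.concatenate([Y, np.zeros(m)])

# Ordinary least squares on the augmented problem.
beta_aug, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

print("ridge closed form:", beta_ridge)
print("augmented OLS    :", beta_aug)
```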


Problem 6 (Gaussians in high dimensions, 10pts)

In this question we will investigate how our intuition for samples from a Gaussian may break down in higher dimensions. Consider samples from a $D$-dimensional unit Gaussian $x \sim \mathcal{N}(0_D, I_D)$, where $0_D$ indicates a column vector of $D$ zeros and $I_D$ is a $D \times D$ identity matrix.

1. Starting with the definition of Euclidean norm, quickly show that the distance of $x$ from the origin is $\sqrt{x^T x}$.

2. In low dimensions our intuition tells us that samples from the unit Gaussian will be near the origin. Draw 10000 samples from a $D = 1$ Gaussian and plot a normalized histogram for the distance of those samples from the origin. Does this confirm your intuition that the samples will be near the origin?

3. Draw 10000 samples from $D = \{1, 2, 3, 10, 100\}$ Gaussians and, on a single plot, show the normalized histograms for the distance of those samples from the origin. As the dimensionality of the Gaussian increases, what can you say about the expected distance of the samples from the Gaussian’s mean (in this case, the origin)?

4. From Wikipedia, if $x_i$ are $k$ independent, normally distributed random variables with means $\mu_i$ and standard deviations $\sigma_i$, then the statistic

$$Y = \sqrt{\sum_{i=1}^{k} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2}$$

is distributed according to the $\chi$-distribution. On the previous normalized histogram, plot the probability density function (pdf) of the $\chi$-distribution for $k = \{1, 2, 3, 10, 100\}$.

5. Taking two samples from the $D$-dimensional unit Gaussian, $x_a, x_b \sim \mathcal{N}(0_D, I_D)$, how is $x_a - x_b$ distributed? Using the above result about the $\chi$-distribution, how is $\|x_a - x_b\|_2$ distributed? (Hint: start with a $\chi$-distributed random variable and use the change of variables formula.) Plot the pdfs of this distribution for $k = \{1, 2, 3, 10, 100\}$. How does the distance between samples from a Gaussian behave as dimensionality increases? Confirm this by drawing two sets of 1000 samples from the $D$-dimensional unit Gaussian. On the plot of the $\chi$-distribution pdfs, plot the normalized histogram of the distance between samples from the first and second set.

6. In lecture we saw examples of interpolating between latent points to generate convincing data. Given two samples from a Gaussian, $x_a, x_b \sim \mathcal{N}(0_D, I_D)$, the linear interpolation between them, $x_\alpha$, is defined as a function of $\alpha \in [0, 1]$:

$$\mathrm{lin\_interp}(\alpha, x_a, x_b) = \alpha x_a + (1 - \alpha) x_b$$

For two sets of 1000 samples from the unit Gaussian in $D$ dimensions, plot the average log-likelihood along the linear interpolations between the pairs of samples as a function of $\alpha$. That is, for each pair of samples compute the log-likelihood along a linear space of interpolated points between them, $\mathcal{N}(x_\alpha \mid 0, I)$ for $\alpha \in [0, 1]$, and plot the average log-likelihood over all the interpolations. Do this for $D = \{1, 2, 3, 10, 100\}$, one plot per dimensionality. Comment on the log-likelihood under the unit Gaussian of points along the linear interpolation. Is a higher log-likelihood for the interpolated points necessarily better? Given this, is it a good idea to linearly interpolate between samples from a high-dimensional Gaussian?

7. Instead we can interpolate in polar coordinates. For $\alpha \in [0, 1]$ the polar interpolation is

$$\mathrm{polar\_interp}(\alpha, x_a, x_b) = \sqrt{\alpha}\, x_a + \sqrt{1 - \alpha}\, x_b$$

This interpolates between two points while maintaining Euclidean norm. On the same plot from the previous question, plot the probability density of the polar interpolation between pairs of samples from two sets of 1000 samples from $D$-dimensional unit Gaussians for $D = \{1, 2, 3, 10, 100\}$. Comment on the log-likelihood under the unit Gaussian of points along the polar interpolation. Give an intuitive explanation for why polar interpolation is more suitable than linear interpolation for high-dimensional Gaussians. For 6. and 7. you should have one plot for each $D$ with two curves on each.

8. (Bonus 5pts) In the previous two questions we computed the average log-likelihood of the linear and polar interpolations under the unit Gaussian. Instead, consider the norm along the interpolation, $\sqrt{x_\alpha^T x_\alpha}$. As we saw previously, this is distributed according to the $\chi$-distribution. Compute and plot the average log-likelihood of the norm along the two interpolations under the $\chi$-distribution for $D = \{1, 2, 3, 10, 100\}$, i.e. $\chi_D\!\left(\sqrt{x_\alpha^T x_\alpha}\right)$. There should be one plot for each $D$, each with two curves corresponding to the log-likelihood of the linear and polar interpolations. How does the log-likelihood along the linear interpolation compare to the log-likelihood of the true samples (endpoints)? Using your answers to questions 3 and 4, provide geometric intuition for the log-likelihood along the linear and polar interpolations. Use this to further justify your explanation for the suitability of polar vs. linear interpolation.
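For the sampling and histogram parts of this problem (items 2 to 4), the following sketch draws samples, computes their distances from the origin, and evaluates the $\chi$ density at the histogram bin centers. The helper `chi_logpdf`, the bin count, and the random seed are assumptions; the density formula is the standard one for the $\chi$-distribution with $k$ degrees of freedom. Plotting is left to whichever plotting library you prefer.

```python
import math
import numpy as np

def chi_logpdf(r, k):
    """Log-density of the chi distribution with k degrees of freedom, for r > 0."""
    r = np.asarray(r, dtype=float)
    return (
        (k - 1) * np.log(r)
        - 0.5 * r**2
        - (0.5 * k - 1.0) * math.log(2.0)
        - math.lgamma(0.5 * k)
    )

rng = np.random.default_rng(0)
n_samples = 10_000
mean_distance = {}

for D in (1, 2, 3, 10, 100):
    # Samples from the D-dimensional unit Gaussian, one per row.
    x = rng.standard_normal(size=(n_samples, D))
    # Euclidean distance of each sample from the origin.
    dist = np.sqrt(np.sum(x**2, axis=1))
    mean_distance[D] = dist.mean()

    # Normalized histogram (integrates to 1) and the chi pdf at the bin centers.
    density, edges = np.histogram(dist, bins=50, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    chi_density = np.exp(chi_logpdf(centers, D))

    print(D, mean_distance[D], np.max(np.abs(density - chi_density)))
```

The arrays `centers`, `density`, and `chi_density` are what one would pass to a bar or line plot to overlay the histogram and the pdf for each `D`.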

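For the interpolation parts of Problem 6 (items 6 and 7), a minimal sketch of the two interpolation functions and the average log-likelihood under the unit Gaussian might look as follows. The function names follow the problem statement; `unit_gaussian_logpdf`, the choice `D = 100`, the 21-point grid of `alphas`, and the seed are assumptions for illustration. Repeating this for each required `D` and plotting both curves is left to the reader.

```python
import numpy as np

def lin_interp(alpha, xa, xb):
    """Linear interpolation: alpha * xa + (1 - alpha) * xb."""
    return alpha * xa + (1.0 - alpha) * xb

def polar_interp(alpha, xa, xb):
    """Polar interpolation: sqrt(alpha) * xa + sqrt(1 - alpha) * xb."""
    return np.sqrt(alpha) * xa + np.sqrt(1.0 - alpha) * xb

def unit_gaussian_logpdf(x):
    """Log-density of N(0, I), evaluated for each row of x."""
    D = x.shape[-1]
    return -0.5 * np.sum(x**2, axis=-1) - 0.5 * D * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
D, n_pairs = 100, 1000
xa = rng.standard_normal(size=(n_pairs, D))  # first set of samples
xb = rng.standard_normal(size=(n_pairs, D))  # second set of samples
alphas = np.linspace(0.0, 1.0, 21)

# Average log-likelihood over all pairs at each interpolation coefficient.
avg_ll_lin = np.array(
    [unit_gaussian_logpdf(lin_interp(a, xa, xb)).mean() for a in alphas]
)
avg_ll_polar = np.array(
    [unit_gaussian_logpdf(polar_interp(a, xa, xb)).mean() for a in alphas]
)

print("linear:", avg_ll_lin[[0, 10, 20]])
print("polar :", avg_ll_polar[[0, 10, 20]])
```

The printed values at `alpha` equal to 0, 0.5, and 1 give a first look at how the two curves differ between the endpoints and the midpoint.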