Description
Problem 1 (Variance and covariance, 6 points)
Let X and Y be two continuous independent random variables.
(a) Starting from the definition of independence, show that the independence of X and Y implies that their covariance
is zero.
(b) For a scalar constant a, show the following two properties, starting from the definition of expectation:
E(X + aY) = E(X) + a E(Y)
var(X + aY) = var(X) + a^2 var(Y)
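(Optional) Before writing the proofs, a quick Monte Carlo sanity check of both identities can be useful. This is a minimal sketch assuming NumPy is available; the particular distributions and constant a are arbitrary choices, and this does not replace the requested derivations.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 2.5
# Two independent random variables (an exponential and a uniform, say)
X = rng.exponential(scale=1.0, size=1_000_000)
Y = rng.uniform(-1.0, 1.0, size=1_000_000)

Z = X + a * Y
print(np.cov(X, Y)[0, 1])                     # ~0: covariance of independent variables
print(Z.mean(), X.mean() + a * Y.mean())      # E(X + aY) vs E(X) + aE(Y)
print(Z.var(), X.var() + a**2 * Y.var())      # var(X + aY) vs var(X) + a^2 var(Y)
```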
Problem 2 (Densities, 5 points)
Answer the following questions:
(a) Can a probability density function (pdf) ever take values greater than 1?
(b) Let X be a univariate normally distributed random variable with mean 0 and variance 1/100. What is the pdf of
X?
(c) What is the value of this pdf at 0?
(d) What is the probability that X = 0?
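(Optional) A quick numerical check for this problem is sketched below; it assumes SciPy is available and is only a sanity check, not a substitute for the written answers. Note that SciPy's scale parameter is the standard deviation, not the variance.

```python
from scipy.stats import norm

# N(0, 1/100): scale is the standard deviation, so scale = sqrt(1/100) = 0.1
density = norm(loc=0.0, scale=0.1)
print(density.pdf(0.0))                       # value of the pdf at 0
eps = 1e-6
print(density.cdf(eps) - density.cdf(-eps))   # P(-eps < X < eps), which shrinks to 0 as eps -> 0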
Problem 3 (Calculus, 4 points)
Let x, y ∈ R^m and A ∈ R^{m×m}. Please answer the following questions, writing your answers in vector notation.
(a) What is the gradient with respect to x of x^T y?
(b) What is the gradient with respect to x of x^T x?
(c) What is the gradient with respect to x of x^T A?
(d) What is the gradient with respect to x of x^T Ax?
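(Optional) If you want to check your closed-form answers, a finite-difference sketch is given below. It assumes NumPy; it only prints numerical gradients for you to compare against your own derivations. Part (c) is vector-valued, so this scalar-function check applies only to (a), (b), and (d).

```python
import numpy as np

def numerical_grad(f, x, h=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(1)
m = 4
x = rng.standard_normal(m)
y = rng.standard_normal(m)
A = rng.standard_normal((m, m))

print(numerical_grad(lambda v: v @ y, x))       # compare to your answer for (a)
print(numerical_grad(lambda v: v @ v, x))       # compare to your answer for (b)
print(numerical_grad(lambda v: v @ A @ v, x))   # compare to your answer for (d)
```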
Problem 4 (Linear Regression, 10pts)
Suppose that X ∈ R^{n×m} with n ≥ m and Y ∈ R^n, and that Y ∼ N(Xβ, σ^2 I). In this question you will derive the result that the maximum likelihood estimate β̂ of β is given by

β̂ = (X^T X)^{-1} X^T Y
(a) What are the expectation and covariance matrix of β̂, for a given true value of β?
(b) Show that maximizing the likelihood is equivalent to minimizing the squared error ∑_{i=1}^{n} (y_i − x_i β)^2. [Hint: use ∑_{i=1}^{n} a_i^2 = a^T a.]
(c) Write the squared error in vector notation (see the hint above), expand the expression, and collect like terms. [Hint: use β^T X^T Y = Y^T Xβ (why?) and the fact that X^T X is symmetric.]
(d) Take the derivative of this expanded expression with respect to β to show the maximum likelihood estimate β̂ given above. [Hint: Use results 3.c and 3.d for derivatives in vector notation.]
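(Optional) Once you have derived β̂, a quick numerical comparison against a library least-squares solver can catch algebra slips. This is a sketch assuming NumPy; the synthetic data and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma = 50, 3, 0.5
X = rng.standard_normal((n, m))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + sigma * rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # (X^T X)^{-1} X^T Y, via a linear solve
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)  # library least-squares solution
print(beta_hat)
print(beta_lstsq)   # should agree to numerical precision
```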
Problem 5 (Ridge Regression, 10pts)
Suppose we place a normal prior on β. That is, we assume that β ∼ N(0, τ^2 I).
(a) Show that the MAP estimate of β given Y in this context is

β̂_MAP = (X^T X + λI)^{-1} X^T Y

where λ = σ^2 / τ^2.
Estimating β in this way is called ridge regression because the matrix λI looks like a “ridge”. Ridge regression is a common form of regularization used to avoid the overfitting that occurs when the sample size n is close to the number of covariates m in linear regression.
(b) Show that ridge regression is equivalent to adding m additional rows to X, where the j-th additional row has its j-th entry equal to √λ and all other entries equal to zero, adding m corresponding additional entries to Y that are all 0, and then computing the maximum likelihood estimate of β using the modified X and Y.
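(Optional) The row-augmentation claim in (b) is easy to verify numerically before proving it. The sketch below, assuming NumPy, compares the closed-form ridge estimate with ordinary least squares on the augmented data; it is a sanity check, not the requested proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, lam = 20, 5, 2.0
X = rng.standard_normal((n, m))
Y = rng.standard_normal(n)

# Closed-form ridge estimate: (X^T X + lam * I)^{-1} X^T Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

# Augmented data: append sqrt(lam) * I below X and m zeros below Y, then do plain least squares
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(m)])
Y_aug = np.concatenate([Y, np.zeros(m)])
beta_aug, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

print(np.allclose(beta_ridge, beta_aug))   # True: the two estimates coincide
```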
Problem 6 (Gaussians in high dimensions, 10pts)
In this question we will investigate how our intuition for samples from a Gaussian may break down in higher dimensions. Consider samples from a D-dimensional unit Gaussian x ∼ N(0_D, I_D), where 0_D indicates a column vector of D zeros and I_D is a D × D identity matrix.
1. Starting with the definition of the Euclidean norm, quickly show that the distance of x from the origin is √(x^T x).
2. In low dimensions our intuition tells us that samples from the unit Gaussian will be near the origin. Draw 10000 samples from a D = 1 Gaussian and plot a normalized histogram for the distance of those samples from the origin (a plotting sketch covering items 2-4 appears after this problem). Does this confirm your intuition that the samples will be near the origin?
3. Draw 10000 samples from D = {1, 2, 3, 10, 100} Gaussians and, on a single plot, show the normalized histograms for the distance of those samples from the origin. As the dimensionality of the Gaussian increases, what can you say about the expected distance of the samples from the Gaussian's mean (in this case, the origin)?
4. From Wikipedia, if x_i are k independent, normally distributed random variables with means µ_i and standard deviations σ_i, then the statistic Y = √(∑_{i=1}^{k} ((x_i − µ_i)/σ_i)^2) is distributed according to the χ-distribution. On the previous normalized histogram, plot the probability density function (pdf) of the χ-distribution for k = {1, 2, 3, 10, 100}.
5. Taking two samples from the D-dimensional unit Gaussian, x_a, x_b ∼ N(0_D, I_D), how is x_a − x_b distributed? Using the above result about the χ-distribution, how is ||x_a − x_b||_2 distributed? (Hint: start with a χ-distributed random variable and use the change of variables formula.) Plot the pdfs of this distribution for k = {1, 2, 3, 10, 100}. How does the distance between samples from a Gaussian behave as dimensionality increases? Confirm this by drawing two sets of 1000 samples from the D-dimensional unit Gaussian. On the plot of the χ-distribution pdfs, plot the normalized histogram of the distance between samples from the first and second set (see the second sketch after this problem).
6. In lecture we saw examples of interpolating between latent points to generate convincing data. Given two samples from a Gaussian, x_a, x_b ∼ N(0_D, I_D), the linear interpolation between them, x_α, is defined as a function of α ∈ [0, 1]:

lin_interp(α, x_a, x_b) = α x_a + (1 − α) x_b

For two sets of 1000 samples from the unit Gaussian in D dimensions, plot the average log-likelihood along the linear interpolations between the pairs of samples as a function of α (i.e. for each pair of samples compute the log-likelihood along a linear space of interpolated points between them, N(x_α | 0, I) for α ∈ [0, 1], then plot the average log-likelihood over all the interpolations). Do this for D = {1, 2, 3, 10, 100}, one plot per dimensionality (a sketch covering items 6-8 appears after this problem). Comment on the log-likelihood under the unit Gaussian of points along the linear interpolation. Is a higher log-likelihood for the interpolated points necessarily better? Given this, is it a good idea to linearly interpolate between samples from a high-dimensional Gaussian?
7. Instead we can interpolate in polar coordinates: for α ∈ [0, 1] the polar interpolation is

polar_interp(α, x_a, x_b) = √α x_a + √(1 − α) x_b

This interpolates between two points while maintaining Euclidean norm. On the same plot from the previous question, plot the average log-likelihood along the polar interpolation between pairs of samples from two sets of 1000 samples from D-dimensional unit Gaussians, for D = {1, 2, 3, 10, 100}. Comment on the log-likelihood under the unit Gaussian of points along the polar interpolation. Give an intuitive explanation for why polar interpolation is more suitable than linear interpolation for high-dimensional Gaussians. For items 6 and 7 you should have one plot for each D, with two curves on each.
8. (Bonus, 5pts) In the previous two questions we computed the average log-likelihood of the linear and polar interpolations under the unit Gaussian. Instead, consider the norm along the interpolation, √(x_α^T x_α). As we saw previously, this is distributed according to the χ-distribution. Compute and plot the average log-likelihood of the norm along the two interpolations under the χ-distribution for D = {1, 2, 3, 10, 100}, i.e. χ_D(√(x_α^T x_α)). There should be one plot for each D, each with two curves corresponding to the log-likelihood of the linear and polar interpolations. How does the log-likelihood along the linear interpolation compare to the log-likelihood of the true samples (endpoints)? Using your answers to questions 3 and 4, provide geometric intuition for the log-likelihood along the linear and polar interpolations. Use this to further justify your explanation for the suitability of polar vs. linear interpolation.
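The sketches below are optional starting points for the computational items of this problem. They assume NumPy, SciPy, and Matplotlib are available; sample counts, seeds, bin counts, and styling are illustrative choices, not requirements.

For items 2-4, one way to draw the samples, histogram the distances from the origin, and overlay the χ pdf:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi

rng = np.random.default_rng(0)
n_samples = 10_000
dims = [1, 2, 3, 10, 100]

for D in dims:
    x = rng.standard_normal((n_samples, D))      # 10000 samples from N(0_D, I_D)
    dist = np.linalg.norm(x, axis=1)             # distance of each sample from the origin
    plt.hist(dist, bins=100, density=True, alpha=0.4, label=f"D = {D}")
    grid = np.linspace(0.0, dist.max(), 500)
    plt.plot(grid, chi.pdf(grid, df=D), "k--")   # chi pdf with k = D degrees of freedom (item 4)
plt.xlabel("distance from origin")
plt.ylabel("normalized frequency / density")
plt.legend()
plt.show()
```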
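For item 5, a sketch for histogramming the distance between paired samples from the two sets. The dashed reference curve uses a scaled χ density as a candidate; verify it against your own derivation of how ||x_a − x_b||_2 is distributed.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi

rng = np.random.default_rng(1)
n_samples = 1_000
dims = [1, 2, 3, 10, 100]

for D in dims:
    xa = rng.standard_normal((n_samples, D))
    xb = rng.standard_normal((n_samples, D))
    dist = np.linalg.norm(xa - xb, axis=1)       # distance between paired samples
    plt.hist(dist, bins=50, density=True, alpha=0.4, label=f"D = {D}")
    grid = np.linspace(0.0, dist.max(), 500)
    # candidate density: chi distribution scaled by sqrt(2); check against your derivation
    plt.plot(grid, chi.pdf(grid, df=D, scale=np.sqrt(2)), "k--")
plt.xlabel("||x_a - x_b||")
plt.ylabel("density")
plt.legend()
plt.show()
```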
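For items 6-8, one way to organize the computation: for each pair of samples, evaluate the log-density of the interpolated points under N(0, I) and average over pairs, producing one figure per dimensionality with one curve per interpolation scheme; the bonus item swaps in the χ log-density of the norm, as indicated in the comment.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal, chi

rng = np.random.default_rng(2)
n_pairs = 1_000
alphas = np.linspace(0.0, 1.0, 50)

def lin_interp(alpha, xa, xb):
    return alpha * xa + (1 - alpha) * xb

def polar_interp(alpha, xa, xb):
    return np.sqrt(alpha) * xa + np.sqrt(1 - alpha) * xb

for D in [1, 2, 3, 10, 100]:
    xa = rng.standard_normal((n_pairs, D))
    xb = rng.standard_normal((n_pairs, D))
    mvn = multivariate_normal(mean=np.zeros(D), cov=np.eye(D))

    lin_ll, pol_ll = [], []
    for alpha in alphas:
        x_lin = lin_interp(alpha, xa, xb)         # (n_pairs, D) interpolated points
        x_pol = polar_interp(alpha, xa, xb)
        lin_ll.append(mvn.logpdf(x_lin).mean())   # average log N(x_alpha | 0, I), items 6-7
        pol_ll.append(mvn.logpdf(x_pol).mean())
        # item 8 (bonus): use the chi log-density of the norm instead, e.g.
        # chi.logpdf(np.linalg.norm(x_lin, axis=1), df=D).mean()

    plt.figure()
    plt.plot(alphas, lin_ll, label="linear interpolation")
    plt.plot(alphas, pol_ll, label="polar interpolation")
    plt.xlabel("alpha")
    plt.ylabel("average log-likelihood under N(0, I)")
    plt.title(f"D = {D}")
    plt.legend()
plt.show()
```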