Description
Objective
After this homework, you will be familiar with:
(a) Basic plotting tools in Python, for both 1D and 2D plots. Drawing random samples in Python.
(b) Cross-validation. See Introduction to Probability for Data Science, Chapter 3.2.5.
(c) Gaussian whitening. See Introduction to Probability for Data Science Chapter 5.7.4.
(d) Basic ideas of regression. See Introduction to Probability for Data Science Chapter 7.1.4.
You will be asked some of these questions in Quiz 1.
Exercise 1: Histogram and Cross-Validation
Let X be a random variable with X ∼ N(µ, σ²). The PDF of X is written explicitly as

f_X(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).   (1)
(a) Let µ = 0 and σ = 1 so that X ∼ N(0, 1). Plot f_X(x) using matplotlib.pyplot.plot for the range
x ∈ [−3, 3]. Use matplotlib.pyplot.savefig to save your figure.
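For reference, part (a) could be sketched as follows (the output filename is a placeholder, and calling scipy.stats.norm.pdf instead of coding Equation (1) by hand is an assumption):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # render without a display
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-3, 3, 200)        # the range x in [-3, 3]
fx = norm.pdf(x, loc=0, scale=1)   # f_X(x) for N(0, 1)

plt.plot(x, fx)
plt.xlabel("x")
plt.ylabel("f_X(x)")
plt.savefig("gaussian_pdf.png")    # placeholder filename
```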
(b) Let us investigate the use of histograms in data visualization.
(i) Use numpy.random.normal to draw 1000 random samples from N (0, 1).
(ii) Make two histogram plots using matplotlib.pyplot.hist, with the number of bins m set to 4
and 1000.
(iii) Use scipy.stats.norm.fit to estimate the mean and standard deviation of your data. Report
the estimated values.
(iv) Plot the fitted Gaussian curve on top of the two histogram plots using scipy.stats.norm.pdf.
(v) (Optional) Ask yourself the following questions: Are the two histograms representative of your
data’s distribution? How are they different in terms of data representation?
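A minimal sketch of steps (i)–(iv), assuming density=True so the histogram and the fitted PDF share the same vertical scale (filenames are placeholders):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                       # render without a display
import matplotlib.pyplot as plt
from scipy.stats import norm

samples = np.random.normal(0, 1, 1000)      # (i) 1000 samples from N(0, 1)
mu_hat, sigma_hat = norm.fit(samples)       # (iii) estimated mean and std
print(mu_hat, sigma_hat)

t = np.linspace(-4, 4, 200)
for m in (4, 1000):                         # (ii) two bin counts
    plt.figure()
    plt.hist(samples, bins=m, density=True)           # normalized histogram
    plt.plot(t, norm.pdf(t, mu_hat, sigma_hat), "r")  # (iv) fitted Gaussian
    plt.savefig(f"hist_{m}_bins.png")       # placeholder filename
```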
(c) A practical way to estimate the optimal bin width is to use the cross-validation estimator of risk of
the dataset. Denote h = (max data value − min data value)/m as the bin width, where m is the number
of bins (assuming you applied no rescaling to your raw data). We seek the h* that minimizes the risk
Ĵ(h), expressed as follows:

Ĵ(h) = 2/(h(n − 1)) − ((n + 1)/(h(n − 1))) Σ_{j=1}^{m} p̂_j²,   (2)

where p̂_j is the empirical probability of a sample falling into bin j, and n is the total number
of samples.
(i) Plot Ĵ(h) with respect to the number of bins m, for m = 1, 2, …, 200.
(ii) Find the m* that minimizes Ĵ(h), and plot the histogram of your data with that m*.
(iii) Plot the Gaussian curve fitted to your data on top of your histogram.
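One way (c)(i)–(ii) could be implemented, assuming the empirical probabilities p̂_j are obtained from numpy.histogram counts over the same 1000-sample dataset as in (b):

```python
import numpy as np

def cv_risk(data, m):
    """Cross-validation estimator of risk for a histogram with m bins."""
    n = data.size
    h = (data.max() - data.min()) / m            # bin width, Equation (2)
    p_hat = np.histogram(data, bins=m)[0] / n    # empirical bin probabilities
    return 2.0 / (h * (n - 1)) - (n + 1) / (h * (n - 1)) * np.sum(p_hat ** 2)

data = np.random.normal(0, 1, 1000)
ms = np.arange(1, 201)                           # m = 1, 2, ..., 200
risks = np.array([cv_risk(data, m) for m in ms])
m_star = ms[np.argmin(risks)]                    # bin count minimizing the risk
print(m_star)
```

For m = 1 all samples fall into a single bin, so p̂₁ = 1 and the risk collapses to −1/h, which is a quick sanity check on the implementation.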
Note: For additional discussions about this cross-validation technique, visit Introduction to Probability
for Data Science, Chapter 3.2.5. More advanced materials can be found in the supplementary note of
this homework.
Exercise 2: Gaussian Whitening
In this exercise, we consider the following question: suppose that we are given a random number generator
that can only generate zero-mean unit variance Gaussians, i.e., X ∼ N (0, I), how do we transform the
distribution of X to an arbitrary Gaussian distribution? We will first derive a few equations, and then verify
them with an empirical example, by drawing samples from the 2D Gaussian, applying the transform to the
dataset, and checking if the transformed dataset really takes the form of the desired Gaussian.
(a) Let X ∼ N(µ, Σ) be a 2D Gaussian. The PDF of X is given by

f_X(x) = (1/√((2π)²|Σ|)) exp(−(1/2)(x − µ)ᵀΣ⁻¹(x − µ)),   (3)
where in this exercise we assume

X = (X₁, X₂)ᵀ, x = (x₁, x₂)ᵀ, µ = (2, 6)ᵀ, and Σ = [[2, 1], [1, 2]].   (4)
(i) Simplify the expression f_X(x) for the particular choices of µ and Σ here. Show your derivation.
(ii) Using matplotlib.pyplot.contour, plot the contour of f_X(x) for the range x ∈ [−1, 5] × [0, 10].
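A sketch of the contour plot in (a)(ii); evaluating the PDF via scipy.stats.multivariate_normal is an assumption here — you may instead evaluate your simplified expression from (a)(i):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                       # render without a display
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

mu = np.array([2.0, 6.0])
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])

# Grid covering the range [-1, 5] x [0, 10]
x1, x2 = np.meshgrid(np.linspace(-1, 5, 100), np.linspace(0, 10, 100))
pos = np.dstack((x1, x2))                        # shape (100, 100, 2)
f = multivariate_normal(mu, Sigma).pdf(pos)      # f_X(x) on the grid

plt.contour(x1, x2, f)
plt.xlabel("x1")
plt.ylabel("x2")
plt.savefig("contour.png")                       # placeholder filename
```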
(b) Suppose X ∼ N (0, I). We would like to derive a transformation that can map X to an arbitrary
Gaussian.
(i) Let X ∼ N(0, I) be a d-dimensional random vector. Let A ∈ ℝ^{d×d} and b ∈ ℝ^d, and let
Y = AX + b be an affine transformation of X. Let µ_Y := E[Y] be the mean vector and
Σ_Y := E[(Y − µ_Y)(Y − µ_Y)ᵀ] be the covariance matrix. Show that

µ_Y = b, and Σ_Y = AAᵀ.   (5)
(ii) Show that ΣY is symmetric positive semi-definite.
(iii) Under what condition on A would ΣY become a symmetric positive definite matrix?
(iv) Consider a random variable Y ∼ N(µ_Y, Σ_Y) such that

µ_Y = (2, 6)ᵀ, and Σ_Y = [[2, 1], [1, 2]].
Determine A and b which could satisfy Equation (5).
Hint: Consider eigen-decomposition of ΣY . You may compute the eigen-decomposition numerically.
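Following the hint, one possible numerical sketch: the symmetric square root A = UΛ^{1/2}Uᵀ is just one valid choice, since any A with AAᵀ = Σ_Y satisfies Equation (5):

```python
import numpy as np

Sigma_Y = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([2.0, 6.0])            # mu_Y = b, by Equation (5)

# Eigen-decomposition: Sigma_Y = U diag(lam) U^T
lam, U = np.linalg.eig(Sigma_Y)
A = U @ np.diag(np.sqrt(lam)) @ U.T  # symmetric square root: A A^T = Sigma_Y

print(A @ A.T)                       # recovers Sigma_Y
```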
(c) Now let us verify our results from part (b) with an empirical example.
(i) Use numpy.random.multivariate_normal to draw 5000 random samples from the 2D standard
normal distribution, and make a scatter plot of the data points using matplotlib.pyplot.scatter.
(ii) Write a Python program using numpy.linalg.eig to obtain A given Σ_Y in part (b)(iv). Apply
the transformation to the data points, and make a scatter plot of the transformed data points to
check whether the transformation is correct.
(iii) (Optional) Do your results from parts (c)(i) and (ii) support your theoretical findings from part
(b)?
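Steps (c)(i)–(ii) might look like the following sketch (the symmetric square root of Σ_Y is one possible choice of A, and the filename is a placeholder):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                       # render without a display
import matplotlib.pyplot as plt

# (i) 5000 samples from the 2D standard normal
X = np.random.multivariate_normal(np.zeros(2), np.eye(2), 5000)

# (ii) build A from the eigen-decomposition of Sigma_Y and transform
Sigma_Y = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([2.0, 6.0])
lam, U = np.linalg.eig(Sigma_Y)
A = U @ np.diag(np.sqrt(lam)) @ U.T         # A A^T = Sigma_Y
Y = X @ A.T + b                             # row-wise Y = A x + b

plt.scatter(X[:, 0], X[:, 1], s=2, label="X")
plt.scatter(Y[:, 0], Y[:, 1], s=2, label="Y = AX + b")
plt.legend()
plt.savefig("whitening_check.png")          # placeholder filename
```

The empirical mean and covariance of the transformed samples should be close to µ_Y and Σ_Y.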
You can find more information about Gaussian whitening in Introduction to Probability for Data Science,
Chapter 5.7.4.
Exercise 3: Linear Regression
Let us consider a polynomial fitting problem. We assume the following model:
y = β0 + β1L1(x) + β2L2(x) + . . . + βpLp(x) + ϵ, (6)
where Lp(x) is the p-th Legendre polynomial, βj are the coefficients, and ϵ is the error term. In Python, if
you have specified a list of values of x, evaluating the Legendre polynomial is quite straightforward:
import numpy as np
from scipy.special import eval_legendre
x = np.linspace(-1,1,50) # 50 points in the interval [-1,1]
L4 = eval_legendre(4,x) # evaluate the 4th order Legendre polynomial for x
(a) Let β0 = −0.001, β1 = 0.01, β2 = 0.55, β3 = 1.5, β4 = 1.2, and let ϵ ∼ Gaussian(0, 0.2²). Generate 50
points of y over the interval x = np.linspace(-1,1,50). That is,
x = np.linspace(-1,1,50) # 50 points in the interval [-1,1]
y = … # fill this line
Scatter plot the data.
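One way the data generation could look, using that L0(x) = 1 so the intercept β0 fits the same pattern (the filename is a placeholder):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                    # render without a display
import matplotlib.pyplot as plt
from scipy.special import eval_legendre

beta = [-0.001, 0.01, 0.55, 1.5, 1.2]    # beta_0, ..., beta_4
x = np.linspace(-1, 1, 50)               # 50 points in the interval [-1, 1]

# Equation (6) with p = 4; L_0(x) = 1, so beta[0] * L_0(x) = beta_0
y = sum(beta[p] * eval_legendre(p, x) for p in range(5))
y = y + np.random.normal(0, 0.2, 50)     # epsilon ~ Gaussian(0, 0.2^2)

plt.scatter(x, y)
plt.savefig("data_scatter.png")          # placeholder filename
```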
(b) Given the N = 50 data points, formulate the linear regression problem. Specifically, write down the
expression
β̂ = argmin_β ‖y − Xβ‖².   (7)
What are y, X, and β? Derive the optimal solution for this simple regression problem. Express your
answer in terms of X and y.
(c) Write Python code to compute the solution. Overlay your predicted curve on the scatter plot.
For solving the regression problem, you can call numpy.linalg.lstsq.
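A sketch of part (c), assuming the design matrix X stacks the Legendre polynomials column by column, with the data regenerated as in (a) (the filename is a placeholder):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                    # render without a display
import matplotlib.pyplot as plt
from scipy.special import eval_legendre

# Regenerate the data from part (a)
beta_true = [-0.001, 0.01, 0.55, 1.5, 1.2]
x = np.linspace(-1, 1, 50)
y = sum(beta_true[p] * eval_legendre(p, x) for p in range(5)) \
    + np.random.normal(0, 0.2, 50)

# Design matrix: column j holds L_j(x), j = 0, ..., 4
X = np.column_stack([eval_legendre(p, x) for p in range(5)])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # min ||y - X beta||^2

plt.scatter(x, y)
plt.plot(x, X @ beta_hat, "r")           # predicted curve overlay
plt.savefig("ls_fit.png")                # placeholder filename
```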
(d) For the y you have generated, make 5 outlier points using the code below:
# …
idx = [10,16,23,37,45] # these are the locations of the outliers
y[idx] = 5 # set the outliers to have a value 5
# …
Run your code in (c) again. Comment on the difference.
(e) Consider the optimization
β̂ = argmin_β ‖y − Xβ‖₁.   (8)
Convert the problem into a linear programming problem. Express your solution in the linear programming form:

minimize_x cᵀx subject to Ax ≤ b.   (9)
What are c, x, A, and b?
(f) Solve the linear programming problem in Python using scipy.optimize.linprog, for the corrupted
data in (d). Scatter plot the data, and overlay with your predicted curve. Hint: Remember to set
bounds=(None,None) when you call scipy.optimize.linprog, because the variables in linprog are
non-negative by default.
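As a starting point, one standard construction introduces auxiliary variables u ≥ |y − Xβ| elementwise and minimizes Σ_j u_j; the sketch below uses that construction, with the data setup repeating (a) and (d):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.special import eval_legendre

# Rebuild the corrupted data from (a) and (d)
beta_true = [-0.001, 0.01, 0.55, 1.5, 1.2]
x = np.linspace(-1, 1, 50)
y = sum(beta_true[p] * eval_legendre(p, x) for p in range(5)) \
    + np.random.normal(0, 0.2, 50)
idx = [10, 16, 23, 37, 45]
y[idx] = 5                               # outliers as in (d)

X = np.column_stack([eval_legendre(p, x) for p in range(5)])
N, d = X.shape

# Decision variable z = [beta; u]; minimize sum(u) subject to
#   X beta - u <= y   and   -X beta - u <= -y,
# which together enforce u >= |y - X beta| elementwise.
c = np.concatenate([np.zeros(d), np.ones(N)])
A_ub = np.block([[X, -np.eye(N)], [-X, -np.eye(N)]])
b_ub = np.concatenate([y, -y])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
beta_hat = res.x[:d]
print(beta_hat)
```

Unlike the least-squares fit in (c), this L1 fit should be largely unaffected by the five outliers.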
For this problem, you may want to check Introduction to Probability for Data Science, Chapter 7.1.4.
Exercise 4: Project: Check Point 1
By now I believe you should have read the course project instructions. If not, please read them now. The
objective of this series of “check points” is to keep track of your progress, so that you will not wait until the
last minute and then panic. For check point 1, I want you to complete the following tasks.
1. Go through every topic listed on the course project webpage. You don’t have to understand the
technical details, but you need to know the purpose of the work. Then go to this Google Form
https://forms.gle/435QAHQ6n83gYW7g6 to fill in your preference. The deadline for this form is the
same as the homework deadline. If you don’t finish this survey, we will have to assign your project
preference randomly.
2. LaTeX. I only accept final reports typed in LaTeX, using the ICML 2021 template. You can use Overleaf
(a free online platform) to type your report. Download the ICML template and upload it to Overleaf.
Change the title to the paper title of your first choice, and change the author name to your name.
When typing your name, please tell us your name, your major (e.g., ECE, AAE, etc.), and your level
(e.g., Online MS, PhD, Undergrad, etc.).
Clear the rest of the contents of the document and attach this empty document to the end of your
homework submission. (By default the ICML template uses the review mode. Please switch it to the
camera-ready mode.)