## Description

### Part 1

**PCA**

1. PCA can be explained from two different perspectives. What are the two perspectives?

2. The first principal direction is the direction in which the projections of the data points have the largest variance in the input space. We use $\lambda_1$ to represent the first/largest eigenvalue of the covariance matrix, $w_1$ to denote the corresponding principal vector/direction ($w_1$ has unit length, i.e., its L2 norm is 1), $\mu$ to represent the sample mean, and $x$ to represent a data point. The deviation of $x$ from the mean $\mu$ is $x - \mu$.

$y = \mathrm{pca}(x)$ is implemented in sk-learn with `whiten=True`, and the number of components/elements of $y$ is usually less than the number of components/elements of $x$.

(1) What is the scalar projection of $x$ in the direction of $w_1$?

(2) What is the scalar projection of the deviation $x - \mu$ in the direction of $w_1$?

(3) What is the first component of $y$?

Note: compute $y$ using $w_1$, $x$, $\mu$, and $\lambda_1$.

(4) Assuming $y$ only has one component, we then do the inverse transform to recover the input:

$\hat{x} = \mathrm{pca}^{-1}(y)$

Compute $\hat{x}$ using $\mu$, $y$, $\lambda_1$, and $w_1$: $\hat{x} = \,???$

(5) Assuming $x$ and $y$ have the same number of elements, and we do the inverse transform to recover the input:

$\hat{x} = \mathrm{pca}^{-1}(y)$

What is the value of $x - \hat{x}$? (A numerical sketch for checking your answers to (1)–(5) follows below.)
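
Not part of the graded questions: the sketch below is a minimal way to check your answers numerically, assuming sk-learn's `PCA` and made-up toy data (none of these names come from the assignment).

```python
# Sanity-check sketch for question 2 (toy data, illustrative only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated toy data

pca = PCA(n_components=2, whiten=True)
Y = pca.fit_transform(X)                       # y = pca(x) for every row x

print(np.allclose(pca.mean_, X.mean(axis=0)))  # True: the model subtracts mu
print(Y.var(axis=0, ddof=1))                   # ~[1, 1]: whitened components have unit variance
X_hat = pca.inverse_transform(Y)               # x_hat = pca^{-1}(y), as in (4)/(5)
print(X_hat.shape)                             # same shape as X
```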

3. Show that PCA is a linear transform of $x - \mu$.

Note: you must use the definition at http://mathworld.wolfram.com/LinearTransformation.html, restated below for convenience.
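
Per that page, a map $T$ is a linear transformation when it satisfies additivity and homogeneity, for all vectors $u$, $v$ and any scalar $a$:

```latex
T(u + v) = T(u) + T(v)
\qquad\text{and}\qquad
T(a\,u) = a\,T(u)
```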

**Maximum Likelihood Estimation and NLL loss**

(This is a general method for estimating the parameters of a PDF from data samples.)

4. Maximum Likelihood Estimation when the PDF is an exponential distribution.

Suppose we have $N$ i.i.d. (independent and identically distributed) data samples $\{x_1, x_2, x_3, \ldots, x_N\}$ generated from a PDF which is assumed to be an exponential distribution; $x_n \in \mathbb{R}^{+}$ for $n = 1$ to $N$, which means they are positive scalars. This is the PDF:

$$p(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Your task is to build an NLL (negative log-likelihood) loss function to estimate the parameter $\lambda$ of the PDF from the data samples.

(1) Write the NLL loss function: it is a function of the parameter $\lambda$.

(2) Take the derivative of the loss with respect to $\lambda$, and set the result to 0. After some calculations, you will obtain an equation of the form $\lambda = $ ______.

Hint: read the NLL material in the GMM lecture. A numerical sanity check is sketched below.
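
Optional numerical sanity check, assuming the standard NLL construction and simulated data (the `scipy` minimizer here is illustrative, not required by the assignment): minimize the NLL over $\lambda$ and compare against your closed-form answer from (2).

```python
# Numerically minimize the exponential NLL and compare with the analytic answer.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=1000)  # simulated data, true lambda = 2.5

def nll(lam):
    # NLL of i.i.d. exponential samples: -sum_n log(lam * exp(-lam * x_n))
    return -np.sum(np.log(lam) - lam * x)

res = minimize_scalar(nll, bounds=(1e-6, 100.0), method="bounded")
print(res.x)  # numerical minimizer; should match your lambda from part (2)
```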

5. Maximum Likelihood Estimation when the PDF is histogram-like.

A histogram-like PDF $p(x)$ is defined on a 1-dimensional (1D) space that is divided into fixed regions/intervals. So, $p(x)$ takes the constant value $h_i$ in the $i$-th region. There are $K$ regions. Thus, $\{h_1, h_2, \ldots, h_K\}$ is the set of (unknown) parameters of the PDF. Also, $\sum_{i=1}^{K} h_i \Delta_i = 1$, where $\Delta_i$ is the width of the $i$-th region.

Now, we have a dataset of $N$ samples $\{x_1, x_2, x_3, \ldots, x_N\}$, and $n_i$ is the number of samples in the $i$-th region. The task is to find the best parameters of the PDF using the samples.

(1) Write the NLL loss function: it is a function of the parameters.

Note: this is a constrained optimization problem, so we need a Lagrange multiplier to convert the constrained optimization into an unconstrained one. Thus, add $\lambda \left( \sum_{i=1}^{K} h_i \Delta_i - 1 \right)$ to the loss function, where $\lambda$ is the Lagrange multiplier.

(2) Take the derivative of the loss with respect to $h_i$, set it to 0, and obtain the best parameters along with the value of $\lambda$. (A numerical check is sketched below.)
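
Optional check for part (2), a sketch with made-up data and bin edges: instead of the hand-derived Lagrange term, let `scipy` enforce the constraint directly, and compare the numerical optimum against your closed-form $h_i$.

```python
# Numerically fit the histogram-like PDF under the constraint sum_i h_i * Delta_i = 1.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=2000)           # simulated 1D data

edges = np.array([0.0, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0])  # K = 6 fixed regions
delta = np.diff(edges)                                   # region widths Delta_i
n_i, _ = np.histogram(x, bins=edges)                     # samples per region

def nll(h):
    # Each sample in region i contributes -log(h_i), so NLL = -sum_i n_i * log(h_i)
    return -np.sum(n_i * np.log(h))

constraint = {"type": "eq", "fun": lambda h: np.sum(h * delta) - 1.0}
res = minimize(nll, x0=np.full(len(delta), 0.5),
               bounds=[(1e-9, None)] * len(delta), constraints=[constraint])
print(res.x)  # numerical optimum; should match your analytic h_i
```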

### Part 2

Complete the tasks in H2P2T1.ipynb and H2P2T2.ipynb.

Note: it is very time-consuming to fit a GMM to high-dimensional data, and therefore PCA + GMM is the "standard" approach; a sketch of the pattern follows.
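
A minimal sketch of that pattern, assuming sk-learn and placeholder data/hyperparameters (the actual notebooks may differ):

```python
# PCA for dimensionality reduction, then a GMM on the reduced features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))  # stand-in for high-dimensional data

model = Pipeline([
    ("pca", PCA(n_components=10, whiten=True)),                   # 50 dims -> 10
    ("gmm", GaussianMixture(n_components=3, covariance_type="full")),
])
model.fit(X)
labels = model.predict(X)  # GMM cluster assignment for each sample
print(model.score(X))      # average log-likelihood under the fitted GMM
```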

### Grading: the number of points

| Problem | Undergraduate Student | Graduate Student |
| --- | --- | --- |
| 1 (PCA) | 2 | 2 |
| 2 (PCA) | 15 | 10 |
| 3 (PCA) | 3 | 3 |
| 4 (NLL) | 5 bonus points | 5 |
| 5 (NLL) | N.A. | 5 bonus points |
| H2P2T1 | 15 | 15 |
| H2P2T2 | 15 | 15 |
| Total number of points | 50 + 5 | 50 + 5 |

### Extra Reading

PCA is widely used in many applications. Do a Google Scholar search with PCA + some field, e.g., PCA + bioinformatics or PCA + finance, and you will find relevant papers.

https://www.nature.com/articles/s41467-018-04608-8

There are many variants of PCA, such as sparse PCA and kernel PCA, which are implemented in sk-learn (a short sketch follows the links below):

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7798&rep=rep1&type=pdf

https://www.di.ens.fr/sierra/pdfs/icml09.pdf

https://www.di.ens.fr/~fbach/sspca_AISTATS2010.pdf
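
As a sketch of that experimental workflow (toy data, arbitrary hyperparameters), the variants share the sk-learn estimator API, so swapping them in and out is cheap:

```python
# Fit several PCA variants on the same data and compare the embeddings.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # toy data; substitute your own

for model in (PCA(n_components=3),
              SparsePCA(n_components=3),
              KernelPCA(n_components=3, kernel="rbf")):
    Y = model.fit_transform(X)
    print(type(model).__name__, Y.shape)  # each yields a 3-component embedding
```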

Which one is good for your application? Test different algorithms and find the best. Remember that machine learning is more like an experimental science: you need to run lots of experiments.