## Description

### Part 1

**PCA**

1. PCA can be explained from two different perspectives. What are the two perspectives?

2. The first principal direction is the direction in which the projections of the data points have the largest variance in the input space. We use $\lambda_1$ to represent the first/largest eigenvalue of the covariance matrix, $w_1$ to denote the corresponding principal vector/direction ($w_1$ has unit length, i.e., its L2 norm is 1), $\mu$ to represent the sample mean, and $x$ to represent a data point. The deviation of $x$ from the mean $\mu$ is $x - \mu$.

$y = \mathrm{pca}(x)$ is implemented in sk-learn with `whiten=True`, and the number of components/elements of $y$ is usually less than the number of components/elements of $x$.

(1) What is the scalar projection of $x$ in the direction of $w_1$?

(2) What is the scalar projection of the deviation $x - \mu$ in the direction of $w_1$?

(3) What is the first component of $y$?

Note: compute $y$ using $w_1$, $x$, $\mu$, and $\lambda_1$.

(4) Assuming $y$ only has one component, we then do the inverse transform to recover the input:

$\hat{x} = \mathrm{pca}^{-1}(y)$

Compute $\hat{x}$ using $\mu$, $y$, $\lambda_1$, and $w_1$: $\hat{x} = \,???$

(5) Assuming $x$ and $y$ have the same number of elements, and we do the inverse transform to recover the input:

$\hat{x} = \mathrm{pca}^{-1}(y)$

What is the value of $x - \hat{x}$? (A numerical sketch for checking your answers to (1)–(5) follows below.)
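
Not part of the graded questions: the sketch below is a minimal way to check your answers numerically, assuming sk-learn's `PCA` and made-up toy data (none of these names come from the assignment).

```python
# Sanity-check sketch for question 2 (toy data, illustrative only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated toy data

pca = PCA(n_components=2, whiten=True)
Y = pca.fit_transform(X)                       # y = pca(x) for every row x

print(np.allclose(pca.mean_, X.mean(axis=0)))  # True: the model subtracts mu
print(Y.var(axis=0, ddof=1))                   # ~[1, 1]: whitened components have unit variance
X_hat = pca.inverse_transform(Y)               # x_hat = pca^{-1}(y), as in (4)/(5)
print(X_hat.shape)                             # same shape as X
```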

3. Show that PCA is a linear transform of $x - \mu$.

Note: you must use the definition at http://mathworld.wolfram.com/LinearTransformation.html, restated below for convenience.
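
Per that page, a map $T$ is a linear transformation when it satisfies additivity and homogeneity, for all vectors $u$, $v$ and any scalar $a$:

```latex
T(u + v) = T(u) + T(v)
\qquad\text{and}\qquad
T(a\,u) = a\,T(u)
```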

**Maximum Likelihood Estimation and NLL loss**

(This is a general method for estimating the parameters of a PDF from data samples.)

4. Maximum Likelihood Estimation when the PDF is an exponential distribution.

Suppose we have $N$ i.i.d. (independent and identically distributed) data samples $\{x_1, x_2, x_3, \ldots, x_N\}$ generated from a PDF which is assumed to be an exponential distribution; $x_n \in \mathbb{R}^{+}$ for $n = 1$ to $N$, which means they are positive scalars. This is the PDF:

$$p(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Your task is to build an NLL (negative log-likelihood) loss function to estimate the parameter $\lambda$ of the PDF from the data samples.

(1) Write the NLL loss function: it is a function of the parameter $\lambda$.

(2) Take the derivative of the loss with respect to $\lambda$, and set the result to 0. After some calculations, you will obtain an equation of the form $\lambda = $ ______.

Hint: read the NLL material in the GMM lecture. A numerical sanity check is sketched below.
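
Optional numerical sanity check, assuming the standard NLL construction and simulated data (the `scipy` minimizer here is illustrative, not required by the assignment): minimize the NLL over $\lambda$ and compare against your closed-form answer from (2).

```python
# Numerically minimize the exponential NLL and compare with the analytic answer.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=1000)  # simulated data, true lambda = 2.5

def nll(lam):
    # NLL of i.i.d. exponential samples: -sum_n log(lam * exp(-lam * x_n))
    return -np.sum(np.log(lam) - lam * x)

res = minimize_scalar(nll, bounds=(1e-6, 100.0), method="bounded")
print(res.x)  # numerical minimizer; should match your lambda from part (2)
```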

5. Maximum Likelihood Estimation when the PDF is histogram-like.

A histogram-like PDF $p(x)$ is defined on a 1-dimensional (1D) space that is divided into fixed regions/intervals. So, $p(x)$ takes the constant value $h_i$ in the $i$-th region. There are $K$ regions. Thus, $\{h_1, h_2, \ldots, h_K\}$ is the set of (unknown) parameters of the PDF. Also, $\sum_{i=1}^{K} h_i \Delta_i = 1$, where $\Delta_i$ is the width of the $i$-th region.

Now, we have a dataset of $N$ samples $\{x_1, x_2, x_3, \ldots, x_N\}$, and $n_i$ is the number of samples in the $i$-th region. The task is to find the best parameters of the PDF using the samples.

(1) Write the NLL loss function: it is a function of the parameters.

Note: this is a constrained optimization problem, so we need a Lagrange multiplier to convert the constrained optimization into an unconstrained one. Thus, add $\lambda \left( \sum_{i=1}^{K} h_i \Delta_i - 1 \right)$ to the loss function, where $\lambda$ is the Lagrange multiplier.

(2) Take the derivative of the loss with respect to $h_i$, set it to 0, and obtain the best parameters along with the value of $\lambda$. (A numerical check is sketched below.)
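
Optional check for part (2), a sketch with made-up data and bin edges: instead of the hand-derived Lagrange term, let `scipy` enforce the constraint directly, and compare the numerical optimum against your closed-form $h_i$.

```python
# Numerically fit the histogram-like PDF under the constraint sum_i h_i * Delta_i = 1.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=2000)           # simulated 1D data

edges = np.array([0.0, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0])  # K = 6 fixed regions
delta = np.diff(edges)                                   # region widths Delta_i
n_i, _ = np.histogram(x, bins=edges)                     # samples per region

def nll(h):
    # Each sample in region i contributes -log(h_i), so NLL = -sum_i n_i * log(h_i)
    return -np.sum(n_i * np.log(h))

constraint = {"type": "eq", "fun": lambda h: np.sum(h * delta) - 1.0}
res = minimize(nll, x0=np.full(len(delta), 0.5),
               bounds=[(1e-9, None)] * len(delta), constraints=[constraint])
print(res.x)  # numerical optimum; should match your analytic h_i
```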

### Part 2

Complete the tasks in H2P2T1.ipynb and H2P2T2.ipynb.

Note: it is very time-consuming to fit a GMM to high-dimensional data, and therefore PCA + GMM is the "standard" approach; a sketch of the pattern follows.
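
A minimal sketch of that pattern, assuming sk-learn and placeholder data/hyperparameters (the actual notebooks may differ):

```python
# PCA for dimensionality reduction, then a GMM on the reduced features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))  # stand-in for high-dimensional data

model = Pipeline([
    ("pca", PCA(n_components=10, whiten=True)),                   # 50 dims -> 10
    ("gmm", GaussianMixture(n_components=3, covariance_type="full")),
])
model.fit(X)
labels = model.predict(X)  # GMM cluster assignment for each sample
print(model.score(X))      # average log-likelihood under the fitted GMM
```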

### Grading: the number of points

| Problem | Undergraduate Student | Graduate Student |
| --- | --- | --- |
| 1 (PCA) | 2 | 2 |
| 2 (PCA) | 15 | 10 |
| 3 (PCA) | 3 | 3 |
| 4 (NLL) | 5 bonus points | 5 |
| 5 (NLL) | N.A. | 5 bonus points |
| H2P2T1 | 15 | 15 |
| H2P2T2 | 15 | 15 |
| Total number of points | 50 + 5 | 50 + 5 |

### Extra Reading

PCA is widely used in many applications. Do a Google Scholar search with PCA + some field, e.g., PCA + bioinformatics or PCA + finance, and you will find relevant papers.

https://www.nature.com/articles/s41467-018-04608-8

There are many variants of PCA, such as sparse PCA and kernel PCA, which are implemented in sk-learn (a short sketch follows the links below):

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7798&rep=rep1&type=pdf

https://www.di.ens.fr/sierra/pdfs/icml09.pdf

https://www.di.ens.fr/~fbach/sspca_AISTATS2010.pdf
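
As a sketch of that experimental workflow (toy data, arbitrary hyperparameters), the variants share the sk-learn estimator API, so swapping them in and out is cheap:

```python
# Fit several PCA variants on the same data and compare the embeddings.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # toy data; substitute your own

for model in (PCA(n_components=3),
              SparsePCA(n_components=3),
              KernelPCA(n_components=3, kernel="rbf")):
    Y = model.fit_transform(X)
    print(type(model).__name__, Y.shape)  # each yields a 3-component embedding
```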

Which one is good for your application? Test different algorithms and find the best. Remember that machine learning is more like an experimental science: you need to run lots of experiments.