CSC546 Homework 2


Part 1
PCA
1. PCA can be explained from two different perspectives. What are the two perspectives?
2. The first principal direction is the direction in which the projections of the data points have the largest
variance in the input space. We use Ξ»_1 to represent the first/largest eigenvalue of the covariance matrix, w_1
to denote the corresponding principal vector/direction (w_1 has unit length, i.e., its L2 norm is 1), ΞΌ to represent
the sample mean, and x to represent a data point. The deviation of x from the mean ΞΌ is x βˆ’ ΞΌ.
y = PCA(x) is implemented in scikit-learn with "whiten=True", and the number of components/elements of
y is usually less than the number of components/elements of x.
(1) What is the scalar projection of x in the direction of w_1?
(2) What is the scalar projection of the deviation x βˆ’ ΞΌ in the direction of w_1?
(3) What is the first component of y?
Note: compute y using w_1, x, ΞΌ, and Ξ»_1.
(4) Assuming y only has one component, we do the inverse transform to recover the input:
xΜ‚ = PCA⁻¹(y)
Compute xΜ‚ using ΞΌ, y, Ξ»_1, and w_1: xΜ‚ = ???
(5) Assuming x and y have the same number of elements, we do the inverse transform to recover the input:
xΜ‚ = PCA⁻¹(y)
What is the value of x βˆ’ xΜ‚?
3. Show that PCA is a linear transform of x βˆ’ ΞΌ.
Note: you must use the definition given at http://mathworld.wolfram.com/LinearTransformation.html
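As a numerical sanity check for question 2, the sketch below (assuming NumPy and scikit-learn are available; the synthetic data and the names X, mu, w1, lam1 are made up for illustration) fits PCA with whiten=True and compares its output against the quantities defined above.

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic 2-D data; any dataset with more samples than features works here.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

    pca = PCA(n_components=1, whiten=True)
    Y = pca.fit_transform(X)             # y = PCA(x), whitened, one component kept

    mu = pca.mean_                       # sample mean
    w1 = pca.components_[0]              # first principal direction (unit L2 norm)
    lam1 = pca.explained_variance_[0]    # first/largest eigenvalue of the covariance

    x = X[0]
    # First component of y: scalar projection of x - mu onto w1, divided by sqrt(lam1).
    print(Y[0, 0], np.dot(w1, x - mu) / np.sqrt(lam1))   # should be (nearly) equal

    # Inverse transform with a single retained component.
    x_hat = pca.inverse_transform(Y[:1])[0]
    print(x_hat, mu + Y[0, 0] * np.sqrt(lam1) * w1)       # should be (nearly) equal

Re-running this with n_components equal to the input dimension is one way to see what happens in sub-question (5).
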
Maximum Likelihood Estimation and NLL loss
(This is a general method to estimate parameters of a PDF using data samples)
4. Maximum Likelihood Estimation when the PDF is an exponential distribution.
Suppose we have N i.i.d. (independently and identically distributed) data samples {x_1, x_2, x_3, …, x_N}
generated from a PDF which is assumed to be an exponential distribution. x_n ∈ ℝ⁺ for n = 1 to N, which
means they are positive scalars. This is the PDF:
f(x) = Ξ» e^(βˆ’Ξ»x)  for x β‰₯ 0
f(x) = 0          otherwise
Your task is to build an NLL (negative log likelihood) loss function to estimate the parameter Ξ» of the PDF
from the data samples.
(1) Write the NLL loss function: it is a function of the parameter Ξ».
(2) Take the derivative of the loss with respect to Ξ», and set the result to 0.
After some calculations, you will obtain an equation of the form Ξ» = ***.
Hint: see the discussion of NLL in the GMM lecture.
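If you want to check your derivation numerically, a small sketch along these lines (assuming NumPy and SciPy; the sample size, the seed, and the helper name nll are illustrative) builds the NLL and minimizes it with a bounded scalar optimizer; the minimizer should land close to whatever closed-form expression you obtain in (2).

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Illustrative i.i.d. samples from an exponential PDF with a known rate.
    rng = np.random.default_rng(0)
    true_lam = 2.5
    x = rng.exponential(scale=1.0 / true_lam, size=1000)   # all samples are positive

    def nll(lam):
        # NLL of the exponential PDF: -sum_n log(lam * exp(-lam * x_n))
        return -np.sum(np.log(lam) - lam * x)

    # Minimize the NLL over lam > 0; the result should be close to true_lam.
    res = minimize_scalar(nll, bounds=(1e-6, 100.0), method="bounded")
    print(res.x)
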
5. Maximum Likelihood Estimation when the PDF is histogram-like.
A histogram-like PDF f(x) is defined on a 1-dimensional (1D) space that is divided into fixed
regions/intervals. So, f(x) takes a constant value h_i in the i-th region. There are K regions. Thus,
{h_1, h_2, …, h_K} is the set of (unknown) parameters of the PDF. Also, βˆ‘_{i=1}^K h_i Ξ”_i = 1, where Ξ”_i is the width
of the i-th region.
Now, we have a dataset of N samples {x_1, x_2, x_3, …, x_N}, and N_i is the number of samples in the i-th region.
The task is to find the best parameters of the PDF using the samples.
(1) Write the NLL loss function: it is a function of the parameters.
Note: this is a constrained optimization problem, so we need to use a Lagrange multiplier to convert the
constrained optimization into an unconstrained one. Thus, add Ξ»(βˆ‘_{i=1}^K h_i Ξ”_i βˆ’ 1) to the loss function, where Ξ» is the
Lagrange multiplier.
(2) Take the derivative of the loss with respect to h_i, set it to 0, and obtain the best parameters along with
the value of Ξ».
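For concreteness, here is a small sketch of what a histogram-like PDF looks like as a function (NumPy assumed; the region edges and heights below are made up for illustration and chosen so that the constraint βˆ‘_i h_i Ξ”_i = 1 holds).

    import numpy as np

    # Fixed region edges for K = 4 regions, and one constant height h_i per region.
    edges = np.array([0.0, 1.0, 2.0, 4.0, 6.0])
    delta = np.diff(edges)                   # Delta_i: width of the i-th region
    h = np.array([0.30, 0.40, 0.10, 0.05])   # heights; sum_i h_i * Delta_i = 1
    assert np.isclose(np.sum(h * delta), 1.0)

    def histogram_pdf(x):
        # Constant value h_i inside the i-th region, 0 outside all regions.
        x = np.asarray(x, dtype=float)
        idx = np.searchsorted(edges, x, side="right") - 1
        inside = (idx >= 0) & (idx < len(h))
        out = np.zeros_like(x)
        out[inside] = h[idx[inside]]
        return out

    print(histogram_pdf([0.5, 1.5, 3.0, 5.0, 7.0]))   # [0.3, 0.4, 0.1, 0.05, 0.0]
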
Part 2
Complete the task in H2P2T1.ipynb and H2P2T2.ipynb
Note: It is very time-consuming to fit a GMM to high-dimensional data, and therefore PCA + GMM is the
"standard" approach.
Grading: the number of points
Item                     Undergraduate Student   Graduate Student
1 (PCA)                  2                       2
2 (PCA)                  15                      10
3 (PCA)                  3                       3
4 (NLL)                  5 bonus points          5
5 (NLL)                  N.A.                    5 bonus points
H2P2T1                   15                      15
H2P2T2                   15                      15
Total number of points   50 + 5                  50 + 5
Extra Reading
PCA is widely used in many applications. Do a Google Scholar search for PCA plus some field, e.g., PCA
+ bioinformatics or PCA + finance, and you will find relevant papers.
https://www.nature.com/articles/s41467-018-04608-8
There are many variants of PCA, such as sparse PCA and kernel PCA, which are implemented in scikit-learn.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7798&rep=rep1&type=pdf
https://www.di.ens.fr/sierra/pdfs/icml09.pdf
https://www.di.ens.fr/~fbach/sspca_AISTATS2010.pdf
Which one is good for your application? Test different algorithms and find the best. Remember that
machine learning is more like an experimental science: you need to run lots of experiments.