CSCI567 Homework 6


Category: You will Instantly receive a download link for .zip solution file upon Payment


5/5 - (1 vote)

1 Principal Component Analysis

For the following problems, we have N zero-mean data points xi ∈ R
D×1 and S =
i=1 xix
i ∈
D×D is the sample covariance matrix of the dataset.

1.1 Derivation of Second Principal Component
(a) (5 points) Let cost function
J =
(xi − pi1e1 − pi2e2)
(xi − pi1e1 − pi2e2)
with e1 and e2 are the orthonormal vector basis for the dimensionality reduction, i.e. ke1k2 =
1, ke2k2 = 1, and e
e2 = 0, and some coefficients pi1 and pi2.

Show that ∂J
= 0 yields pi2 = e
2 xi
, i.e. the projection length of data point xi along vector

(b) (5 points) Show that the value of e2 that minimizes cost function
J˜ = −e
2 Se2 + λ2

2 e2 − 1

+ λ12
2 e1 − 0

is given by the eigenvector associated with the second largest eigenvalue of S.

λ2 is the Lagrange Multiplier for equality constraint e
e2 = 1 and λ12 is the Lagrange
Multiplier for equality constraint e
e1 = 0.

Hint: Recall that Se1 = λ1e1 (e1 is the normalized eigenvector associated with the largest
eigenvalue λ1 of S) and ∂y
∂y =

A + AT

y. Also notice that S is a symmetric matrix.

1.2 Derivation of PCA Residual Error
(a) (5 points) Prove that for a data point xi
kxi −
2 = x
i xi −
j xix
i ej

Hint: The most common method to proof a mathematical equation of this flavor is by using
mathematical induction. To perform a proof by mathematical induction in this case, first
show that the equation above holds for the base case K = 1, and then using the assumption
that the equation holds for K = k − 1, show that the equation also holds for K = k, for any
k > 1.
Use the fact that e
ej = 1 (length of eigenvector ej is 1) and e
em = 0 for j 6= m (eigenvectors
are perpendicular each other for square symmetric matrix S). Also, use definition pij = e
j xi

(b) (5 points) Now show that
JK ,

i xi −
j xix
i ej

 =
i xi −
Hint: recall that e
j Sej = λje
ej = λj

(c) (5 points) If K = D principal components are used, there is no truncation, so JD = 0.
Use this to show that the error from only using K < D principal components is given by
JK =

2 Hidden Markov Model

A simple DNA sequence is O = O1O2 · · · OT , with each component Oi takes from {A, C, G, T}.
Assume it is generated from a Hidden Markov Model controlled by a hidden variable X, which
takes two possible states S1, S2.

This HMM has the following parameters Θ = {πi
, aij , bik} for i, j = 1, 2 and k ∈ {A, C, G, T}:
• Initial state distribution πi for i = 1, 2:
π1 = P(X1 = S1) = 0.7; π2 = P(X1 = S2) = 0.3.

• Transition probabilities aij = P(Xt+1 = Sj |Xt = Si) for any t ∈ N
+, i = 1, 2, and j = 1, 2:
a11 = 0.8, a12 = 0.2; a21 = 0.4, a22 = 0.6.

• Emission probabilities bik = P(Ot = k|Xt = Si) for any t ∈ N
+, i = 1, 2, and k ∈ {A, C, G, T}:
b1A = 0.4, b1C = 0.1, b1G = 0.4, b1T = 0.1;
b2A = 0.2, b2C = 0.3, b2G = 0.2, b2T = 0.3;

Assume we have an observed sequence O = O1O2 · · · O6 = AGCGT A, please answer the following
questions with step-by-step computations and explanations.
(a) (5 points) Probability of an observed sequence. Calculate P(O; Θ).
(b) (5 points) Filtering. Calculate P(X6 = Si
|O; Θ) for i = 1, 2.
(c) (5 points) Smoothing. Calculate P(X4 = Si
|O; Θ) for i = 1, 2.
(d) (5 points) Most likely explanation. Compute X = X1X2 · · · X6 = arg maxX P(X|O; Θ).
(e) (5 points) Prediction. Compute O7 = arg maxO P(O|O; Θ).

3 Programming Part

3.1 Principal Component Analysis (25 points)
In this programming assignment, you will be implementing the Principal Component Analysis
(PCA) algorithm on MATLAB for data (image) representation compression and then use the
compressed representation for classification.

For this purpose, the datasets could be loaded from file hw6 pca.mat, which consists of labeled
training dataset (X.train and y.train) and labeled test dataset (X.test and y.test). The
training dataset X.train contains 9000 samples (9000 rows), each row being 1 sample. Each
sample (each row) is a 16-by-16 grayscale pixel intensity (with possible values between 0 and 255,
inclusive), which mostly represent a single digit handwritten number, and can be visualized on
MATLAB. For example if you want to visualize the 5438-th handwritten digit in the training
dataset, you can use the following commands on MATLAB Command Window:
load(’hw6 pca.mat’)
imshow(double(reshape(X.train(5438,:), 16, 16)),[]);

which will display a handwritten number ’2’. The training label y.train contains the groundtruth label of this handwritten digit. The test dataset X.test and its label y.test are similar
correspondingly to X.train and y.train, but with much lesser samples, 2000 in total.

(a) (5 points) To begin with, you will be implementing eigenvectors computation and sorting based on eigenvalues’ magnitude in the provided template file get sorted eigenvecs.m.
Please see the description inside the file for more details.

(b) (5 points) Each of the computed eigenvectors is a vector ∈ R
, thus by itself it can be
displayed as an 16 × 16 image, using the command:
imshow(double(reshape(eigenvecs(:,i), 16, 16)),[]);
to display the image of i-th eigenvector (let us call it an ”eigendigit”).

Please plot the top 8 eigendigits, corresponding to the top 8 biggest eigenvalues. You may
use MATLAB commands figure, hold on, subplot, imshow, hold off for this purpose.
Report this plot on your *.pdf file (as well as the hardcopy).

(c) (10 points) Perform data representation compression on the training data. Since X.train
is R
9000×256, if you pick top K eigendigits to project into, you will get the compressed
representation X compressed, which is R
9000×K. Now, from this compressed representation X compressed, you can reconstruct the data, producing X reconstruction, which is
9000×256 again, but somewhat distorted as compared to the original image. The degree of
distortion depends on how many eigendigits K that you used to represent the data in compression. The less K you used, the more severe the distortion is. Here is an example image
of the reconstruction.

In the picture above, the first row is the raw image drawn from X.train. The second to
the sixth row are the reconstructed digits, when using K = 1, 5, 10, 20, 80 eigendigits (pc
stands for principal components, or eigendigits), respectively. The columns from left-to-right
corresponds to sample # 4900, 5900, 6900, 7900, and 8900, respectively, drawn from X.train.

For this part, you need to report similar images like the above, but for sample # 5500, 6500,
7500, 8000, and 8500, respectively, drawn from X.train. Again, you may use MATLAB
commands figure, hold on, subplot, imshow, hold off for this purpose.
Report this plot on your *.pdf file (as well as the hardcopy).

(d) (5 points) In this final part of programming assignment for PCA, we will do classification on
the compressed data. Therefore you need to compress both X.train and X.test in the same
manner (using the same number of eigendigits representer). Do NOT forget to subtract
each sample in X.test with the mean of X.train, before projecting them to get the compressed representation.

For classification, we will just use the simplest and readily-availableon-MATLAB Decision Tree algorithm. You may use either in
MATLAB R2013b or fitctree in MATLAB R2015b. Use the following command to perform classification:
tree = data,train label,’SplitCriterion’, ’deviance’);
train label inferred = predict(tree,train data);
test label inferred = predict(tree,test data);

Please report the accuracy of the training prediction and test prediction (similar like in Homework 1), as well the amount of time (in seconds) required to finish the computation in your
computer, for 5 different choices of K (number of top eigendigits used in the representation): 1, 5, 10, 20, 80.

Also, provide the analysis on this results, why the accuracy
and computation time differs between different choices of K.
Report both the results and the analysis on your *.pdf file (as well as the hardcopy).

3.2 Hidden Markov Model (25 points)
In this programming assignment, you are supposed to implement an HMM model to discover the
anomaly patterns in the computer systems.

3.2.1 Data Descriptions
We take part of the UMN login and ps dataset from these pages:
~immsec/systemcalls.htm and Please
carefully study the data descriptions in the two web pages.

The training file hw6 hmm train.txt is from one of the ps normal data file, and the test files
hw6 hmm test normal.txt and hw6 hmm test trojan.txt are from the other ps normal data file
and the recovered Trojan ps data file, respectively.

Basically, hw6 hmm train.txt, hw6 hmm test normal.txt, and hw6 hmm test trojan.txt contain 15, 9, and 11 traces (time series), respectively, with varied lengths. Notice that the same PID
in different files refers to different traces.

3.2.2 Algorithm
Implement the EM algorithm to train a single HMM for all traces using EM algorithm. Set the
number of different hidden states S equal to number of different observations O (different system
call numbers shown in the training file).
We denote the parameters of HMM as Θ = {πi
, aij , bik}, with initial state distributions π,
transition probabilities aij , and emission probabilities bik.

3.2.3 Questions
Please answer these questions in your *.pdf file report.
(a) (2 points) What are the lengths of the longest and shortest traces in the training data? How
many different observations?

(b) (5 points) Since you are supposed to train a single HMM from multiple time series, the
original Baum-Welch algorithm is no longer applicable, but a modified EM algorithm still
works. Thus you need to compute the log likelihood of the training data D when you know
the hidden states S for all the time series.

Please give the formula to compute the log likelihood LD of the training data D. It might
contains terms related to the HMM parameters Θ, the hidden states S, and the observations
O. Please describe the steps and notations clearly.

(c) (9 points) Please provide the updating formulas in M-step for πi
, aij , bik.
We define the soft counts: γ
(i) is the soft count of the hidden state of time series d at time t
is i, i.e., #{s
t = i}, and δ
(i, j) is the soft count of the hidden states of time series d at time
t and t + 1 are i, j, respectively, i.e., #{s
t+1 = ij}.

You can introduce other notations for intermediate computation results if necessary. Please
describe the steps and notations clearly.

(d) (9 points) Based on the update rules you have derived, write a Matlab program to train
HMM on the training dataset. Please do not use any existing packages.

Then apply the well-trained model on the two test datasets: For each time series O,
(a) compute its log likelihood P(O|Θ), and
(b) generate 100 time series samples with the same length of O, and compute the mean and
standard deviation of the log likelihood of these samples.

Please report these values in tables. For each time series, you need to report its log likelihood,
the sampled log likelihood mean and standard deviation, in total (9 + 11) × 3 = 60 values.

You are supposed to see difference of the log likelihood values for the normal and trojan(anomaly) time series in this question.

Note: You need to avoid the numerical underflow caused by multiplying many small numbers
(probabilities), so you may choose to store and manipulate log x instead of x in your programs.

Then instead of computing x × y, you just compute log x + log y. However, you need be careful to
compute x + y without converting the logarithm number back into original number explicitly. For
example, assuming x > y, then log(x + y) = log x + log (1 + exp(log y − log x)).

Submission Instruction: You need to provide the followings:
• Provide your answers to problems 1-2, 3.1(b-d), and 3.2 in *.pdf file, named as
CSCI567 hw6 fall15.pdf. You need to submit the homework in both hard copy (at
the collection locker #19 at the PHE building 1st floor by 5pm of the deadline date)
and electronic version as *.pdf file on Blackboard. If you choose handwriting instead of
typing all the answers, you will get 40% points deducted.

• Submit ALL the code and report via Blackboard. The only acceptable language is MATLAB. For your program, you MUST include the main function called CSCI567 hw6 fall15.m
in the root of your folder. After running this main file, your program should be able
to generate all of the results needed for this programming assignment, either as plots
or console outputs.

You can have multiple files (i.e your sub-functions), however, the
only requirement is that once we unzip your folder and execute your main file, your
program should execute correctly. Please double-check your program before submitting.
You should only submit one *.zip file. No other formats are allowed except *.zip file.
Also, please name it as [lastname] [firstname] hw6