Description
Questions
1. (30 points) In this problem, you will implement a program to fit two multivariate Gaussian distributions to the 2-class data and classify the test data
by computing the log odds log [P(C1|x) / P(C2|x)]. The priors P(C1) and P(C2) should be
estimated from the training data. Three pairs of training data and test data
are given. The parameters µ1, µ2, S1 and S2, the mean and covariance for class
1 and class 2, are learned in the following three models for each training data
and test data pair,
• Model 1: Assume independent S1 and S2 (the discriminant function is
as equation (5.17) in the textbook).
• Model 2: Assume S1 = S2. In other words, shared S between two classes
(the discriminant function is as equation (5.22) in the textbook).
• Model 3: Assume S1 and S2 are diagonal and the diagonal entries are
identical within S1 and S2: S1 = α1I, S2 = α2I. (You need to derive the
discriminant function yourself).
(a) (10 points) Write the likelihood function and derive S1 and S2 by maximum likelihood estimation of model 2 and model 3.
(b) (10 points) Your program should return and print out the learned parameters P(C1), P(C2), µ1 and µ2 of each data pair to either terminal or
PyCharm console. Your implementation of model 1 and model 2 should
return and print out the learned parameters S1, S2. Your implementation
of model 3 will return and print out α1 and α2.
(c) (10 points) For each test set, print out the error rates of each model
to either terminal or PyCharm console (three models for each test set).
Match each data pair to one of the models and justify your answer. Also,
explain the difference in your results in the report.
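As a rough illustration of the log-odds classification described in Question 1, the following is a minimal sketch for the case of independent covariances (Model 1), run on synthetic data. The function names (fit_class_gaussians, log_gaussian_score, log_odds) and the synthetic data are hypothetical and are not part of the required interface; the shared and isotropic covariance cases are not shown.

```python
import numpy as np

def fit_class_gaussians(X, y):
    """Estimate prior, mean and ML covariance for each class label in y."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        prior = Xc.shape[0] / X.shape[0]          # P(C) from class frequency
        mu = Xc.mean(axis=0)
        centered = Xc - mu
        S = centered.T @ centered / Xc.shape[0]   # ML estimate divides by N_c
        params[int(c)] = (prior, mu, S)
    return params

def log_gaussian_score(X, prior, mu, S):
    """log p(x|C) + log P(C), dropping the constant -(d/2) log(2*pi)."""
    diff = X - mu
    S_inv = np.linalg.inv(S)
    _, logdet = np.linalg.slogdet(S)
    # Mahalanobis term (x - mu)^T S^{-1} (x - mu), one value per row of X
    maha = np.einsum("ij,jk,ik->i", diff, S_inv, diff)
    return -0.5 * logdet - 0.5 * maha + np.log(prior)

def log_odds(X, params, c1, c2):
    """log P(c1|x) - log P(c2|x); positive values favour class c1."""
    return log_gaussian_score(X, *params[c1]) - log_gaussian_score(X, *params[c2])

# Synthetic 2-class data (an assumption, used only to exercise the sketch)
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X2 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(200, 2))
X = np.vstack([X1, X2])
y = np.array([1] * 200 + [2] * 200)

params = fit_class_gaussians(X, y)
pred = np.where(log_odds(X, params, 1, 2) > 0, 1, 2)
error_rate = float(np.mean(pred != y))
print("error rate:", error_rate)
```

The constant term of the Gaussian log density is the same for both classes, so it cancels in the log odds and is omitted here.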
Instructor: Rui Kuang (kuan0009@umn.edu). TAs: Rachit Jas (jas00001@umn.edu) and Tianci Song (song0309@umn.edu).
2. In this problem, you will apply dimension reduction and classification on the
Optdigits dataset provided in optdigits_train.txt and optdigits_test.txt.
(a) (5 points) Implement k-Nearest Neighbor (KNN) to classify the Optdigits
dataset with k = {1, 3, 5, 7}. Print out the error rate on the test set for
each value of k to either terminal or PyCharm console.
(b) (10 points) Implement your own version of Principal Component Analysis (PCA) and apply it to the Optdigits training data. Generate a plot
of proportion of variance (see Figure 6.4 (b) in the main textbook), and
select the minimum number (K) of eigenvectors that explain at least 90%
of the variance. Show both the plot and K in the report. Project the
training and test data to the K principal components and run KNN on
the projected data for k = {1, 3, 5, 7}. Print out the error rate on the test
set for each value of k to either terminal or PyCharm console.
(c) (5 points) Next, project both the training and test data to R^2 using only
the first two principal components to plot all samples in the projected
space and label some data points with the corresponding digit in 10 different colors for the 10 types of digits for a good visualization (similar to
Figure 6.5).
(d) (10 points) Implement your own version of Linear Discriminant Analysis (LDA) and apply it to compute a projection only using the Optdigits
training data into L dimensions (L = 2, 4, 9). Run KNN on the projected
data for k = {1, 3, 5}. Print out the error rate on the test set for each
combination of k and L to either terminal or PyCharm console. (Hint:
the numpy.linalg module has a function pinv(), which computes the pseudo-inverse and can be used as an approximate inverse of a singular matrix.)
(e) (10 points) Similarly, project both the training and test data to R^2 with
the LDA projections, and plot all samples in the projected space and label
some data points with the corresponding digit in 10 different colors for
the 10 types of digits.
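The proportion-of-variance step in Question 2 (b) can be sketched as follows. This is a minimal illustration on synthetic data, assuming the covariance is estimated with the 1/N convention; the function names pca_eigh and min_components are hypothetical and do not replace the required myPCA interface.

```python
import numpy as np

def pca_eigh(X):
    """Return the mean, eigenvalues (descending) and eigenvectors (as columns)."""
    mean = X.mean(axis=0)
    centered = X - mean
    cov = centered.T @ centered / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    return mean, eigvals[order], eigvecs[:, order]

def min_components(eigvals, threshold=0.90):
    """Smallest K whose cumulative proportion of variance reaches the threshold."""
    pov = np.cumsum(eigvals) / np.sum(eigvals)
    K = int(np.searchsorted(pov, threshold) + 1)
    return K, pov

# Synthetic data with three features of very different variance (an assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 2.0, 0.1])

mean, eigvals, eigvecs = pca_eigh(X)
K, pov = min_components(eigvals, threshold=0.90)
Z = (X - mean) @ eigvecs[:, :K]                # project onto the first K components
print(K, Z.shape)
```

The array pov holds the cumulative proportion of variance, so plotting it against the number of eigenvectors with the plot function in the pyplot module of matplotlib gives a curve of the kind the question asks for.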
3. In this problem, you will work on dimension reduction and classification on
a Faces dataset from the UCI repository (https://archive.ics.uci.edu/ml/datasets/CMU+Face+Images). We provide the processed files
face_train_data_960.txt and face_test_data_960.txt with 500 and 124 images, respectively. Each image is of size 30×32, with its pixel values stored in one row of
the file; the last column gives the label of the image: 1 (sunglasses) or 0 (open).
You can visualize the ith image with the following Python commands:
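import numpy as np
import matplotlib.pyplot as plt
# img_data is assumed to hold the 960 pixel values of the ith image (one row of the file, label column excluded)
plt.imshow(np.reshape(img_data, (30, 32)))
plt.show()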
(a) (10 points) Implement PCA and apply it to find the principal components of the combined training and test sets. First, visualize the first 5
eigen-faces using a command similar to the one above.
(b) (10 points) Repeat what you did in question 2 (b), using PCA and KNN
on this Faces dataset.
(c) (10 points) Use the first K = {10, 50, 100} principal components to approximate the first five images of the training set (first five rows of the data
matrix): project the centered data onto the first K principal components, then “back project” (a weighted sum of the components) to the
original space and add the mean. For each K, plot the reconstructed images. Explain your observations in the report.
(Hint: Read section 6.3 on pages 126 and 127 of the textbook for the
projection and “back projection” to the original space.)
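The projection and back projection in Question 3 (c) can be sketched as below. This is a minimal illustration on random stand-in data (50 hypothetical "images" of 12 pixels rather than the Faces files), and the helper name reconstruct is an assumption made for the example.

```python
import numpy as np

# Random stand-in data (an assumption): 50 samples with 12 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))

mean = X.mean(axis=0)
centered = X - mean
eigvals, eigvecs = np.linalg.eigh(centered.T @ centered / X.shape[0])
W_all = eigvecs[:, ::-1]          # columns sorted by decreasing eigenvalue

def reconstruct(x, mean, W):
    """Project centered samples onto the columns of W, then map back and add the mean."""
    z = (x - mean) @ W            # coordinates in the principal subspace
    return z @ W.T + mean         # weighted sum of the components plus the mean

errors = []
for K in (2, 6, 12):
    x_hat = reconstruct(X[:5], mean, W_all[:, :K])
    errors.append(float(np.mean((X[:5] - x_hat) ** 2)))
print(errors)
```

Because the principal components are orthonormal, the reconstruction error cannot increase as K grows, and it vanishes (up to rounding) when all components are used. For image data, each reconstructed row can be reshaped and displayed with imshow in the same way as the original images.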
Instructions
• Solutions to all questions must be included in a report including result explanations, learned parameter values and all error rates and plots.
• All programming questions must be written in Python; no other programming
languages will be accepted. Only numpy, scipy and matplotlib may be
used to implement the algorithms. The code must be able to be executed
from either terminal or PyCharm console on the cselabs machines. Each function must take the inputs in the order specified and print/display the required
output to either terminal or PyCharm console. For each part, you can submit
additional files/functions (as needed) which will be used by the main functions
specified below. Put comments in your code so that one can follow the key
parts and steps. Please follow the rules strictly. If we cannot run your
code, you will receive no credit.
• Question 1:
– MultiGaussian(training_data, testing_data, Model), where training_data is the file name of the training data,
testing_data is the file name of the testing data, and Model is the model number. The function
must output the learned parameters and error rates as required in Question 1.
• Question 2:
– myKNN(training_data, test_data, k). The function returns the prediction
for the test set.
– myPCA(data, num_principal_components). The function returns the principal components and the corresponding eigenvalues.
– myLDA(data, num_principal_components). The function returns the projection matrix and the corresponding eigenvalues.
– script_2a.py, script_2b.py and script_2c.py: script files that solve question
2 (a), (b), (c), (d) and (e) by calling the appropriate functions, producing the plots
and printing the requested values.
• Question 3:
– script_3a.py, script_3b.py and script_3c.py: script files that solve question
3 (a), (b) and (c) by calling the appropriate functions, producing the plots and printing
the requested values.
• For each dataset, rows are the samples and columns are the features with the
last column containing the label.
• You can use the eigh function in the linalg module of numpy to calculate
eigenvalues and eigenvectors (if you use the eig function in the linalg module of
numpy, you might get complex numbers in your eigenvalues). To obtain the
distance between each pair of samples in KNN, consider using cdist
in the spatial.distance module of scipy. To visualize the projected data, you
can use the scatter function in the pyplot module of matplotlib, and to add
text to the corresponding point, you can use either the text or the annotate function in the
pyplot module of matplotlib.
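As an illustration of the cdist hint, here is a minimal sketch of a KNN prediction step on a tiny hand-made dataset. The function name knn_predict, its argument layout (features and labels passed separately) and the tie-breaking rule are assumptions made for this example and differ from the required myKNN signature.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_predict(train_X, train_y, test_X, k):
    """Predict a label for each test row by majority vote among its k nearest training rows."""
    dists = cdist(test_X, train_X)                 # Euclidean distance by default
    nearest = np.argsort(dists, axis=1)[:, :k]     # indices of the k closest training rows
    votes = train_y[nearest]                       # shape (n_test, k)
    preds = []
    for row in votes:
        labels, counts = np.unique(row, return_counts=True)
        # Assumption: ties are broken in favour of the smallest label
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Tiny hand-made dataset (an assumption, used only to exercise the sketch)
train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])
test_X = np.array([[0.2, 0.2], [5.5, 5.5]])
test_y = np.array([0, 1])

for k in (1, 3):
    pred = knn_predict(train_X, train_y, test_X, k)
    print(k, float(np.mean(pred != test_y)))
```

Computing all pairwise distances in one cdist call avoids an explicit double loop over test and training samples.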
Submission
• Things to submit:
1. hw2_sol.pdf: A document which contains the report with solutions to all
questions.
2. MultiGaussian: Code for Question 1.
3. myKNN.py, myPCA.py, myLDA.py, script_2a.py, script_2b.py, script_2c.py:
Code for Question 2.
4. script_3a.py, script_3b.py, script_3c.py: Code for Question 3.
5. Any other files, except the data, which are necessary for your code.
• Submit: All materials must be zipped in one file and submitted electronically
via Canvas.