Description
1. (20 points) Estimating π Using Simulation.
1
Read all instructions before you start.
Consider a bivariate random variable X := (X1, X2), where X1, X2 are both independent and
have a uniform distribution over (−1, 1).
(a) (5 points) What is the probability that the random variable X takes values within a circle of
radius 1 centered at the origin ? Derive the expression.
(b) (5 points) Use the previous result to estimate the value of π purely relying on simulation of
X. Justify your approach in words.
(c) (5 points) Report the estimates of π using sample sizes N = 10, 102
, 103
, 104
, 105
, 106
, 107
, 108
.
Use the “single” datatype for the simulation. How will you write code to handle the situation
when you desire to increase the accuracy of the estimate with a sample size as large as
N = 109 or larger ? Does your code handle this case ?
(d) (5 points) How will you estimate the sample size M required to have the estimate of π within
[π − 0.01, π + 0.01] with 0.95 probability ? Assume the true value of π to be known, so that
you know the interval exactly. Describe an algorithm and justify it. Compute this estimate for
the sample size M and report it.
2. (20 points) Multivariate Gaussian.
Read all instructions before you start.
Generate N points (with N taking the values 10, 102
, 103
, 104
, 105
) from a multivariate 2D Gaussian probability density function with mean µ = [1, 2]0 and a covariance matrix C with the first row
as [1.6250, −1.9486] and the second row as [−1.9486, 3.8750].
For this generation, you are only allowed to use the randn() and eig() functions in Matlab.
For each data sample of size N, compute the maximum-likelihood (ML) estimates of the mean
and the covariance matrix.
For this estimation, you are only allowed to use the sum() function in Matlab.
(a) (5 points) Describe and justify your method for generating sample points from the 2D Gaussian.
(b) (5 points) For each value of N, repeat the experiment 100 times, and plot a box plot of the
error between the true mean µ and the ML estimate µbN (which depends on N), where the
error measure is k µ − µbN k2 / k µ k2. Use a logarithmic scale on the horizontal axis, i.e.,
log10 N.
(c) (5 points) For each value of N, repeat the experiment 100 times, and plot a box plot of the
error between the true covariance C and the ML estimate CbN (which depends on N), where
the error measure is k C − CbN kFro / k C kFro. Use a logarithmic scale on the horizontal
axis, i.e., log10 N.
(d) (5 points) For each value of N, for a single data sample, within a single figure, plot the 2D
scatter plot of the generated data and show the principal modes of variation of the data by
plotting a line starting at the empirical mean and going a distance equal to the empirical
eigen-value along a direction given by the empirical eigen-vector.
2
3. (15 points) PCA and Hyperplane Fitting.
Read all instructions before you start.
Consider the observed set of points of the form (x, y) ∈ R
2
in the file “points2D_Set1.mat”.
Assume each observation (x, y) is drawn independently from the joint probability density function
P(X, Y ) of random variables X and Y .
For this question, you cannot use the functions mean(), cov(), and pca() in Matlab.
• (5 points) How can principal component analysis (PCA) be used to best approximate a linear
relationship between random variables X and Y . Describe the method clearly, using appropriate mathematical descriptions for clarity. Your description should be clear enough to lead to a
programmable implementation.
• (5 points) Show a scatter plot of the points. Overlay on the scatter plot, the graph of a line
showing the linear relationship between Y and X.
• (5 points) Repeat the same analysis for the set of points in “points2D_Set2.mat”. Show a scatter
plot of the points. Overlay on the scatter plot, the graph of a line showing the linear relationship
between Y and X. Compared to the result on the other set of points, justify the quality of the
approximation resulting in this question using logical arguments.
4. (25 points) Principal Component Analysis (PCA).
Read all instructions before you start.
Download the dataset comprising images of handwritten digits in http://yann.lecun.com/
exdb/mnist; this has been downloaded in the folder “data” and stored as “mnist.mat”.
Each image is stored as a matrix (28 × 28) of numbers. You can visualize these images (or
matrices) in Matlab using the functions imagesc() or imshow(). Use the Matlab command “axis
equal” to use the same units on each axis of the image.
For the following computations, make sure to convert (cast) the integer data type to a floatingpoint type. For this question, you cannot use the functions mean(), cov(), and pca() in Matlab.
For every digit, from 0 to 9, compute:
(i) the mean µ (3 points),
(ii) the covariance matrix C (5 points), and
(ii) the principal mode of variation determined by the eigenvector v1 and the corresponding eigenvalue λ1 (where λ1 is the largest of all eigenvalues) of the covariance matrix C (7 points).
Note: Before computing the mean and covariance matrix, convert each 28×28 pixel image matrix
to a 282 × 1 vector by concatenating its columns. To visualize the 282 × 1 mean vector, convert
it back to a matrix and then visualize it using imagesc(). Use the reshape() function to change
matrices to vectors and vice versa. The covariance matrix will be of size 282 × 282
.
• (5 points) For each digit, sort the 282 eigenvalues of the covariance matrix and plot them as
a graph. Comment and justify what you observe. How many “principal” / significant modes of
variation (i.e., number of “large” eigenvalues) do you find, for each digit ? Are the significant
modes of variation equal to 282 or far less ? Why ?
• (5 points) For each digit, show the 3 images side by side: (i) µ−
√
λ1v1, (ii) µ, and (iii) µ+
√
λ1v1,
to show the principal mode of variation of the digits around their mean. Comment and justify what
3
you observe. For a certain digit, say 1, what does the principal mode of variation tell you about
how people write that digit ?
5. (10 points) Principal Component Analysis (PCA) for Dimensionality Reduction.
Read all instructions before you start.
Download the dataset comprising images of handwritten digits in http://yann.lecun.com/
exdb/mnist; this has been downloaded in the folder “data” and stored as “mnist.mat”.
As of now, for each digit, each 28 × 28 pixel image is represented using 282
coordinate values
in the Euclidean space of dimension 282
. Suppose you decide to re-represent the images using
only 84 coordinates (instead of 282 = 784) in a 84-dimensional basis for some 84-dimensional
hyperplane within the original Euclidean space, such that the chosen 84-dimensional hyperplane
maximizes the total dispersion of the original data (for the chosen digit) within the hyperplane.
• (5 points) Write a function to compute those 84 coordinates, for each of the ten digits (0–9).
• (5 points) Give an algorithm for regenerating / reconstructing the image using those 84 coordinates (and the knowledge of the designed 84-dimensional basis). For each of the ten digits (0–9),
pick an image, and show the original and the reconstructed images side by side.
6. (25 points) Principal Component Analysis (PCA) for Another Image Dataset
Read all instructions before you start.
Consider the dataset provided within the folder “data_fruit”
For this question, you cannot use the functions mean(), cov(), and pca() in Matlab.
Each datum is an image of size 80 × 80 pixels with 3 color channels red (R), green (G), and blue
(B), i.e., a 80 × 80 × 3 array. For PCA, each image should be resized to a vector of length 19200.
For visualization, reshape each vector back to a RGB image of size 80 × 80 pixels using the
function reshape(), followed by a shift and rescaling of the values into the range [0, 1], followed by
displaying the matrix using the function image().
• (9 points: 1 + 4 + 4) Similar to the analysis done in previous question, find the mean µ, the
covariance matrix C, and the top 4 principle eigenvectors of C. Display the mean and the
eigenvectors as images (side by side, in the same figure); you can use the function subplot().
Find the top 10 eigenvalues, sort them, and plot their values on a graph. Use the function eigs()
for efficient computation.
• (8 points: 4 + 4) For each fruit image in the dataset, finds its closest representation as a
linear combination of the top 4 eigenvectors added to the mean. Use the measure of closeness
as the Frobenius norm of the difference. Describe the algorithm used to produce this closest
representation in mathematical terms and describe the logic behind your algorithm. Display the
original fruit image and its closest representation, as images (side by side, in the same figure).
• (9 points: 3 + 3 + 3) Using all of the top 4 eigenvectors and the mean image, sample random
images to generate new images of “fruit”. Describe the underlying algorithm clearly in words and
including suitable mathematical notation. Display three such images that are distinct from any
image in the given the dataset, but are representative of the dataset and can be considered as
that of a new / generated fruit.
4