## Description

1. (10 points)

   (a) Let $v_1, v_2, \ldots, v_d$ be $d$ mutually orthogonal unit vectors in $\mathbb{R}^d$ (i.e., a basis for $\mathbb{R}^d$), let $\lambda_1, \ldots, \lambda_d$ be real numbers, and

   $$A = \sum_{i=1}^{d} \lambda_i v_i v_i^\top. \tag{1}$$

   Show that $v_1, \ldots, v_d$ are eigenvectors of $A$ with corresponding eigenvalues $\lambda_1, \ldots, \lambda_d$.

   (b) Conversely, show that if $v_1, \ldots, v_d$ is an orthonormal system of eigenvectors and $\lambda_1, \ldots, \lambda_d$ are the corresponding eigenvalues of a symmetric matrix $A \in \mathbb{R}^{d \times d}$, then $A$ is of the form (1).
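A quick numerical sanity check of part (a) — not a proof — can be run with NumPy (assumed available): build $A$ from a random orthonormal basis and chosen $\lambda_i$, then verify $A v_i = \lambda_i v_i$.

```python
import numpy as np

# Sanity check of (a): A = sum_i lambda_i v_i v_i^T has the v_i as eigenvectors.
rng = np.random.default_rng(0)
d = 4
# QR of a random matrix yields an orthonormal basis (the columns of Q)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
lam = np.array([1.0, 2.0, 3.0, 4.0])

A = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(d))

for i in range(d):
    assert np.allclose(A @ Q[:, i], lam[i] * Q[:, i])
```

The check works for any choice of $\lambda_i$, including negative or repeated values; only the orthonormality of the $v_i$ matters.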

2. (20 points) Let $A \in \mathbb{R}^{d \times d}$ be a symmetric matrix, and for simplicity assume that all its eigenvalues are positive and distinct, $0 < \lambda_1 < \lambda_2 < \ldots < \lambda_d$. Let $v_1, v_2, \ldots, v_d$ be the corresponding (normalized) eigenvectors.

   (a) Prove that any two of the eigenvectors $v_i$ and $v_j$ (assuming $i \neq j$) are orthogonal (you may wish to compare $v_i^\top A v_j$ and $v_j^\top A v_i$).

   (b) Explain why this implies that $v_1, v_2, \ldots, v_d$ form an orthonormal basis for $\mathbb{R}^d$.

   (c) The Rayleigh quotient of $A$ is

   $$R(w) = \frac{w^\top A w}{w^\top w}, \qquad w \in \mathbb{R}^d,\; w \neq 0.$$

   Prove that the maximum of $R(w)$ is $\lambda_d$, and that the maximum is attained at $w = v_d$.
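The claim in (c) can be probed numerically (a sketch, not a proof): construct a symmetric $A$ with known positive, distinct eigenvalues, evaluate $R$ at $v_d$, and confirm that random directions never exceed $\lambda_d$.

```python
import numpy as np

# Numerical check of 2(c): R(w) = w^T A w / w^T w is maximized at v_d with value lambda_d.
rng = np.random.default_rng(1)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
lam = np.sort(rng.uniform(0.5, 5.0, d))   # 0 < lam_1 < ... < lam_d (distinct a.s.)
A = (Q * lam) @ Q.T                        # A = Q diag(lam) Q^T

def rayleigh(w):
    return (w @ A @ w) / (w @ w)

# R at the top eigenvector equals lambda_d ...
assert np.isclose(rayleigh(Q[:, -1]), lam[-1])
# ... and no random direction exceeds it
ws = rng.standard_normal((1000, d))
assert all(rayleigh(w) <= lam[-1] + 1e-9 for w in ws)
```

Note that $R$ is scale-invariant, $R(cw) = R(w)$ for $c \neq 0$, so the maximization can equivalently be restricted to unit vectors.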

3. (20 points) Recall that the empirical covariance matrix (sample covariance matrix) of a dataset $\{x_1, x_2, \ldots, x_n\}$ with $x_i \in \mathbb{R}^d$ (assuming that $\frac{1}{n}\sum_{i=1}^{n} x_i = 0$) is

   $$\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top.$$
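As a quick aside on computing this definition (assuming NumPy), the sum of outer products coincides with the single matrix product $X^\top X / n$, where $X$ is the $n \times d$ matrix whose rows are the $x_i^\top$:

```python
import numpy as np

# The empirical covariance as a sum of outer products vs. one matrix product.
rng = np.random.default_rng(2)
n, d = 500, 3
X = rng.standard_normal((n, d))
X = X - X.mean(axis=0)                    # enforce the zero-mean assumption

Sigma_hat = X.T @ X / n                   # vectorized form
Sigma_loop = sum(np.outer(x, x) for x in X) / n   # literal definition
assert np.allclose(Sigma_hat, Sigma_loop)
```

The vectorized form is the one to use in practice; the loop is shown only to match the formula term by term.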

   (a) Since $\widehat{\Sigma}$ is symmetric, it has an orthonormal basis of eigenvectors $v_1, v_2, \ldots, v_d$ with $\lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_d$, and it can be expressed as

   $$\widehat{\Sigma} = \sum_{i=1}^{d} \lambda_i v_i v_i^\top.$$

   Let $\widehat{\Sigma}^{(1)}$ be the reduced empirical covariance matrix

   $$\widehat{\Sigma}^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \big(x_i - (x_i \cdot v_d)\, v_d\big)\big(x_i - (x_i \cdot v_d)\, v_d\big)^\top.$$

   Show that $\widehat{\Sigma}^{(1)} v_d = 0$, while $\widehat{\Sigma}^{(1)} v_i = \lambda_i v_i$ for all $i < d$. What are the eigenvalues and eigenvectors of $\widehat{\Sigma}^{(1)}$ then? Use this to show that the second principal component is $v_{d-1}$.
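The deflation identity in (a) can be verified numerically on synthetic data (a sanity check, not a solution); note that `np.linalg.eigh` returns eigenvalues in ascending order, matching the ordering $\lambda_1 \leq \ldots \leq \lambda_d$ above.

```python
import numpy as np

# Check 3(a): removing the v_d component from every x_i kills the top
# eigenvalue of the covariance and leaves the other eigenpairs untouched.
rng = np.random.default_rng(3)
n, d = 1000, 4
X = rng.standard_normal((n, d)) @ np.diag([1.0, 2.0, 3.0, 4.0])
X = X - X.mean(axis=0)

Sigma = X.T @ X / n
lam, V = np.linalg.eigh(Sigma)            # ascending eigenvalues; v_d = V[:, -1]
vd = V[:, -1]

X1 = X - np.outer(X @ vd, vd)             # rows: x_i - (x_i . v_d) v_d
Sigma1 = X1.T @ X1 / n

assert np.allclose(Sigma1 @ vd, 0, atol=1e-8)
for i in range(d - 1):
    assert np.allclose(Sigma1 @ V[:, i], lam[i] * V[:, i])
```

Algebraically this is the observation that $\widehat{\Sigma}^{(1)} = (I - v_d v_d^\top)\,\widehat{\Sigma}\,(I - v_d v_d^\top)$, and the projector $I - v_d v_d^\top$ fixes every $v_i$ with $i < d$ while annihilating $v_d$.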

   (b) Use induction to show that the $k$-th principal component of the data is $v_{d-k+1}$.


4. (25 points) The file 3Ddata.txt is a dataset of 500 points in $\mathbb{R}^3$ sampled from a manifold with some added noise. The last number in each line is just an index in $\{1, 2, 3, 4\}$, related to the position of the point on the manifold, included to make the visualization prettier (for example, you can plot the points with index 1 in green, index 2 in yellow, 3 in blue, and 4 in red).

   Apply PCA and Isomap to map this data to $\mathbb{R}^2$. To construct the graph (mesh) for Isomap you can use $k = 10$ nearest neighbors. Plot the results and comment on the differences. For both methods you need to write your own code (it shouldn't be more than a few lines each) and submit it together with the write-up.
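A minimal sketch of both mappings is below, assuming NumPy and SciPy are available; the function names `pca2` and `isomap2` are illustrative, and the demo data is a synthetic noisy circle standing in for 3Ddata.txt (which you would load with e.g. `np.loadtxt`, keeping the first three columns and using the fourth for colors).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def pca2(X):
    """Project centered data onto its top two principal components."""
    Xc = X - X.mean(axis=0)
    _, V = np.linalg.eigh(Xc.T @ Xc)         # eigenvalues ascending
    return Xc @ V[:, -2:]

def isomap2(X, k=10):
    """Isomap to 2-D: k-NN graph -> graph geodesics -> classical MDS."""
    D = squareform(pdist(X))
    n = len(X)
    W = np.full((n, n), np.inf)              # inf entries = no edge
    idx = np.argsort(D, axis=1)[:, 1:k + 1]  # k nearest neighbors (self excluded)
    for i in range(n):
        W[i, idx[i]] = D[i, idx[i]]
    W = np.minimum(W, W.T)                   # symmetrize the neighbor graph
    G = shortest_path(W, method="D")         # geodesic (shortest-path) distances
    # classical MDS on the squared geodesic distances
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    lam, V = np.linalg.eigh(B)
    return V[:, -2:] * np.sqrt(lam[-2:])

# demo on a synthetic noisy circle (connected for k = 10)
rng = np.random.default_rng(4)
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(t), np.sin(t), 0.05 * rng.standard_normal(200)]
Y_pca, Y_iso = pca2(X), isomap2(X, k=10)
```

One caveat worth checking on the real data: if the k-NN graph is disconnected, some geodesic distances come out infinite and the MDS step fails, so confirm `np.all(np.isfinite(G))` before embedding.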

5. (25 points) The file train35.digits contains 2000 images of 3's and 5's from the famous MNIST database of handwritten digits, in text format. The size of each image is $28 \times 28$ pixels. Each row of the file is a representation of one image, with the $28 \times 28$ pixels flattened into a vector of size 784. A value of 1 for a pixel represents black, and a value of 0 represents white. The corresponding row of train35.labels is the class label: +1 for the digit 3, or −1 for the digit 5. The file test35.digits contains 200 testing images in the same format as train35.digits.

   Implement the perceptron algorithm and use it to label each test image in test35.digits. Submit the predicted labels in a file named test35.predictions. In the lectures, the perceptron was presented as an online algorithm. To use the perceptron as a batch algorithm, train it by simply feeding it the training set $M$ times. The value of $M$ can be expected to be less than 10, and should be set by cross-validation. Naturally, in this context, the "mistakes" made during training are not really errors. Nonetheless, it is instructive to see how the frequency of mistakes decreases as the hypothesis improves. Include in your write-up a plot of the cumulative number of "mistakes" as a function of the number of examples seen. You may find that it improves performance to normalize each example to unit norm.
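The batch training loop described above ($M$ passes, tracking cumulative mistakes for the plot) can be sketched as follows; `perceptron_train` is an illustrative name, and the separable synthetic data is only a stand-in for train35.digits / train35.labels.

```python
import numpy as np

def perceptron_train(X, y, M=5):
    """Batch perceptron: M passes over (X, y), recording cumulative mistakes."""
    w = np.zeros(X.shape[1])
    mistakes, total = [], 0        # cumulative mistake count per example seen
    for _ in range(M):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0: # mistake: update the hypothesis
                w = w + yi * xi
                total += 1
            mistakes.append(total)
    return w, mistakes

# demo on synthetic separable data with margin >= 0.2 (stand-in for MNIST 3s/5s)
rng = np.random.default_rng(5)
w_true = rng.standard_normal(50)
w_true /= np.linalg.norm(w_true)
X = rng.standard_normal((5000, 50))
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize to unit norm
m = X @ w_true
X, y = X[np.abs(m) > 0.2], np.sign(m[np.abs(m) > 0.2])
w, mistakes = perceptron_train(X, y, M=5)
train_acc = np.mean(np.sign(X @ w) == y)
```

Plotting `mistakes` against its index gives exactly the cumulative-mistakes curve requested; on unit-norm data with margin $\gamma$, the classical bound caps the total at $(1/\gamma)^2$ updates, so the curve should flatten quickly.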
