Description
1 Linear Algebra
1.1 Question 1
• Prove that $A \in \mathbb{R}^{n \times n}$ and $A^\top$ have the same eigenvalues (a sketch of a possible first step is given below).
• Let $\lambda_i$ be the eigenvalues of $M \in \mathbb{R}^{n \times n}$. Determine the eigenvalues of $\alpha M + \beta I$, where $I$ is the identity matrix and $\alpha, \beta \in \mathbb{R}$.
2 Basic Statistics
This section will help you become more familiar with probability distributions.
2.1 Question 2: Probabilities
Let $A$ and $B$ be events, and denote by $\bar{A}$ the complement of $A$. Prove the following (a sketch for the first identity is given after the list):
• $P(B \cap \bar{A}) = P(B) - P(A \cap B)$
• $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
• If $A \subset B$ then $P(A) \le P(B)$
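A possible sketch for the first identity (the other two follow from similar decompositions): since $B = (A \cap B) \cup (B \cap \bar{A})$ is a disjoint union, additivity gives
\[
P(B) = P(A \cap B) + P(B \cap \bar{A}) ,
\]
which rearranges to the claim.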
2.2 Question 3: Gaussian Distribution
The Gaussian distribution with parameters $\mu$ and $\sigma^2$ has density
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) .
\]
Prove that the mean and the variance are $\mu$ and $\sigma^2$, respectively (a sketch for the mean is given below).
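A possible first step for the mean, using the substitution $u = x - \mu$:
\[
\mathbb{E}[X] = \int_{-\infty}^{\infty} x f(x) \, dx = \int_{-\infty}^{\infty} (u + \mu) \frac{1}{\sigma\sqrt{2\pi}} e^{-u^2 / (2\sigma^2)} \, du = 0 + \mu ,
\]
since the term in $u$ is an odd integrand. The variance can be handled analogously, e.g. with integration by parts.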
2.3 Question 4: Poisson Distribution
The Poisson distribution is used to model a flow of events given only the average rate at which an event
occurs. Examples of its use include the following:
• The number of incoming requests to a server.
• The number of mutations on a strand of DNA.
• The number of cars arriving at a traffic light.
• The number of raindrops in a given area in a given time interval.
The Poisson distribution has a single parameter, the average rate $\lambda > 0$, and probability mass function
\[
f(k; \lambda) = \Pr(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} .
\]
• Compute the mean and variance of the Poisson distribution (hint: use the Taylor expansion of $e^\lambda$; a sketch for the mean is given below).
• Let $X_1 \sim \mathrm{Poisson}(\lambda_1)$ and $X_2 \sim \mathrm{Poisson}(\lambda_2)$ be independent random variables. Show that the random variable $Z = X_1 + X_2$ is Poisson-distributed and compute its mean.
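A possible sketch for the mean (the variance follows the same pattern via $\mathbb{E}[X(X-1)]$):
\[
\mathbb{E}[X] = \sum_{k=0}^{\infty} k \, \frac{\lambda^k e^{-\lambda}}{k!} = \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda} e^{\lambda} = \lambda .
\]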
2.4 Question 5: Estimators
Let $X_1, \dots, X_n$ be i.i.d. random variables with mean $\mu$ and variance $\sigma^2$.
• Prove that
\[
M_n = \frac{1}{n} \sum_{i=1}^{n} X_i , \qquad S_n = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - M_n)^2 ,
\]
are unbiased estimators of the mean and the variance, that is, $\mathbb{E}[M_n] = \mu$ and $\mathbb{E}[S_n] = \sigma^2$.
• For a distribution of your choice (Gaussian, uniform, etc.) implement these estimators and show empirically, on a plot, that they indeed converge to the true quantities $\mu$ and $\sigma^2$ as the sample size $n$ increases. Based on your plots, guess how fast (in terms of powers of $n$) they converge to the correct quantities. A sketch of such an experiment follows below.
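A minimal Octave/Matlab sketch of such an experiment, assuming a Gaussian ground truth with $\mu = 2$ and $\sigma^2 = 4$ (both values are illustrative, not prescribed by the assignment):

% Convergence of the sample mean M_n and the unbiased sample variance S_n.
mu = 2; sigma2 = 4;                      % assumed true parameters
ns = round(logspace(1, 5, 20));          % sample sizes to test
err_M = zeros(size(ns)); err_S = zeros(size(ns));
for j = 1:numel(ns)
    n = ns(j);
    X = mu + sqrt(sigma2) * randn(n, 1); % n i.i.d. Gaussian samples
    Mn = sum(X) / n;                     % estimate of the mean
    Sn = sum((X - Mn).^2) / (n - 1);     % unbiased estimate of the variance
    err_M(j) = abs(Mn - mu);
    err_S(j) = abs(Sn - sigma2);
end
loglog(ns, err_M, 'o-', ns, err_S, 's-');
xlabel('sample size n'); ylabel('absolute error');
legend('|M_n - \mu|', '|S_n - \sigma^2|');

On a log-log plot the error curves look roughly linear, which makes it easier to read off the power of $n$ that governs the convergence.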
3 (Agnostic) PAC Learning
3.1 Question 6
Let $D$ be a distribution over $X \times \{0, 1\}$, and let $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$ be a random sample from $D$. Let
\[
L_D(h) = \mathbb{P}_{(x,y) \sim D}[h(x) \neq y] , \qquad L_S(h) = \frac{1}{m} \, |\{ i : h(x_i) \neq y_i \}| .
\]
Let $h_S$ and $h^*$ be the hypotheses in $F$ with minimum training and generalization error, respectively:
\[
h_S = \arg\min_{h \in F} L_S(h) , \qquad h^* = \arg\min_{h \in F} L_D(h) .
\]
Be sure to keep in mind that, unlike $h^*$, $h_S$ is a random variable that depends on the random sample $S$.
• Prove that
\[
\mathbb{E}[L_S(h_S)] \le L_D(h^*) \le \mathbb{E}[L_D(h_S)] .
\]
• We now make use of a stronger concentration inequality than the one used in class.
Theorem 1 (McDiarmid's inequality). Let $f$ be a function such that
\[
\forall x_1, \dots, x_m, x_i' : \; |f(x_1, \dots, x_i, \dots, x_m) - f(x_1, \dots, x_i', \dots, x_m)| \le c_i ,
\]
and let $X_1, \dots, X_m$ be independent, but not necessarily identically distributed, random variables. Then the following holds:
\[
\mathbb{P}\left[ |f(X_1, \dots, X_m) - \mathbb{E}[f(X_1, \dots, X_m)]| \ge \epsilon \right] \le 2 \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2} \right) .
\]
Using this theorem, prove that with probability at least $1 - \delta$,
\[
|L_S(h_S) - \mathbb{E}[L_S(h_S)]| \le O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right) ,
\]
where the constants hidden in the big-O notation do not depend on $|F|$ (a possible route is sketched after this list).
• Explain in words the meaning of what you proved and how we would expect training error to compare
to test error when using a machine learning algorithm on actual data.
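A possible route for the second item, assuming one first verifies the bounded-differences condition: replacing a single sample $(x_i, y_i)$ changes $L_S(h)$ by at most $1/m$ for every fixed $h$, and hence changes $L_S(h_S) = \min_{h \in F} L_S(h)$ by at most $c_i = 1/m$. McDiarmid's inequality then gives
\[
\mathbb{P}\big[ |L_S(h_S) - \mathbb{E}[L_S(h_S)]| \ge \epsilon \big] \le 2 \exp\left( \frac{-2\epsilon^2}{m \cdot (1/m)^2} \right) = 2 e^{-2 m \epsilon^2} ,
\]
and setting the right-hand side equal to $\delta$ and solving for $\epsilon$ yields the claimed $O\big(\sqrt{\ln(1/\delta)/m}\big)$ bound.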
3.2 Question 7
Let $F$ be a hypothesis class of binary classifiers. Show that if $F$ is agnostic PAC learnable, then $F$ is PAC learnable as well. Furthermore, show that if $A$ is a successful agnostic PAC learner for $F$, then $A$ is also a successful PAC learner for $F$ (a possible observation is sketched below).
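A one-line observation that may help: in the realizable (PAC) setting there exists $h \in F$ with $L_D(h) = 0$, so the agnostic guarantee
\[
L_D(A(S)) \le \min_{h \in F} L_D(h) + \epsilon = 0 + \epsilon
\]
collapses to the PAC guarantee.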
4 Least Squares Regression
4.1 Question 8
Here you will code a Matlab (or Octave) function that implements the Least Squares Regression algorithm. The input to the algorithm is the training set $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, for $i = 1, \dots, m$. The output is the hyperplane that solves the least squares problem, $w^* \in \mathbb{R}^d$. That is,
\[
\min_{w, w_0} \sum_{i=1}^{m} \big( y_i - (w_0 + \langle w, x_i \rangle) \big)^2 .
\]
• The prototype of the function must be
[w,w_0] = train_ls(X,y,bias)
where $X \in \mathbb{R}^{m \times d}$ is the matrix of the $m$ input vectors in $d$ dimensions, i.e., each input $x_i$ is a row of $X$, and $y$ is a column vector with $m$ entries containing the labels associated with the training samples. The flag 'bias' signals whether a bias $w_0$ must be used. If no bias is used, the returned value of $w_0$ is 0. The code must be robust to the case where the matrix $X^\top X$ is not invertible (a sketch is given after this list).
• Implement another version, with prototype
w=incremental_train_ls(X,y)
that constructs the solution incrementally. This version does not have the bias term $w_0$. This implementation must use the Sherman-Morrison formula to update the inverse of the matrix $X^\top X$:
\[
(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u} .
\]
• Verify numerically that the solutions of the two algorithms on a random training set are the same.
• Discuss the computational complexity of both versions and compare them.
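A minimal Octave/Matlab sketch of both functions (each would live in its own .m file; the use of pinv for robustness and the small ridge term eps0 in the incremental version are implementation assumptions, not requirements of the assignment):

function [w, w_0] = train_ls(X, y, bias)
    % Least squares via the pseudoinverse, which stays well defined
    % even when X' * X is singular.
    m = size(X, 1);
    if bias
        Xb = [ones(m, 1), X];            % prepend a column of ones for w_0
        v = pinv(Xb' * Xb) * (Xb' * y);
        w_0 = v(1);
        w = v(2:end);
    else
        w = pinv(X' * X) * (X' * y);
        w_0 = 0;
    end
end

function w = incremental_train_ls(X, y)
    % Incremental version: maintain Ainv = inv(X' * X) one sample at a
    % time via the Sherman-Morrison formula (here u = v = x_i).
    % Assumption: a small ridge eps0 * I makes the initial inverse
    % well defined.
    d = size(X, 2);
    eps0 = 1e-8;
    Ainv = eye(d) / eps0;                % inv(eps0 * I)
    b = zeros(d, 1);
    for i = 1:size(X, 1)
        u = X(i, :)';                    % new sample as a column vector
        Ainv = Ainv - (Ainv * u) * (u' * Ainv) / (1 + u' * Ainv * u);
        b = b + y(i) * u;                % accumulate X' * y
    end
    w = Ainv * b;
end

For a quick numerical check, e.g. X = randn(100, 5); y = randn(100, 1);, the outputs of train_ls(X, y, false) and incremental_train_ls(X, y) should agree up to the small ridge perturbation.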
4.2 Question 9
In this question you will implement linear regression using polynomial basis functions. Use only monomials of a single variable, e.g. $(x_1, x_1^2, x_1^3, \dots, x_2, x_2^2, \dots)$, and no cross-terms, e.g. $(x_1 x_2, x_1^2 x_2, \dots)$.
• Implement a Matlab function that normalizes all the input data attributes to lie between -1 and 1 (a sketch is given at the end of this section).
[X_train_norm, X_test_norm] = normalizeAll(X_train, X_test)
Note that only the training data can be used for learning the normalization transformation. This is used to prevent numerical instability when dealing with numbers that are too big or too small.
• Implement a function that generates a matrix of input samples that contains the powers of each feature
from 1 to k
[X_poly] = generate_poly_features(X,k)
• Use the training and test data provided in the file “cadata.mat”. They correspond to a random train/test split of the Housing dataset from the UCI repository. The task is to predict the median house value from features describing a town. Use the code you wrote above to generate polynomial features for $k = 1$ to $k = 5$, normalize the features, and train 5 different LS regressors (with bias $w_0$) with these features. Plot the training error and the test error for each value of $k$ and discuss the results.
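A minimal Octave/Matlab sketch of the two helper functions (the guard against constant features in normalizeAll is an implementation assumption):

function [X_train_norm, X_test_norm] = normalizeAll(X_train, X_test)
    % Rescale each attribute to [-1, 1]; the transformation is fit on
    % the training data only and then applied to both sets.
    mn = min(X_train, [], 1);
    rg = max(X_train, [], 1) - mn;
    rg(rg == 0) = 1;                     % guard against constant features
    X_train_norm = 2 * bsxfun(@rdivide, bsxfun(@minus, X_train, mn), rg) - 1;
    X_test_norm  = 2 * bsxfun(@rdivide, bsxfun(@minus, X_test,  mn), rg) - 1;
end

function X_poly = generate_poly_features(X, k)
    % Powers 1..k of every feature, with no cross-terms.
    X_poly = [];
    for p = 1:k
        X_poly = [X_poly, X.^p];         % append the p-th powers
    end
end

Note that test attributes normalized this way may fall slightly outside [-1, 1], since the range is estimated from the training data alone.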