Homework #3 CSE 446: Machine Learning


Classification

1. [5 points] Suppose we run the Perceptron algorithm with $w$ initialized to an arbitrary unit vector. Suppose that the algorithm is then given the same vector $x$ (whose label is 1) over and over again. How many mistakes can it make in the worst case? Express your answer in terms of $\|x\|_2$. (Hint: derive the result from first principles; do not attempt to use the general result we proved in class.)

2. Consider using a linear decision boundary for classification (labels in $\{-1, 1\}$) of the form $w \cdot x = 0$ (i.e., no offset). Now consider the following loss function, evaluated at a data point $(x, y)$, which is a variant of the hinge loss:
$$\ell((x, y), w) = \max(0, -y(w \cdot x)).$$

a. [2 points] Given a dataset of $(x_i, y_i)$ pairs, write down a single step of subgradient descent with a step size of $\eta$ if we are trying to minimize
$$\frac{1}{n}\sum_{i=1}^n \ell((x_i, y_i), w)$$
for $\ell(\cdot)$ defined as above. That is, given a current iterate $w_t$, what is an expression for the next iterate? (A hedged reference form is sketched after part c.)

b. [2 points] Use what you derived to argue that the Perceptron (as formulated in class) can be viewed as implementing SGD applied to the loss function just described. For what value of $\eta$?

c. [1 point] Suppose your data were drawn i.i.d. and that there exists a $w^*$ that separates the two classes perfectly. Explain why the hinge loss is generally preferred over the loss given above.
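
For orientation on part a (a hedged reference sketch, not the graded derivation): taking the subgradient of $\ell$ at $w_t$ to be $-y_i x_i$ when $y_i(w_t \cdot x_i) < 0$ and $0$ otherwise (choosing the zero subgradient at the kink), one valid subgradient step is
$$w_{t+1} = w_t + \frac{\eta}{n}\sum_{i=1}^{n} y_i\, x_i\, \mathbf{1}\{y_i (w_t \cdot x_i) < 0\}.$$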

3. We’ve talked a lot about binary classification, but what if we have $k > 2$ classes, like the 10 digits of MNIST? Concretely, suppose you have a dataset $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, k\}$. Like in our least squares classifier of homework 1 for MNIST, we will assign a separate weight vector $w^{(\ell)}$ for each class $\ell = 1, \dots, k$; let $W = [w^{(1)}, \dots, w^{(k)}] \in \mathbb{R}^{d \times k}$. We can generalize the binary classification probabilistic model to multiple classes as follows: let
$$P_W(y_i = \ell \mid W, x_i) = \frac{\exp(w^{(\ell)} \cdot x_i)}{\sum_{j=1}^k \exp(w^{(j)} \cdot x_i)}.$$
The negative log-likelihood function is equal to
$$\mathcal{L}(W) = -\sum_{i=1}^n \sum_{\ell=1}^k \mathbf{1}\{y_i = \ell\} \log\left(\frac{\exp(w^{(\ell)} \cdot x_i)}{\sum_{j=1}^k \exp(w^{(j)} \cdot x_i)}\right).$$
Define the $\mathrm{softmax}(\cdot)$ operator to be the function that takes in a vector $\theta \in \mathbb{R}^d$ and outputs a vector in $\mathbb{R}^d$ whose $i$th component is equal to $\frac{\exp(\theta_i)}{\sum_{j=1}^d \exp(\theta_j)}$. Clearly, this vector is nonnegative and sums to one. If for any $i$ we have $\theta_i \gg \max_{j \neq i} \theta_j$, then $\mathrm{softmax}(\theta)$ approximates $e_i$, a vector of all zeros with a one in the $i$th component. For each $y_i$ let $\mathbf{y}_i$ be the one-hot encoding of $y_i$ (i.e., $\mathbf{y}_i \in \{0, 1\}^k$ is a vector of all zeros aside from a 1 in the $y_i$th index).

a. [5 points] If $\widehat{y}_i^{(W)} = \mathrm{softmax}(W^\top x_i)$, show that $\nabla_W \mathcal{L}(W) = -\sum_{i=1}^n x_i (\mathbf{y}_i - \widehat{y}_i^{(W)})^\top$.
b. [5 points] Recall problem 6 of Homework 1 (Ridge Regression on MNIST) and define $J(W) = \frac{1}{2}\sum_{i=1}^n \|\mathbf{y}_i - W^\top x_i\|_2^2$. If $\widetilde{y}_i^{(W)} = W^\top x_i$, show that $\nabla_W J(W) = -\sum_{i=1}^n x_i (\mathbf{y}_i - \widetilde{y}_i^{(W)})^\top$. Comparing the least squares linear regression gradient step of this part to the gradient step of minimizing the negative log-likelihood of the logistic model of part a may shed light on why we call this classification problem logistic regression.

c. [15 points] Using the original representations of the MNIST flattened images $x_i \in \mathbb{R}^d$ ($d = 28 \times 28 = 784$) and all $k = 10$ classes, implement gradient descent (or stochastic gradient descent) for both $J(W)$ and $\mathcal{L}(W)$ and run until convergence on the training set of MNIST. For each of the two solutions, report the classification accuracy on the training and test sets using the most natural classification rule, $\arg\max_j e_j^\top W^\top x_i$.
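
A minimal sketch of part c, assuming the MNIST images are already loaded as float arrays X_train, X_test (scaled to [0, 1]) with integer label arrays labels_train, labels_test; the variable names, learning rate, and iteration count are illustrative placeholders, not part of the assignment:

```python
import numpy as np

def one_hot(labels, k=10):
    Y = np.zeros((labels.size, k))
    Y[np.arange(labels.size), labels] = 1.0
    return Y

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train(X, Y, loss="logistic", lr=1e-4, iters=500):
    # X: (n, d) with pixels scaled to [0, 1]; Y: (n, k) one-hot labels
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(iters):
        if loss == "logistic":
            # gradient of L(W): -sum_i x_i (y_i - softmax(W^T x_i))^T
            grad = -X.T @ (Y - softmax(X @ W))
        else:
            # gradient of J(W): -sum_i x_i (y_i - W^T x_i)^T
            grad = -X.T @ (Y - X @ W)
        W -= lr * grad
    return W

def accuracy(W, X, labels):
    return np.mean(np.argmax(X @ W, axis=1) == labels)

# Illustrative usage (assumes X_train, labels_train, X_test, labels_test exist):
# W_ls  = train(X_train, one_hot(labels_train), loss="least_squares")
# W_log = train(X_train, one_hot(labels_train), loss="logistic")
# print(accuracy(W_log, X_test, labels_test), accuracy(W_ls, X_test, labels_test))
```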
Kernels

4. [5 points] Suppose that our inputs are one dimensional and that our feature map is infinite dimensional: $\phi(x)$ is the vector whose $i$th component is
$$\frac{1}{\sqrt{i!}}\, e^{-x^2/2}\, x^i$$
for all nonnegative integers $i$. (Thus, $\phi$ is an infinite dimensional vector.) Show that $K(x, x') = e^{-(x - x')^2/2}$ is a kernel function for this feature map, i.e.,
$$\phi(x) \cdot \phi(x') = e^{-(x - x')^2/2}.$$
Hint: Use the Taylor expansion of $e^z$. (This is the one dimensional version of the Gaussian (RBF) kernel.)
5. This problem will get you familiar with kernel ridge regression using the polynomial and RBF kernels. First let's generate some data. Let $n = 30$ and $f^*(x) = 4\sin(\pi x)\cos(6\pi x^2)$. For $i = 1, \dots, n$ let each $x_i$ be drawn uniformly at random on $[0, 1]$ and $y_i = f^*(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, 1)$. For any function $f$, the true error is defined as
$$\mathcal{E}_{\mathrm{true}}(f) = \mathbb{E}_{XY}[(f(X) - Y)^2],$$
whereas the training error is defined as
$$\widehat{\mathcal{E}}_{\mathrm{train}}(f) = \frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2.$$
Using kernel ridge regression, construct a predictor
$$\widehat{\alpha} = \arg\min_{\alpha} \|K\alpha - y\|^2 + \lambda \alpha^\top K \alpha, \qquad \widehat{f}(x) = \sum_{i=1}^n \widehat{\alpha}_i k(x_i, x),$$
where $K_{i,j} = k(x_i, x_j)$ is a kernel evaluation and $\lambda$ is the regularization constant.

a. [10 points] Using leave-one-out cross validation, find a good $\lambda$ and hyperparameter settings for the following kernels:
• $k_{\mathrm{poly}}(x, z) = (1 + x^\top z)^d$ where $d \in \mathbb{N}$ is a hyperparameter,
• $k_{\mathrm{rbf}}(x, z) = \exp(-\gamma \|x - z\|^2)$ where $\gamma > 0$ is a hyperparameter¹.
Report the values of $d$, $\gamma$, and the $\lambda$ values for both kernels.
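
A minimal sketch of kernel ridge regression with leave-one-out cross validation for part a, assuming x and y are numpy arrays generated as described above; the hyperparameter grids in the commented usage are illustrative placeholders:

```python
import numpy as np

def k_poly(X, Z, d):
    # polynomial kernel on 1-d inputs: (1 + x z)^d
    return (1.0 + np.outer(X, Z)) ** d

def k_rbf(X, Z, gamma):
    # RBF kernel on 1-d inputs: exp(-gamma (x - z)^2)
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

def krr_fit(K, y, lam):
    # argmin_a ||K a - y||^2 + lam a^T K a has closed form (K + lam I)^{-1} y
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def loo_error(x, y, kernel, lam):
    # leave-one-out CV: hold out each point in turn, refit, predict it
    n = len(x)
    errs = []
    for i in range(n):
        tr = np.delete(np.arange(n), i)
        alpha = krr_fit(kernel(x[tr], x[tr]), y[tr], lam)
        pred = kernel(x[i:i + 1], x[tr]) @ alpha
        errs.append((pred[0] - y[i]) ** 2)
    return np.mean(errs)

# Illustrative grid search (grids are placeholders, not prescribed values):
# for gamma in [1, 5, 10, 20, 40]:
#     for lam in 10.0 ** np.arange(-5, 1):
#         print(gamma, lam, loo_error(x, y, lambda A, B: k_rbf(A, B, gamma), lam))
```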
b. [10 points] Let $\widehat{f}_{\mathrm{poly}}(x)$ and $\widehat{f}_{\mathrm{rbf}}(x)$ be the functions learned using the hyperparameters you found in part a. For a single plot per function $\widehat{f} \in \{\widehat{f}_{\mathrm{poly}}, \widehat{f}_{\mathrm{rbf}}\}$, plot the original data $\{(x_i, y_i)\}_{i=1}^n$, the true $f^*(x)$, and $\widehat{f}(x)$ (i.e., define a fine grid on $[0, 1]$ to plot the functions).
c. [5 points] We wish to build bootstrap percentile confidence intervals for $\widehat{f}_{\mathrm{poly}}(x)$ and $\widehat{f}_{\mathrm{rbf}}(x)$ for all $x \in [0, 1]$ from part b. Use the non-parametric bootstrap with $B = 300$ datasets to find 5% and 95% percentiles at each point $x$ on a fine grid over $[0, 1]$ (see Hastie, Tibshirani, Friedman Ch. 8.2 for a review). Specifically, for each dataset $b \in \{1, \dots, B\}$, draw uniformly at random with replacement $n$ samples from $\{(x_i, y_i)\}_{i=1}^n$, train an $\widehat{f}_b$ using the $b$th resampled dataset, and compute $\widehat{f}_b(x)$ for each $x$ in your fine grid; let the 5th percentile at point $x$ be the largest value $\nu$ such that $\frac{1}{B}\sum_{b=1}^B \mathbf{1}\{\widehat{f}_b(x) \leq \nu\} \leq 0.05$, and define the 95th percentile analogously. Plot the 5th and 95th percentile curves on the plots from part b.
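
A hedged sketch of the percentile bands in part c; it reuses the hypothetical krr_fit and kernel helpers from the sketch above and assumes the hyperparameters (lam and the kernel settings) have already been chosen in part a:

```python
import numpy as np

def bootstrap_bands(x, y, kernel, lam, grid, B=300, seed=0):
    # grid: numpy array of evaluation points on [0, 1]
    rng = np.random.default_rng(seed)
    n = len(x)
    preds = np.empty((B, len(grid)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        alpha = krr_fit(kernel(x[idx], x[idx]), y[idx], lam)
        preds[b] = kernel(grid, x[idx]) @ alpha   # \hat f_b evaluated on the grid
    # empirical 5th/95th percentiles at each grid point (matches the definition above up to ties)
    return np.percentile(preds, 5, axis=0), np.percentile(preds, 95, axis=0)
```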

d. [5 points] Repeat all parts of this problem with $n = 300$ (you may use 10-fold cross validation instead of leave-one-out).

e. [5 points] For this problem, use the $\widehat{f}_{\mathrm{poly}}(x)$ and $\widehat{f}_{\mathrm{rbf}}(x)$ learned in part d. Suppose $m = 1000$ additional samples $(x'_1, y'_1), \dots, (x'_m, y'_m)$ are drawn i.i.d. the same way the first $n$ samples were drawn. Use the non-parametric bootstrap with $B = 300$ to construct a confidence interval on $\mathbb{E}[(Y - \widehat{f}_{\mathrm{poly}}(X))^2 - (Y - \widehat{f}_{\mathrm{rbf}}(X))^2]$ (i.e., randomly draw with replacement $m$ samples denoted $\{(\widetilde{x}'_i, \widetilde{y}'_i)\}_{i=1}^m$ from $\{(x'_i, y'_i)\}_{i=1}^m$, compute
$$\frac{1}{m}\sum_{i=1}^m \left[(\widetilde{y}'_i - \widehat{f}_{\mathrm{poly}}(\widetilde{x}'_i))^2 - (\widetilde{y}'_i - \widehat{f}_{\mathrm{rbf}}(\widetilde{x}'_i))^2\right],$$
and repeat this $B$ times), then find 5% and 95% percentiles.

Using this confidence interval, is there statistically significant evidence to suggest that one of $\widehat{f}_{\mathrm{rbf}}$ and $\widehat{f}_{\mathrm{poly}}$ is better than the other at predicting $Y$ from $X$? (Hint: does the confidence interval contain 0?)
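
A hedged sketch of the bootstrap in part e, assuming f_poly and f_rbf are the fitted prediction functions from part d (callables on arrays) and x_new, y_new hold the m additional samples; all names are placeholders:

```python
import numpy as np

def bootstrap_diff_ci(x_new, y_new, f_poly, f_rbf, B=300, seed=0):
    # CI on E[(Y - f_poly(X))^2 - (Y - f_rbf(X))^2] via the paired bootstrap
    rng = np.random.default_rng(seed)
    m = len(x_new)
    diffs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, m, size=m)  # resample the m fresh points with replacement
        d = (y_new[idx] - f_poly(x_new[idx])) ** 2 - (y_new[idx] - f_rbf(x_new[idx])) ** 2
        diffs[b] = d.mean()
    return np.percentile(diffs, 5), np.percentile(diffs, 95)
```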
Context: With binary labels like in the extra credit below, we can easily compute means and variances and appeal to the CLT to compute confidence intervals. However, in almost all other settings, computing nearly-exact confidence intervals is impossible because there is information we simply do not have. The bootstrap gives us the ability to construct confidence intervals on nearly any quantity of interest with minimal side knowledge.

k-means clustering

6. Given a dataset $x_1, \dots, x_n \in \mathbb{R}^d$ and an integer $1 \leq k \leq n$, recall the following k-means objective function:
$$\min_{\pi_1, \dots, \pi_k} \sum_{i=1}^k \sum_{j \in \pi_i} \|x_j - \mu_i\|_2^2, \qquad \mu_i = \frac{1}{|\pi_i|}\sum_{j \in \pi_i} x_j. \tag{1}$$
Above, $\{\pi_i\}_{i=1}^k$ is a partition of $\{1, 2, \dots, n\}$. Finding a global minimizer of the objective (1) is NP-hard².

Nevertheless, the commonly used heuristic which we discussed in lecture, known as Lloyd's algorithm, typically works well in practice. Implement Lloyd's algorithm for solving the k-means objective (1). Do not use any off-the-shelf implementations, such as those found in scikit-learn.
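
A minimal sketch of Lloyd's algorithm, assuming the data is an (n, d) numpy array; the random-point initialization and the stopping rule are illustrative choices, not requirements of the assignment:

```python
import numpy as np

def lloyds(X, k, iters=100, seed=0):
    # X: (n, d) data matrix; returns centers, assignments, and the objective per iteration
    X = np.asarray(X, dtype=float)  # avoid integer overflow with raw MNIST pixels
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init at random data points
    objective = []
    for _ in range(iters):
        # assignment step: squared distance from every point to every center
        # (for very large k this (n, k) matrix may need to be computed in batches)
        d2 = (X ** 2).sum(1)[:, None] - 2 * X @ centers.T + (centers ** 2).sum(1)[None, :]
        assign = d2.argmin(axis=1)
        objective.append(d2[np.arange(len(X)), assign].sum())
        # update step: each center becomes the mean of its cluster (empty clusters stay put)
        new_centers = centers.copy()
        for j in range(k):
            if np.any(assign == j):
                new_centers[j] = X[assign == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign, objective
```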

a. [5 points] Run the algorithm on the training dataset of MNIST with $k = 10$, plotting the objective function (1) as a function of iteration. Visualize (and include in your report) the cluster centers as $28 \times 28$ images.
b. [5 points] For $k \in \{2, 5, 10, 20, 40, 80, 160, 320, 640, 1280\}$, run the algorithm on the training dataset to obtain centers $\{\mu_i\}_{i=1}^k$. If $\{(x_i, y_i)\}_{i=1}^n$ and $\{(x'_i, y'_i)\}_{i=1}^m$ denote the training and test sets, respectively, plot the training error $\frac{1}{n}\sum_{i=1}^n \min_{j=1,\dots,k} \|\mu_j - x_i\|_2^2$ and the test error $\frac{1}{m}\sum_{i=1}^m \min_{j=1,\dots,k} \|\mu_j - x'_i\|_2^2$ as a function of $k$ on the same plot. (If the large values of $k$ are taking unreasonably long to compute, just go up to the largest value of $k$ you can.)

¹Given a dataset $x_1, \dots, x_n \in \mathbb{R}^d$, a heuristic for choosing a range of $\gamma$ in the right ballpark is the inverse of the median of all $\binom{n}{2}$ squared distances $\|x_i - x_j\|_2^2$.

²To be more precise, it is NP-hard in $d$ when $k = 2$ and NP-hard in $k$ when $d = 2$. See the references on the Wikipedia page for k-means for more details.

Multivariate Gaussians

7. For a matrix $A \in \mathbb{R}^{n \times n}$, we denote by $|A|$ the determinant of $A$. A multivariate Gaussian with mean $\mu \in \mathbb{R}^n$ and covariance $\Sigma \in \mathbb{R}^{n \times n}$ has probability density function
$$p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right),$$
which we denote as $\mathcal{N}(\mu, \Sigma)$. For background on multivariate Gaussians, see Murphy 2.5.2, 2.6.1, and 4.1. Let
• $\mu_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ and $\Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$
• $\mu_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$ and $\Sigma_2 = \begin{bmatrix} 2 & -1.8 \\ -1.8 & 2 \end{bmatrix}$
• $\mu_3 = \begin{bmatrix} 2 \\ -2 \end{bmatrix}$ and $\Sigma_3 = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}$
For each i = 1, 2, 3 on a separate plot:
a. [5 points] Draw $n = 100$ points $X_{i,1}, \dots, X_{i,n} \sim \mathcal{N}(\mu_i, \Sigma_i)$ and plot the points as a scatter plot with each point as a triangle marker. (Hint: use numpy.random.randn(d) to generate a $d$-dimensional vector $Z \sim \mathcal{N}(0, I)$, then use the fact that $AZ + b \sim \mathcal{N}(b, AA^\top)$. Be careful: if a symmetric $A$ satisfies $A^2 = \Sigma$, then $A$ is the matrix square root of $\Sigma$, which in general is not the entrywise square root of $\Sigma$. See the Murphy references above. A hedged sketch covering parts a and b appears after part c.)
b. [5 points] Compute the sample mean and covariance matrices $\widehat{\mu}_i = \frac{1}{n}\sum_{j=1}^n X_{i,j}$ and $\widehat{\Sigma}_i = \frac{1}{n-1}\sum_{j=1}^n (X_{i,j} - \widehat{\mu}_i)(X_{i,j} - \widehat{\mu}_i)^\top$. Compute the eigenvectors of $\widehat{\Sigma}_i$. Plot the unit-norm eigenvectors as line segments originating from $\widehat{\mu}_i$ with magnitudes equal to the square roots of their corresponding eigenvalues. Make sure your axes are square (e.g., the x and y limits are both $[-r, r]$ for some appropriately large $r > 0$ so you can see the data). (Hint: see numpy.linalg.eig.)
c. [5 points] If $(u_{i,1}, \lambda_{i,1})$ and $(u_{i,2}, \lambda_{i,2})$ are the eigenvector-eigenvalue pairs of the sample covariance matrix with $\lambda_{i,1} \geq \lambda_{i,2}$ and $\|u_{i,1}\|_2 = \|u_{i,2}\|_2 = 1$, for $j = 1, \dots, n$ let
$$\widetilde{X}_{i,j} = \begin{bmatrix} \frac{1}{\sqrt{\lambda_{i,1}}}\, u_{i,1}^\top (X_{i,j} - \widehat{\mu}_i) \\ \frac{1}{\sqrt{\lambda_{i,2}}}\, u_{i,2}^\top (X_{i,j} - \widehat{\mu}_i) \end{bmatrix}.$$
On a new figure, plot these new points as a scatter plot with each point as a circle marker. Make sure your axes are square.
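
A hedged sketch covering parts a and b for a single $(\mu_i, \Sigma_i)$ pair, using an eigendecomposition to form the symmetric matrix square root; the matplotlib styling is illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

def sample_gaussian(mu, Sigma, n=100, seed=0):
    rng = np.random.default_rng(seed)
    # symmetric matrix square root A with A @ A = Sigma (via eigendecomposition)
    vals, vecs = np.linalg.eigh(Sigma)
    A = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    Z = rng.standard_normal((n, len(mu)))
    return Z @ A.T + mu  # each row ~ N(mu, Sigma) since A Z + mu has covariance A A^T

mu1, Sigma1 = np.array([1.0, 2.0]), np.array([[1.0, 0.0], [0.0, 2.0]])
X = sample_gaussian(mu1, Sigma1)

mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / (len(X) - 1)
vals, vecs = np.linalg.eigh(Sigma_hat)

plt.scatter(X[:, 0], X[:, 1], marker="^")
for lam, u in zip(vals, vecs.T):
    # eigenvector segment from the sample mean, scaled by sqrt(eigenvalue)
    seg = np.vstack([mu_hat, mu_hat + np.sqrt(lam) * u])
    plt.plot(seg[:, 0], seg[:, 1])
plt.axis("square")  # or set symmetric [-r, r] limits explicitly
plt.show()
```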

Extra credit: Statistically Significant Improvement

8. [2 points] (Extra Credit) This problem explores the connection between quantiles (or confidence intervals) and hypothesis testing. See Murphy 2.2.6 and any introductory statistics text for background on hypothesis testing³. Let $X \sim \mathcal{N}(\mu, 1)$ for some $\mu \in \mathbb{R}$. Consider the hypothesis test
$$H_0: \mu = 0 \qquad\qquad H_1: \mu \neq 0$$
so that under $H_0$ we have $\mathbb{E}[X] = \mu = 0$, and under $H_1$ we have $\mathbb{E}[X] = \mu \neq 0$. Define $z_{\alpha/2}$ to be the unique number such that $\int_{z_{\alpha/2}}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = \alpha/2$. We say we reject the null hypothesis $H_0$ in favor of $H_1$ if $|X| \geq z_{\alpha/2}$.

a. Under $H_0$, show that the probability of rejecting the null hypothesis is less than or equal to $\alpha$.
b. If we do not reject the null hypothesis (i.e., $|X| < z_{\alpha/2}$), is this evidence to support that $H_0$ is true? Justify your answer.

³There are many resources online as well, e.g., https://machinelearningmastery.com/statistical-hypothesis-tests/

9. [8 points] (Extra Credit) Suppose you have two models $f_1$ and $f_2$ that predict $Y$ from $X$, where $(X, Y)$ follow a joint distribution $P$ (e.g., $X \in \mathbb{R}^d$, $Y \in \{-1, 1\}$). For $j = 1, 2$ define
$$p_j = P_{XY}(f_j(X) \neq Y), \qquad \rho = \frac{\mathbb{E}_{XY}\left[(\mathbf{1}\{f_1(X) \neq Y\} - p_1)(\mathbf{1}\{f_2(X) \neq Y\} - p_2)\right]}{\sqrt{p_1(1 - p_1)\,p_2(1 - p_2)}},$$
noting that each variance $\mathbb{E}_{XY}[(\mathbf{1}\{f_j(X) \neq Y\} - p_j)^2] = p_j(1 - p_j)$. The quantity $\rho \in [-1, 1]$ determines the degree to which the predictions of $f_1$ and $f_2$ are correlated. You wish to determine whether one model or the other is better at predicting $Y$ from $X$ in a statistically significant way. That is, you are faced with the hypothesis test
$$H_0: p_1 = p_2 \qquad\qquad H_1: p_1 \neq p_2.$$
We will leverage our result from problem 8 to construct a hypothesis test such that under $H_0$, the null hypothesis $H_0$ is rejected in favor of $H_1$ with probability less than $\alpha$.
Let $(x_1, y_1), \dots, (x_n, y_n) \overset{\text{iid}}{\sim} P$ and $(x'_1, y'_1), \dots, (x'_m, y'_m) \overset{\text{iid}}{\sim} P$ be two independent fresh samples from $P$ of size $n$ and $m$, respectively. For $j = 1, 2$ define
$$\widehat{p}_j = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{f_j(x_i) \neq y_i\}, \qquad \widehat{p}'_j = \frac{1}{m}\sum_{i=1}^m \mathbf{1}\{f_j(x'_i) \neq y'_i\}.$$
a. (Means) Show that $\mathbb{E}[\widehat{p}_j] = \mathbb{E}[\widehat{p}'_j] = p_j$ for $j = 1, 2$.
b. (Variances) Show that $\mathbb{E}[(\widehat{p}_j - p_j)^2] = \frac{p_j(1 - p_j)}{n}$ and $\mathbb{E}[(\widehat{p}'_j - p_j)^2] = \frac{p_j(1 - p_j)}{m}$ for $j = 1, 2$.
c. (Empirical Variance) Argue briefly that $\widehat{p}_j(1 - \widehat{p}_j) \to p_j(1 - p_j)$ as $n \to \infty$ by appealing to the Law of Large Numbers (LLN).
d. (Independent Test statistic) Define $\theta := \frac{\widehat{p}_1 - \widehat{p}'_2}{\sqrt{\widehat{p}_1(1 - \widehat{p}_1)/n + \widehat{p}'_2(1 - \widehat{p}'_2)/m}}$. Using the approximations of part c (for example, replace $\widehat{p}_1(1 - \widehat{p}_1)$ with $p_1(1 - p_1)$), argue briefly that $\theta \sim \mathcal{N}(0, 1)$ under the null hypothesis $H_0$ (i.e., $p_1 = p_2$) as $n, m \to \infty$ by appealing to the central limit theorem (CLT). Note by the above that this implies the probability of $|\theta| \geq z_{\alpha/2}$ is no greater than $\alpha$. (Hint: what are $\mathbb{E}[\theta]$ and $\mathbb{E}[(\theta - \mathbb{E}[\theta])^2]$ as $n, m \to \infty$?)
e. (Dependent Test statistic) Define $\phi := \frac{\widehat{p}_1 - \widehat{p}_2}{\sqrt{\widehat{p}_1(1 - \widehat{p}_1)/n + \widehat{p}_2(1 - \widehat{p}_2)/n}}$. Using the approximations of part c, argue that under the null hypothesis $H_0$ (i.e., $p_1 = p_2$), as $n \to \infty$ we have $\phi \sim \mathcal{N}(0, 1 - \rho)$. Explain why the probability that $|\phi| \geq z_{\alpha/2}$ is no longer necessarily bounded by $\alpha$.
f. (Correcting the test statistic) Using the approximations of part c, argue that the probability that $|\phi|/\sqrt{2} \geq z_{\alpha/2}$ is no greater than $\alpha$ under the null hypothesis $H_0$ (i.e., $p_1 = p_2$) as $n \to \infty$. (Hint: the sum/difference of even dependent Gaussian random variables is still Gaussian distributed.)
g. (Dependent Test statistic revisited) Define
$$\widehat{\Delta} = \frac{1}{n}\sum_{i=1}^n \left(\mathbf{1}\{f_1(x_i) \neq y_i\} - \mathbf{1}\{f_2(x_i) \neq y_i\}\right) = \widehat{p}_1 - \widehat{p}_2$$
and
$$\begin{aligned}
\widehat{\sigma}^2 &= \frac{1}{n^2}\sum_{i=1}^n \left(\mathbf{1}\{f_1(x_i) \neq y_i\} - \mathbf{1}\{f_2(x_i) \neq y_i\} - \widehat{\Delta}\right)^2 \\
&= \frac{1}{n^2}\sum_{i=1}^n \left(\mathbf{1}\{f_1(x_i) \neq y_i\} - \widehat{p}_1 + \widehat{p}_2 - \mathbf{1}\{f_2(x_i) \neq y_i\}\right)^2 \\
&= \frac{1}{n^2}\sum_{i=1}^n \left[\left(\mathbf{1}\{f_1(x_i) \neq y_i\} - \widehat{p}_1\right)^2 + \left(\mathbf{1}\{f_2(x_i) \neq y_i\} - \widehat{p}_2\right)^2 - 2\left(\mathbf{1}\{f_1(x_i) \neq y_i\} - \widehat{p}_1\right)\left(\mathbf{1}\{f_2(x_i) \neq y_i\} - \widehat{p}_2\right)\right] \\
&= \frac{\widehat{p}_1(1 - \widehat{p}_1)}{n} + \frac{\widehat{p}_2(1 - \widehat{p}_2)}{n} - \frac{2}{n^2}\sum_{i=1}^n \left(\mathbf{1}\{f_1(x_i) \neq y_i\} - \widehat{p}_1\right)\left(\mathbf{1}\{f_2(x_i) \neq y_i\} - \widehat{p}_2\right),
\end{aligned}$$
so that $\mathbb{E}[\widehat{\Delta}] = p_1 - p_2$ and $\mathbb{E}[\widehat{\sigma}^2] = \mathbb{E}[(\widehat{\Delta} - \mathbb{E}[\widehat{\Delta}])^2]$. If $\psi := \widehat{\Delta}/\sqrt{\widehat{\sigma}^2}$, then as $n \to \infty$ we have that $\psi \sim \mathcal{N}(0, 1)$ using the approximations of part c. Under the null hypothesis $H_0$ (i.e., $p_1 = p_2$), we have shown that both tests, $\{|\phi|/\sqrt{2} \geq z_{\alpha/2}\}$ and $\{|\psi| \geq z_{\alpha/2}\}$, reject the null hypothesis with probability at most $\alpha$ as $n \to \infty$.

Define the power of a test to be the probability that the null hypothesis $H_0$ ($p_1 = p_2$) is rejected when the alternative hypothesis $H_1$ ($p_1 \neq p_2$) is true. Which of these two tests, $\{|\phi|/\sqrt{2} \geq z_{\alpha/2}\}$ or $\{|\psi| \geq z_{\alpha/2}\}$, do you think is more powerful? Justify your answer.
Context: It is common in machine learning to have standard benchmark datasets in order to evaluate whether your new algorithm performs better than past algorithms. If we have access to how both algorithms classified each individual test point, we could use the estimator of part f or part g (both are correct), but note that if we only have access to the empirical means reported in a paper, we can only (safely) use part f.