1 [21 points] Logistic regression
(a) [8 points] Consider the log-likelihood function for logistic regression:
$$\ell(w) = \sum_{i=1}^{N} y^{(i)} \log h\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\big(1 - h\big(x^{(i)}\big)\big),$$
where $h(x) = \mathrm{sigmoid}(w^T x) = \frac{1}{1 + \exp(-w^T x)}$.
Find the Hessian $H$ of this function, and show that it is negative semi-definite, so that $\ell$ is concave and has no local maxima other than the global one. That is, show that
$$z^T H z \le 0$$
for any vector $z$. [Hint: You might want to start by showing that $\sum_i \sum_j z_i x_i x_j z_j = (x^T z)^2 \ge 0$.]
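One convenient intermediate form to aim for (a sketch only, with $X$ denoting the design matrix whose $i$-th row is $x^{(i)T}$, and using the hint above):
$$H = \nabla^2_w \ell(w) = -\sum_{i=1}^{N} h\big(x^{(i)}\big)\big(1 - h\big(x^{(i)}\big)\big)\, x^{(i)} x^{(i)T} = -X^T S X, \qquad S = \mathrm{diag}\big( h\big(x^{(i)}\big)\big(1 - h\big(x^{(i)}\big)\big) \big),$$
so that $z^T H z = -\sum_i h\big(x^{(i)}\big)\big(1 - h\big(x^{(i)}\big)\big)\big(x^{(i)T} z\big)^2$, which the hint shows is $\le 0$.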
(b) [8 points] Using the $H$ you calculated in part (a), write down the update rule implied by Newton's method for optimizing $\ell(w)$. Now use this rule (and not a library function) to implement Newton's method and apply it to the binary classification problem specified in the files q1x.npy and q1y.npy. The two columns of q1x.npy represent the inputs ($x^{(i)}$) and q1y.npy represents the outputs ($y^{(i)} \in \{0, 1\}$), with one training example per row. Initialize Newton's method with $w = 0$ (the vector of all zeros). What are the coefficients $w$, including the intercept term, resulting from your fit? [Hint: You might refer to the code in q1_hint.py to start programming.]
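A minimal implementation sketch follows. It assumes q1x.npy and q1y.npy are in the working directory, prepends a column of ones for the intercept, and runs a fixed number of Newton iterations; the hint file q1_hint.py may organize things differently.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Load the data and prepend an intercept column of ones to the two input columns.
X = np.load("q1x.npy")                          # shape (N, 2)
y = np.load("q1y.npy").reshape(-1)              # shape (N,), labels in {0, 1}
X = np.hstack([np.ones((X.shape[0], 1)), X])    # shape (N, 3)

w = np.zeros(X.shape[1])                        # initialize w = 0
for _ in range(20):                             # Newton's method converges in a few iterations
    h = sigmoid(X @ w)
    grad = X.T @ (y - h)                        # gradient of the log-likelihood
    S = np.diag(h * (1.0 - h))
    H = -X.T @ S @ X                            # Hessian from part (a)
    w = w - np.linalg.solve(H, grad)            # Newton update: w := w - H^{-1} grad

print("Fitted coefficients (intercept first):", w)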
(c) [5 points] Plot the training data (your axes should be $x_1$ and $x_2$, corresponding to the two coordinates of the inputs, and you should use a different symbol for each point plotted to indicate whether that example had label 1 or 0). Also plot on the same figure the decision boundary fit by logistic regression. (That is, this should be a straight line showing the boundary separating the region where $h(x) > 0.5$ from where $h(x) \le 0.5$.)
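A plotting sketch, continuing from the variables of the previous sketch (matplotlib is assumed to be available, and w[0] is taken to be the intercept):

import numpy as np
import matplotlib.pyplot as plt

x1, x2 = X[:, 1], X[:, 2]
plt.scatter(x1[y == 1], x2[y == 1], marker="o", label="y = 1")
plt.scatter(x1[y == 0], x2[y == 0], marker="x", label="y = 0")

# Decision boundary: w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w0 + w1*x1) / w2
xs = np.linspace(x1.min(), x1.max(), 100)
plt.plot(xs, -(w[0] + w[1] * xs) / w[2], "k-", label="decision boundary")
plt.xlabel("x1"); plt.ylabel("x2"); plt.legend(); plt.show()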
2 [32 points] Linear regression on a polynomial
The files q2xTrain.npy, q2xTest.npy, q2yTrain.npy and q2yTest.npy specify a linear regression problem for a polynomial. q2xTrain.npy represents the inputs ($x^{(i)} \in \mathbb{R}$) and q2yTrain.npy represents the outputs ($y^{(i)} \in \mathbb{R}$) of the training set, with one training example per row.
(a) [12 points] For each of the following optimization methods, find the coefficients (slope and intercept) that minimize the error for a first-degree polynomial.
• Batch gradient descent
• Stochastic gradient descent
• Newton’s method
i. [9 points] Give the coefficients generated by each of the three optimization methods.
ii. [3 points] How did the methods compare in terms of rate of convergence? [Hint: The training process can be viewed as convergent when the mean squared error on the training dataset is low enough (e.g., $\le 0.2$) and stable.]
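A possible sketch of the three optimizers for this objective is shown below; the step sizes, iteration counts, and mean-squared-error convention are illustrative assumptions rather than prescribed values.

import numpy as np

Xtr = np.load("q2xTrain.npy").reshape(-1)
ytr = np.load("q2yTrain.npy").reshape(-1)
Phi = np.vstack([np.ones_like(Xtr), Xtr]).T     # degree-1 design matrix [1, x]

def mse(w):
    return np.mean((Phi @ w - ytr) ** 2)

# Batch gradient descent
w_bgd = np.zeros(2)
for _ in range(10000):
    w_bgd -= 0.01 * Phi.T @ (Phi @ w_bgd - ytr) / len(ytr)

# Stochastic gradient descent (one example per update)
w_sgd = np.zeros(2)
for epoch in range(200):
    for i in np.random.permutation(len(ytr)):
        w_sgd -= 0.01 * (Phi[i] @ w_sgd - ytr[i]) * Phi[i]

# Newton's method (for a quadratic objective this is the normal equation in one step)
w_newton = np.linalg.solve(Phi.T @ Phi, Phi.T @ ytr)

print(w_bgd, w_sgd, w_newton, mse(w_newton))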
(b) [10 points] Next, you will investigate the problem of over-fitting. Recall the figure from lecture that explored over-fitting as a function of the degree of the polynomial $M$, where the Root-Mean-Square (RMS) error is defined as
$$E_{RMS} = \sqrt{2 E(w^*) / N},$$
where
$$E(w) = \frac{1}{2} \sum_{i=1}^{N} \Bigg( \sum_{j=0}^{M} w_j \phi_j\big(x^{(i)}\big) - y^{(i)} \Bigg)^{2} = \frac{1}{2} \sum_{i=1}^{N} \Big( w^T \phi\big(x^{(i)}\big) - y^{(i)} \Big)^{2}.$$
i. [7 points] Using Newton's method, find the coefficients that minimize the error for an $M$-degree polynomial (for $M = 0, \ldots, 9$) for the training data specified in q2xTrain.npy and q2yTrain.npy. Now use these parameters to regenerate the above chart, using q2xTest.npy and q2yTest.npy as the test data.
ii. [3 points] Which degree polynomial would you say best fits the data? Was there evidence of
under/over-fitting the data? Use your generated charts to defend your answer.
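A sketch of the $M$-degree fit and the $E_{RMS}$ computation (file-layout assumptions as in Problem 1; since $E(w)$ is quadratic in $w$, a single Newton step is equivalent to solving the normal equation, which is what the sketch does):

import numpy as np

def poly_features(x, M):
    # phi_j(x) = x^j for j = 0..M
    return np.vander(x, M + 1, increasing=True)

def fit_newton(x, y, M):
    Phi = poly_features(x, M)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def erms(w, x, y, M):
    Phi = poly_features(x, M)
    E = 0.5 * np.sum((Phi @ w - y) ** 2)
    return np.sqrt(2.0 * E / len(y))

xtr, ytr = np.load("q2xTrain.npy").reshape(-1), np.load("q2yTrain.npy").reshape(-1)
xte, yte = np.load("q2xTest.npy").reshape(-1), np.load("q2yTest.npy").reshape(-1)
for M in range(10):
    w = fit_newton(xtr, ytr, M)
    print(M, erms(w, xtr, ytr, M), erms(w, xte, yte, M))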
(c) [10 points] Finally, you will explore the role of regularization. Recall the image from lecture that explored the effect of the regularization factor $\lambda$:
i. [7 points] Using Newton's method, find the coefficients that minimize the error for a ninth-degree polynomial ($M = 9$) given a regularization factor $\lambda$ (for $\lambda \in \{0, 10^{-6}, 10^{-5}, \ldots, 10^{-1}, 10^{0}\,(=1)\}$) for the training data specified in q2xTrain.npy and q2yTrain.npy. Now use these parameters to plot the $E_{RMS}$ of both the training data and the test data as a function of $\lambda$ (regenerate the above chart, using q2xTest.npy and q2yTest.npy as the test data). Specifically, use the regularized objective function
$$\frac{1}{2} \sum_{i=1}^{N} \Big( w^T \phi\big(x^{(i)}\big) - y^{(i)} \Big)^{2} + \frac{\lambda}{2} \|w\|_2^2$$
for optimizing the parameters $w$, but please use the original (unregularized) $E_{RMS}$ for plotting.
ii. [3 points] Which λ value seemed to work the best?
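A ridge-regression sketch, reusing poly_features, erms, and the data arrays from the sketch in part (b); the closed-form minimizer of the quadratic regularized objective is used in place of an explicit Newton loop, which is equivalent here.

import numpy as np

def fit_ridge(x, y, M, lam):
    Phi = np.vander(x, M + 1, increasing=True)
    # Minimizer of (1/2)||Phi w - y||^2 + (lam/2)||w||^2
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)

lambdas = [0.0] + [10.0 ** k for k in range(-6, 1)]   # {0, 1e-6, ..., 1e-1, 1}
for lam in lambdas:
    w = fit_ridge(xtr, ytr, 9, lam)
    # Compare/plot using the unregularized E_RMS from part (b)
    print(lam, erms(w, xtr, ytr, 9), erms(w, xte, yte, 9))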
3 [24 points] Locally weighted linear regression
Consider a linear regression problem in which we want to weight different training examples differently. Specifically, suppose we want to minimize
$$E_D(w) = \frac{1}{2} \sum_{i=1}^{N} r^{(i)} \Big( w^T x^{(i)} - y^{(i)} \Big)^{2}.$$
In class, we worked out what happens for the case where all the weights (the $r^{(i)}$'s) are the same. In this problem, we will generalize some of those ideas to the weighted setting, and also implement the locally weighted linear regression algorithm. (Note that the weight $r^{(i)}$ can be different for each training example.)
(a) [2 points] Show that $E_D(w)$ can also be written
$$E_D(w) = (Xw - y)^T R (Xw - y)$$
for an appropriate diagonal matrix $R$, and where $X$ is a matrix whose $i$-th row is $x^{(i)}$ and $y$ is the vector whose $i$-th entry is $y^{(i)}$. State clearly what $R$ is.
(b) [6 points] If all the $r^{(i)}$'s equal 1, then we saw in class that the normal equation is
$$X^T X w = X^T y,$$
and that the value of $w^*$ that minimizes $E_D(w)$ is given by $(X^T X)^{-1} X^T y$. By finding the derivative $\nabla_w E_D(w)$ and setting it to zero, generalize the normal equation to this weighted setting, and give the new value of $w^*$ that minimizes $E_D(w)$ in closed form as a function of $X$, $R$, and $y$.
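For orientation, the weighted analogue one should expect to arrive at (a sketch of the target form, assuming $X^T R X$ is invertible) is
$$X^T R X\, w = X^T R\, y \quad\Longrightarrow\quad w^* = (X^T R X)^{-1} X^T R\, y.$$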
(c) [5 points] Suppose we have a training set $\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, N\}$ of $N$ independent examples, but in which the $y^{(i)}$'s were observed with differing variances. Specifically, suppose that
$$p\big(y^{(i)} \mid x^{(i)}; w\big) = \frac{1}{\sqrt{2\pi}\,\sigma^{(i)}} \exp\!\left( -\frac{\big(y^{(i)} - w^T x^{(i)}\big)^2}{2\big(\sigma^{(i)}\big)^2} \right).$$
That is, $y^{(i)}$ has mean $w^T x^{(i)}$ and variance $\big(\sigma^{(i)}\big)^2$ (where the $\sigma^{(i)}$'s are fixed, known constants). Show that finding the maximum likelihood estimate of $w$ reduces to solving a weighted linear regression problem. State clearly what the $r^{(i)}$'s are in terms of the $\sigma^{(i)}$'s.
(d) [11 points] The following items will use the files q3x.npy, which contains the inputs ($x^{(i)}$), and q3y.npy, which contains the outputs ($y^{(i)}$) for a linear regression problem, with one training example per row.
i. [2 points] Implement (unweighted) linear regression ($y = w^T x$) on this dataset (using the normal equations), and plot on the same figure the data and the straight line resulting from your fit. (Remember to include the intercept term.)
ii. [6 points] Implement locally weighted linear regression on this dataset (using the weighted normal equations you derived in part (b)), and plot on the same figure the data and the curve resulting from your fit. When evaluating $h(\cdot)$ at a query point $x$ (which is real-valued in this problem), use weights
$$r^{(i)} = \exp\!\left( -\frac{\big(x - x^{(i)}\big)^2}{2\tau^2} \right)$$
with a bandwidth parameter $\tau = 0.8$. (Again, remember to include the intercept term.)
iii. [3 points] Repeat (ii) four times with $\tau = 0.1$, $0.3$, $2$, and $10$. Comment briefly on what happens to the fit when $\tau$ is too small or too large.
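A sketch covering items i.-iii., assuming q3x.npy is one-dimensional and adding an explicit intercept column (matplotlib assumed available for the plots):

import numpy as np
import matplotlib.pyplot as plt

x = np.load("q3x.npy").reshape(-1)
y = np.load("q3y.npy").reshape(-1)
X = np.vstack([np.ones_like(x), x]).T                       # design matrix with intercept term

# (i) Unweighted linear regression via the normal equations
w_unweighted = np.linalg.solve(X.T @ X, X.T @ y)

# (ii)/(iii) Locally weighted linear regression evaluated at a grid of query points
def lwr_predict(x_query, tau):
    r = np.exp(-((x_query - x) ** 2) / (2.0 * tau ** 2))    # weights r^(i)
    R = np.diag(r)
    w = np.linalg.solve(X.T @ R @ X, X.T @ R @ y)           # weighted normal equation from part (b)
    return np.array([1.0, x_query]) @ w

grid = np.linspace(x.min(), x.max(), 200)
plt.scatter(x, y, s=10, label="data")
plt.plot(grid, w_unweighted[0] + w_unweighted[1] * grid, "k--", label="unweighted fit")
for tau in [0.1, 0.3, 0.8, 2.0, 10.0]:
    preds = np.array([lwr_predict(q, tau) for q in grid])
    plt.plot(grid, preds, label=f"tau = {tau}")
plt.legend(); plt.show()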
4 Derivation and Proof
(a) [8 points] Consider the linear regression problem for 1D data, where we would like to learn a function $h(x) = w_1 x + w_0$ with parameters $w_0$ and $w_1$ to minimize the sum squared error
$$L = \frac{1}{2} \sum_{i=1}^{N} \big( y^{(i)} - h\big(x^{(i)}\big) \big)^2$$
for $N$ pairs of data samples $(x^{(i)}, y^{(i)})$. Derive the solution for $w_0$ and $w_1$ for this 1D case of linear regression. Show the derivation to get the solution
$$w_0 = \bar{Y} - w_1 \bar{X}, \qquad w_1 = \frac{\frac{1}{N}\sum_{i=1}^{N} x^{(i)} y^{(i)} - \bar{X}\bar{Y}}{\frac{1}{N}\sum_{i=1}^{N} \big(x^{(i)}\big)^2 - \bar{X}^2},$$
where $\bar{X}$ is the mean of $\{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ and $\bar{Y}$ is the mean of $\{y^{(1)}, y^{(2)}, \ldots, y^{(N)}\}$.
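One way to organize the derivation (a sketch only; the remaining algebra is the substitution step) is to set both partial derivatives of $L$ to zero:
$$\frac{\partial L}{\partial w_0} = -\sum_{i=1}^{N} \big( y^{(i)} - w_1 x^{(i)} - w_0 \big) = 0, \qquad \frac{\partial L}{\partial w_1} = -\sum_{i=1}^{N} \big( y^{(i)} - w_1 x^{(i)} - w_0 \big)\, x^{(i)} = 0.$$
The first equation gives $w_0 = \bar{Y} - w_1 \bar{X}$; substituting this into the second yields the stated expression for $w_1$.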
(b) [15 points] Consider the definition and properties of positive (semi-)definite matrices. Let $A$ be a real, symmetric $d \times d$ matrix. $A$ is positive semi-definite (PSD) if, for all $z \in \mathbb{R}^d$, $z^T A z \ge 0$. $A$ is positive definite (PD) if, for all $z \neq 0$, $z^T A z > 0$. We write $A \succeq 0$ when $A$ is PSD, and $A \succ 0$ when $A$ is PD. The spectral theorem says that every real symmetric matrix $A$ can be expressed via the spectral decomposition $A = U \Lambda U^T$, where $U$ is a $d \times d$ matrix such that $U U^T = U^T U = I$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$. Multiplying on the right by $U$, we see that $AU = U\Lambda$. If we let $u_i$ denote the $i$-th column of $U$, we have $A u_i = \lambda_i u_i$ for each $i$. The $\lambda_i$ are the eigenvalues of $A$, and the corresponding columns of $U$ are the eigenvectors associated with the $\lambda_i$. The eigenvalues constitute the "spectrum" of $A$, and the spectral decomposition is also called the eigenvalue decomposition of $A$.
i. [6 points] Prove that $A$ is PD if and only if $\lambda_i > 0$ for each $i$.
ii. [9 points] Consider the linear regression problem where $\Phi$ and $y$ are as defined in class and the closed-form solution is $(\Phi^T \Phi)^{-1} \Phi^T y$. We can get the eigenvalues of the symmetric matrix $\Phi^T \Phi$ using the spectral decomposition. We apply ridge regression, and the symmetric matrix in the solution becomes $\Phi^T \Phi + \beta I$. Prove that ridge regression has the effect of shifting all singular values by a constant $\beta$, and that for any $\beta > 0$, ridge regression makes the matrix $\Phi^T \Phi + \beta I$ PD.
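A sketch of the key spectral-decomposition step one can build the proof around (using the notation introduced above):
$$\Phi^T \Phi + \beta I = U \Lambda U^T + \beta U U^T = U (\Lambda + \beta I) U^T,$$
so each eigenvalue $\lambda_i$ of $\Phi^T \Phi$ is shifted to $\lambda_i + \beta$; since $\Phi^T \Phi$ is PSD, $\lambda_i \ge 0$, hence $\lambda_i + \beta > 0$ for any $\beta > 0$, which together with part i. gives positive definiteness.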
Credits
Some questions adopted/adapted from http://www.stanford.edu/class/cs229/ps/ps1.pdf and from Bishop
PRML.