Description
1 [31 points] Derivation and Proof
(a) [8 points] Consider the linear regression problem for 1D data, where we would like to learn a function
h(x) = w_1 x + w_0 with parameters w_0 and w_1 that minimize the sum of squared errors:
L = \frac{1}{2} \sum_{i=1}^{N} \left( y^{(i)} - h(x^{(i)}) \right)^2    (1)
for N pairs of data samples (x^{(i)}, y^{(i)}). Derive the solution for w_0 and w_1 for this 1D case of linear regression. Show the derivation that leads to the solution
w_0 = \bar{Y} - w_1 \bar{X},    (2)

w_1 = \frac{\frac{1}{N} \sum_{i=1}^{N} x^{(i)} y^{(i)} - \bar{Y}\bar{X}}{\frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^2 - \bar{X}^2}    (3)
where \bar{X} is the mean of \{x^{(1)}, x^{(2)}, \cdots, x^{(N)}\} and \bar{Y} is the mean of \{y^{(1)}, y^{(2)}, \cdots, y^{(N)}\}.
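The derivation itself is left to you; purely as an optional numerical sanity check of Eqs. (2)-(3), a short sketch on synthetic data (names and data here are made up, not taken from the homework files) could look like:

import numpy as np

# Hypothetical sanity check (synthetic data, not from the homework files):
# compare Eq. (2)-(3) against numpy's degree-1 least-squares fit.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)

x_bar, y_bar = x.mean(), y.mean()
w1 = (np.mean(x * y) - y_bar * x_bar) / (np.mean(x ** 2) - x_bar ** 2)  # Eq. (3)
w0 = y_bar - w1 * x_bar                                                  # Eq. (2)

w1_ref, w0_ref = np.polyfit(x, y, deg=1)  # returns [slope, intercept]
assert np.allclose([w0, w1], [w0_ref, w1_ref])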
(b) [14 points] Recall the definition and properties of positive (semi-)definite matrices. Let A be a real, symmetric d × d matrix. A is positive semi-definite (PSD) if z⊤Az ≥ 0 for all z ∈ R^d. A is positive definite (PD) if z⊤Az > 0 for all z ≠ 0. We write A ⪰ 0 when A is PSD, and A ≻ 0 when A is PD.
It is known that every real symmetric matrix A can be factorized via the eigenvalue decomposition A = UΛU⊤, where U is a d × d matrix such that UU⊤ = U⊤U = I and Λ = diag(λ_1, λ_2, · · · , λ_d). Multiplying on the right by U, we see that AU = UΛ. If we let u_i denote the i-th column of U, we have Au_i = λ_i u_i for each i; the λ_i are the eigenvalues of A, and the corresponding columns of U are the eigenvectors associated with λ_i. The eigenvalues constitute the "spectrum" of A.
i. [6 points] Prove that A is PD if and only if λ_i > 0 for each i. (Note: "if and only if" means you are asked to prove both directions.)
ii. [8 points] Consider the linear regression problem where Φ and y are as defined in class; we saw that the closed-form solution becomes (Φ⊤Φ)^{-1}Φ⊤y. Now consider a ridge regression problem with the regularization term (β/2)∥w∥_2^2. We know that the symmetric matrix in the closed-form solution is Φ⊤Φ + βI. Please (i) derive the eigenvalues and eigenvectors of Φ⊤Φ + βI in terms of the eigenvalues and eigenvectors of Φ⊤Φ, denoted λ_i and u_i, and then (ii) for any β > 0, prove that the matrix Φ⊤Φ + βI is PD.
(c) [9 points] In this sub-problem, we use logistic regression to predict the class label y ∈ {−1, +1} instead of y ∈ {0, 1} as in ordinary logistic regression. Show that maximizing the log-likelihood of logistic regression, \sum_{n=1}^{N} \log P(y^{(n)} \mid x^{(n)}), is equivalent to minimizing the following loss function:
\sum_{n=1}^{N} \log\left( 1 + \exp\left( -y^{(n)} \cdot w^\top \phi(x^{(n)}) \right) \right).    (4)
[Hint: You can expand the log-likelihood as follows: \log P(y^{(n)} \mid x^{(n)}) = I(y^{(n)} = 1) \log P(y^{(n)} = 1 \mid x^{(n)}) + I(y^{(n)} = −1) \log P(y^{(n)} = −1 \mid x^{(n)}), then plug in the class posterior probability of the logistic regression model.]
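For reference when applying the hint, the standard logistic regression class posterior with features ϕ(x) is

P(y = 1 \mid x) = \sigma\left( w^\top \phi(x) \right) = \frac{1}{1 + \exp\left( -w^\top \phi(x) \right)}, \qquad P(y = -1 \mid x) = 1 - P(y = 1 \mid x).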
2 [39 points] Linear regression on a polynomial
In this problem, you will implement linear regression on a polynomial. Please have a look at the accompanying starter code linear_regression.py and notebook linear_regression.ipynb for instructions first. Please note that all the sub-questions without (Autograder) need to be answered in your write-up.
Sample data The files q2xTrain.npy, q2xTest.npy, q2yTrain.npy and q2yTest.npy specify a linear regression problem for a polynomial. q2xTrain.npy represents the inputs (x^{(i)} ∈ R) and q2yTrain.npy represents the outputs (y^{(i)} ∈ R) of the training set, with one training example per row.
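A minimal way to load these arrays (a sketch only; the starter notebook may already do this for you):

import numpy as np

# Each .npy file stores one example per row.
x_train = np.load("q2xTrain.npy")
y_train = np.load("q2yTrain.npy")
x_test = np.load("q2xTest.npy")
y_test = np.load("q2yTest.npy")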
2.1 GD and SGD
You will compare the following two optimization methods, in finding the coefficients of a polynomial of
degree one (i.e. slope and intercept) that minimize the training loss.
• Batch gradient descent (GD)
• Stochastic gradient descent (SGD)
Here, as we saw in class, the training objective is defined as:
E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( \sum_{j=0}^{M} w_j \phi_j(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2} \sum_{i=1}^{N} \left( w^\top \phi(x^{(i)}) - y^{(i)} \right)^2    (5)
(a) [12 points] (Autograder) Implement the GD and SGD optimization methods. For all the implementation details (e.g., function signatures, initialization of weight vectors, etc.), follow the instructions given in the code files. Your score for this question will be graded by the correctness of the implementation.
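Follow the exact interface in the starter code; purely as an illustration of the updates implied by objective (5) for a degree-one polynomial, a generic sketch (with made-up names, not the required signatures) might be:

import numpy as np

def features(x):
    # Degree-one polynomial features: phi(x) = (1, x).
    return np.stack([np.ones_like(x), x], axis=1)

def gd_step(w, Phi, y, lr):
    # Full-batch gradient of E(w) = 0.5 * sum_i (w^T phi(x_i) - y_i)^2.
    grad = Phi.T @ (Phi @ w - y)
    return w - lr * grad

def sgd_step(w, phi_i, y_i, lr):
    # Stochastic update from a single example (phi_i, y_i).
    grad = (phi_i @ w - y_i) * phi_i
    return w - lr * grad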
(b) [3 points] Please share the plot generated from section 2(b) of your .ipynb file in your write-up, and then compare the two optimization methods, GD and SGD. Which one takes less time, and which one achieves a lower test objective E(w_test)?
2.2 Over-fitting study
Next, you will investigate the problem of over-fitting. Recall the figure from lecture that explored over-fitting as a function of the degree of the polynomial M. To evaluate this behaviour, we examine the Root-Mean-Square (RMS) error defined below. (Note: we use the RMS error only for evaluation purposes, NOT as a training objective.)
E_{RMS} = \sqrt{2E(w^*)/N},    (6)
Figure 1: (Sample plot) Overfitting study: Train-Test accuracy.
(a) [8 points] (Autograder) In this subquestion, we will use the closed-form solution of linear regression (assuming all of its conditions are met) instead of iterative optimization methods. Implement the closed-form solution for the linear regression problem.
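Conceptually, the closed-form fit and the evaluation metric from Eq. (6) reduce to a few lines; the sketch below uses illustrative names rather than the starter code's required signatures.

import numpy as np

def fit_closed_form(Phi, y):
    # Solve the normal equations Phi^T Phi w = Phi^T y.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def rms_error(Phi, y, w):
    # E_RMS = sqrt(2 E(w) / N) with E(w) = 0.5 * ||Phi w - y||^2.
    residual = Phi @ w - y
    return np.sqrt(residual @ residual / len(y))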
(b) [2 points] Regenerate the above plot with the provided data. The sample training data can be generated with M-degree polynomial features (for M = 0 . . . 9) from q2xTrain.npy and q2yTrain.npy: we assume the feature vector is ϕ(x^{(i)}) = (1, x^{(i)}, (x^{(i)})^2, · · · , (x^{(i)})^M) for any value of M. For the test curve, use the data in q2xTest.npy and q2yTest.npy. Note that the trend of your curve is not necessarily the same as the sample plot. Attach your plot to the write-up.
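One common way to construct the degree-M design matrix for this feature vector (a sketch, in case the starter code does not already provide a helper for it):

import numpy as np

def polynomial_features(x, M):
    # Rows are phi(x^(i)) = (1, x^(i), (x^(i))^2, ..., (x^(i))^M).
    return np.vander(x, N=M + 1, increasing=True)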
(c) [2 points] In the write-up, please discuss: Which degree polynomial would you say best fits the data?
Was there evidence of under/over-fitting the data? Use your generated plots to justify your answer.
2.3 Regularization (Ridge Regression)
Finally, you will explore the role of regularization. Recall the image from lecture that explored the effect of
the regularization factor λ:
Figure 2: (Sample plot) effects of the regularization factor λ.
(a) [8 points] (Autograder) Implement the closed-form solution of ridge regression. Specifically, please use the following regularized objective function:
\frac{1}{2} \sum_{i=1}^{N} \left( w^\top \phi(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \|w\|_2^2    (7)

for optimizing the parameters w.
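Setting the gradient of (7) to zero gives the regularized normal equations (Φ⊤Φ + λI)w = Φ⊤y, involving the symmetric matrix discussed in Problem 1(b); a minimal sketch (illustrative names only, not the autograder's required signature):

import numpy as np

def fit_ridge_closed_form(Phi, y, lam):
    # Solve (Phi^T Phi + lambda * I) w = Phi^T y.
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)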
(b) [2 points] For the sample data, our goal is to regenerate the above plot (Figure 2), but with λ ∈ {10^{-5}, 10^{-4}, . . . , 10^{-1}, 10^{0} (= 1)}. Please first compute the E_{RMS} over the training data specified in q2xTrain.npy and q2yTrain.npy for each λ, and then measure the test error with q2xTest.npy and q2yTest.npy. Please attach your (unregularized) E_{RMS} plot for both the training data and the test data, obtained from 2(g), to the write-up. Note that the trend of your curve is not necessarily the same as the sample plot.
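Combining the hypothetical helpers sketched earlier, the λ sweep for M = 9 can be organized roughly as follows (again only a sketch, not the notebook's required structure):

import numpy as np

lambdas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
Phi_train = polynomial_features(x_train, M=9)
Phi_test = polynomial_features(x_test, M=9)

train_rms, test_rms = [], []
for lam in lambdas:
    w = fit_ridge_closed_form(Phi_train, y_train, lam)
    train_rms.append(rms_error(Phi_train, y_train, w))  # unregularized E_RMS on train
    test_rms.append(rms_error(Phi_test, y_test, w))     # unregularized E_RMS on test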
(c) [2 points] Discuss: Which λ value seemed to work the best for a ninth degree polynomial (M = 9)?
Use your generated plots to justify your answer. Please provide your answer in your write-up.
3 [30 points] Locally weighted linear regression
Consider a linear regression problem in which we want to weight different training examples differently.
Specifically, suppose we want to minimize
E_D(w) = \frac{1}{2} \sum_{i=1}^{N} r^{(i)} \left( w^\top x^{(i)} - y^{(i)} \right)^2,
where r^{(i)} ∈ R is the "local" weight for the sample (x^{(i)}, y^{(i)}). In class, we worked on its special case where all the weights (the r^{(i)}'s) are equal. In this problem, we will generalize some of those ideas to the weighted setting, and also implement the locally weighted linear regression algorithm. [Note: the weight r^{(i)} can be different for each of the data points in the training data.]
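Reading E_D(w) directly as code (illustrative only; X is D × N with one example per column, matching the convention defined in part (a) below):

import numpy as np

def weighted_loss(w, X, y, r):
    # E_D(w) = 0.5 * sum_i r^(i) * (w^T x^(i) - y^(i))^2, with X storing x^(i) as columns.
    residual = X.T @ w - y
    return 0.5 * np.sum(r * residual ** 2)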
(a) [3 points] Show that E_D(w) can also be written as

E_D(w) = (w⊤X − y⊤) R (w⊤X − y⊤)⊤

for an appropriate diagonal matrix R, where X ∈ R^{D×N} is the matrix whose i-th column is x^{(i)} ∈ R^{D×1} and y ∈ R^{N×1} is the vector whose i-th entry is y^{(i)}. Please note that in the locally weighted linear regression of this homework, we use the raw data directly without mapping to high-dimensional features (i.e., each input is represented as a D-dimensional input vector, not an M-dimensional feature vector), which is why we use the notation X instead of Φ. State clearly what the R matrix is.
(b) [7 points] As we already saw in class, if all the r^{(i)}'s equal 1, then the normal equation for w ∈ R^{D×1} becomes

XX⊤w = Xy,

and the value of w^* that minimizes E_D(w) is given by (XX⊤)^{-1}Xy.² Now, by finding the derivative ∇_w E_D(w) from (a) and setting it to zero, generalize the normal equation and the closed-form solution to this locally weighted setting, and give the new value of w^* that minimizes E_D(w) in closed form as a function of X, R, and y. (Hint: ∇_w(RX⊤w) = XR⊤ = XR.)
(c) [8 points] Suppose we have a training set {(x^{(i)}, y^{(i)}); i = 1, . . . , N} of N independent examples, but in which the y^{(i)}'s were observed with differing variances. Specifically, suppose that

p(y^{(i)} \mid x^{(i)}; w) = \frac{1}{\sqrt{2\pi}\,\sigma^{(i)}} \exp\left( -\frac{\left( y^{(i)} - w^\top x^{(i)} \right)^2}{2\left( \sigma^{(i)} \right)^2} \right)

I.e., y^{(i)} is a Gaussian random variable with mean w⊤x^{(i)} and variance (σ^{(i)})^2 (where the σ^{(i)}'s are fixed, known constants). Show that finding the maximum likelihood estimate (MLE) of w reduces to solving a weighted linear regression problem E_D(w). State clearly what the r^{(i)}'s are in terms of the σ^{(i)}'s.
²We presented the row-major (X ∈ R^{N×D}) version of w^* as (X⊤X)^{-1}X⊤y in Lecture 2. For practice, we are using the column-major (X ∈ R^{D×N}) version in this problem.
(d) [12 points, Programming Assignment] In the following, you will use the files q3x.npy, which contains the inputs (x^{(i)} ∈ R^N), and q3y.npy, which contains the outputs (targets) y^{(i)} for a linear regression problem, with one training example per row.³
i. [8 points] (Autograder) Implement the closed-form solution for locally weighted linear regression (see the accompanying code).
ii. [2 points] Use the implemented locally weighted linear regression solver on this dataset (using the weighted normal equations you derived in part (b)), and plot the data and the curve resulting from your fit. When evaluating local regression at a query point x (which is real-valued in this problem), use weights

r^{(i)} = \exp\left( -\frac{\left( x - x^{(i)} \right)^2}{2\tau^2} \right)

with a bandwidth parameter τ ∈ {0.1, 0.3, 0.8, 2, 10}. Please attach the plots generated in 3.(d).ii of your notebook to the write-up.
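The solver interface itself is fixed by the accompanying code; to make the role of τ concrete, the per-query weights above can be computed as in this sketch (illustrative names only):

import numpy as np

def local_weights(x_query, x_train, tau):
    # r^(i) = exp(-(x_query - x^(i))^2 / (2 * tau^2)) for a single query point.
    return np.exp(-((x_query - x_train) ** 2) / (2.0 * tau ** 2))

# Small tau concentrates weight on nearby training points; large tau
# approaches uniform weights (ordinary least squares).
x_train = np.linspace(-1.0, 1.0, 5)
print(local_weights(0.0, x_train, tau=0.1))
print(local_weights(0.0, x_train, tau=10.0))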
iii. [2 points] In the write-up, discuss and comment briefly on what happens to the fit when τ is
too small or too large.