Description
1. Linear Regression.
(a) Consider a linear regression problem with a dataset containing N data points {(x^{(i)}, y^{(i)})}_{i=1}^{N}, where x^{(i)} ∈ R^d. The accumulated loss function is given by:

    L_OLS(w) = (1/2) ||Xw − y||_2^2

where y ∈ R^N, X ∈ R^{N×(d+1)}, and w ∈ R^{d+1}.
i. Find the Hessian matrix of L_OLS(w).
Hint: You may want to use the fact that ||Xw − y||_2^2 = (Xw − y)^⊤(Xw − y).
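A numerical sanity check (not required for the proof): the sketch below assumes a small random X and uses PyTorch's autograd to print the Hessian of L_OLS, so you can compare it against the expression you derive.

    import torch
    from torch.autograd.functional import hessian

    N, d = 5, 3
    X = torch.randn(N, d + 1)     # design matrix with the bias column already included
    y = torch.randn(N)

    def loss(w):
        # L_OLS(w) = 1/2 * ||Xw - y||_2^2
        r = X @ w - y
        return 0.5 * (r @ r)

    w0 = torch.zeros(d + 1)
    H = hessian(loss, w0)         # (d+1) x (d+1) Hessian evaluated at w0
    print(H)                      # compare against your derived expression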
ii. Recall that a twice continuously differentiable function f(x) is strictly convex if and only if its Hessian is positive definite for all x. Prove that if N is less than the input dimension d, L_OLS(w) cannot be strictly convex.
iii. Prove that, no matter what X is, for all w1, w2 ∈ arg min_{w ∈ R^{d+1}} L_OLS(w), we have Xw1 = Xw2. Note that w1 and w2 can be different.
Hint: Use the convexity of the loss function and convex combinations of w1 and w2.
(b) Consider the same dataset with an L2-norm regularization added to the OLS loss function. Linear regression with L2 regularization is also called ridge regression. Recall the composite loss function of ridge regression:

    L_ridge(w) = (1/2) ||Xw − y||_2^2 + (λ/2) ||w||_2^2
i. One advantage of ridge regression is that for a positive regularization constant (λ > 0), the matrix (X^⊤X + λI) is always invertible. Prove that the matrix (X^⊤X + λI) is invertible by showing that it is positive definite.
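As an illustration only (a sketch on a small random X, not part of the required proof), the following checks numerically that the eigenvalues of X^⊤X + λI are strictly positive even when N < d + 1:

    import torch

    N, d, lam = 3, 5, 0.1              # N < d + 1, so X^T X alone is singular
    X = torch.randn(N, d + 1)
    A = X.t() @ X + lam * torch.eye(d + 1)
    print(torch.linalg.eigvalsh(A))    # every eigenvalue should be >= lam > 0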
ii. Knowing that L_ridge(w) is a convex function, show that the estimator for ridge regression is:

    ŵ = (X^⊤X + λI)^{−1} X^⊤ y
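For intuition (a sketch on toy data, assuming λ > 0), you can compute this estimator with a linear solve and verify that the gradient of L_ridge vanishes there:

    import torch

    N, d, lam = 10, 3, 0.5
    X = torch.randn(N, d + 1)
    y = torch.randn(N)

    # w_hat = (X^T X + lam I)^{-1} X^T y, computed via a linear solve
    w_hat = torch.linalg.solve(X.t() @ X + lam * torch.eye(d + 1), X.t() @ y)

    w = w_hat.clone().requires_grad_(True)
    loss = 0.5 * torch.sum((X @ w - y) ** 2) + 0.5 * lam * torch.sum(w ** 2)
    loss.backward()
    print(w.grad)                      # should be numerically zero at the minimizer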
Solution.
2. Programming – Linear Regression.
Recall that the empirical risk in the linear regression method is defined as

    R̂(w) := (1/(2N)) Σ_{i=1}^{N} (w^⊤x^{(i)} − y^{(i)})^2,

where x^{(i)} ∈ R^d is a data point and y^{(i)} is an associated label.
(a) Implement linear regression using gradient descent in the linear_gd(X, Y, lrate, num_iter) function of hw1.py.
The arguments for this function are: X, the training features, a tensor with shape N × d; Y, the training labels, an N × 1 tensor; lrate, the learning rate (step size), a float; and num_iter, the number of iterations for gradient descent to run. The objective of this function is to find parameters w that minimize the empirical risk R̂(w) using gradient descent (only gradient descent).
To stay consistent with the reference implementation and be scored correctly, prepend a column of ones to X to accommodate the bias term in w, so that w has d + 1 entries. Then use w = 0 as the initial parameters, and return w.
Hint. If you are new to machine learning or to programming with PyTorch, we offer some suggestions. First, try using the vector/matrix operations provided in PyTorch and avoid for-loops; this will improve both the efficiency and the style of your program. Second, create your own test cases for debugging before submission; with very few samples in your own test case, it is easy to compare the program output with a manual calculation. Third, to avoid matrix computation errors, check the shapes of your tensors regularly. A rough sketch of the vectorized update appears after the library routines below.
Library routines: torch.matmul (@), torch.tensor.shape, torch.tensor.t, torch.cat,
torch.ones, torch.zeros, torch.reshape.
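The following is a rough sketch of one way to structure the vectorized update, under the conventions above; it is not the reference implementation, and the details remain yours to verify.

    import torch

    def linear_gd(X, Y, lrate, num_iter):
        # Sketch: plain gradient descent on R_hat(w) = 1/(2N) * ||X1 w - Y||_2^2.
        N = X.shape[0]
        X1 = torch.cat((torch.ones(N, 1), X), dim=1)   # prepend a column of ones (bias)
        w = torch.zeros(X1.shape[1], 1)                # initialize w = 0
        for _ in range(num_iter):
            grad = X1.t() @ (X1 @ w - Y) / N           # gradient of the empirical risk
            w = w - lrate * grad
        return w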
(b) Implement linear regression by using the pseudo-inverse to solve for w in the linear_normal(X, Y) function of hw1.py.
The arguments for this function are: X, the training features, a tensor with shape N × d; Y, the training labels, an N × 1 tensor. To stay consistent with the reference implementation and be scored correctly, prepend a column of ones to X to accommodate the bias term in w, so that w has d + 1 entries.
Library routines: torch.matmul (@), torch.cat, torch.ones, torch.pinverse.
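A sketch of the pseudo-inverse approach, assuming the same bias-column convention as above:

    import torch

    def linear_normal(X, Y):
        # Sketch: least-squares solution via the pseudo-inverse, w = pinv(X1) @ Y
        X1 = torch.cat((torch.ones(X.shape[0], 1), X), dim=1)
        return torch.pinverse(X1) @ Y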
(c) Implement the plot_linear() function in hw1.py. Follow the steps below.
Use the provided function hw1_utils.load_reg_data() to generate a training set X and training labels Y. Then use linear_normal() to compute the regression weights w. Finally, plot the dataset points and the regressed curve, and return the plot as output. Note that plot_linear() should return the figure object, and you should include the visualization in your written submission.
Hint. If you are new to plotting machine learning visualizations, we offer some suggestions. matplotlib.pyplot is an extremely useful tool in machine learning, commonly referred to as plt. First get to know its most basic usage through the examples on its official website (scatter plots, line plots, etc.). For this programming question specifically, you can divide and conquer: first plot the points in the dataset, then plot the linear regression curve.
Library routines: torch.matmul (@), torch.cat, torch.ones, plt.plot, plt.scatter,
plt.show, plt.gcf where plt refers to the matplotlib.pyplot library.
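One possible shape for this function is sketched below; it assumes linear_normal from part (b), the provided hw1_utils.load_reg_data(), and one-dimensional features so that a single line can be drawn.

    import torch
    import matplotlib.pyplot as plt
    import hw1_utils

    def plot_linear():
        # Sketch assuming 1-D features: scatter the data, then draw the fitted line.
        X, Y = hw1_utils.load_reg_data()
        w = linear_normal(X, Y)
        plt.scatter(X[:, 0], Y[:, 0], label="data")
        xs = torch.linspace(float(X.min()), float(X.max()), 100).reshape(-1, 1)
        ys = torch.cat((torch.ones(100, 1), xs), dim=1) @ w
        plt.plot(xs.flatten(), ys.flatten(), color="red", label="regression line")
        plt.legend()
        return plt.gcf()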
Solution.
3. Programming – Logistic Regression.
Recall the empirical risk R̂ for logistic regression (as presented in lecture 3):

    R̂_log(w) = (1/N) Σ_{i=1}^{N} log(1 + exp(−y^{(i)} w^⊤ x^{(i)})).
Here you will minimize this risk using gradient descent.
(a) In your written submission, derive the gradient descent update rule for this empirical risk by taking the gradient. Write your answer in terms of the learning rate (step size) η, the previous parameters w, the new parameters w′, the number of examples N, and the training examples x^{(i)}. Show all of your steps.
(b) Implement the logistic() function in hw1.py. You are given as input a training set X, training labels Y, a learning rate (step size) lrate, and a number of gradient updates num_iter. Implement gradient descent to find parameters w that minimize the empirical risk R̂_log(w). Perform gradient descent for num_iter updates with a learning rate (step size) of lrate. As in the previous questions, initialize w = 0, return w as output, and prepend X with a column of ones.
Library routines: torch.matmul (@), torch.tensor.t, torch.exp.
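A sketch of one possible implementation follows; it uses autograd for the gradient so as not to give away part (a), assumes Y holds ±1 labels as in the risk above, and you can substitute the closed-form gradient you derive.

    import torch

    def logistic(X, Y, lrate, num_iter):
        # Sketch: gradient descent on R_log(w) = 1/N * sum log(1 + exp(-y_i * w^T x_i)),
        # with the gradient obtained from autograd rather than a hand-derived formula.
        N = X.shape[0]
        X1 = torch.cat((torch.ones(N, 1), X), dim=1)
        w = torch.zeros(X1.shape[1], 1, requires_grad=True)
        for _ in range(num_iter):
            risk = torch.log(1 + torch.exp(-Y * (X1 @ w))).mean()
            grad, = torch.autograd.grad(risk, w)
            with torch.no_grad():
                w -= lrate * grad
        return w.detach()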
(c) Implement the logistic_vs_ols() function in hw1.py. Use hw1_utils.load_logistic_data() to generate a training set X and training labels Y. Run logistic(X, Y) from part (b), taking X and Y as input, to obtain parameters w (use the defaults for num_iter and lrate). Also run linear_gd(X, Y) from Problem 2 to obtain parameters w. Plot the decision boundaries for your logistic regression and least squares models along with the data X. [Note: As we learned in class, the decision rule of least squares and logistic regression for predicting the class label is sign(ŵ^⊤x), so the decision boundary can be obtained from ŵ^⊤x = 0, i.e., for d = 2, we have x_2 = −(ŵ_0 + ŵ_1 x_1)/ŵ_2.] Include the visualizations in your written submission. Which model appears to classify the data better? Explain in your written submission why you believe it is better for this problem.
Library routines: torch.linspace, plt.scatter, plt.plot, plt.show, plt.gcf.
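A sketch of the plotting step is below; plot_boundaries is an illustrative helper name (not the required logistic_vs_ols signature), it assumes 2-D features, and it takes weight vectors w_log and w_ols already obtained as described above, drawing each boundary with the formula from the note.

    import torch
    import matplotlib.pyplot as plt
    import hw1_utils

    def plot_boundaries(w_log, w_ols):
        # Sketch: scatter the 2-D data, then draw each model's line w0 + w1*x1 + w2*x2 = 0.
        X, Y = hw1_utils.load_logistic_data()
        plt.scatter(X[:, 0], X[:, 1], c=Y.flatten())
        x1 = torch.linspace(float(X[:, 0].min()), float(X[:, 0].max()), 100)
        for w, name in ((w_log, "logistic"), (w_ols, "least squares")):
            w = w.flatten()
            x2 = -(w[0] + w[1] * x1) / w[2]        # boundary: w^T [1, x1, x2] = 0
            plt.plot(x1, x2, label=name)
        plt.legend()
        return plt.gcf()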
Solution.
4. Convexity, Lipschitz Continuity, and Smoothness
(a) Convexity
i. Show that if a function f : R^n → R is convex, then for any matrix A ∈ R^{n×m} and vector b ∈ R^n, the function g : R^m → R defined by g(x) = f(Ax + b) is convex, where x ∈ R^m.
ii. Prove that if the differentiable function f is λ-strongly convex and the differentiable function g is convex, then f + g is λ-strongly convex.
iii. Given m convex functions {f_i : R^n → R}_{i=1}^{m}, denote

    f(x) = max_{i∈[m]} f_i(x),

where [m] = {1, 2, . . . , m}. Prove that f is convex.
(b) Lipschitzness and Smoothness
We say a function f : R^n → R^d is ρ-Lipschitz if for all x_1, x_2 ∈ R^n, it holds that ||f(x_1) − f(x_2)||_2 ≤ ρ||x_1 − x_2||_2.
i. Prove that if f : R^n → R^m and g : R^m → R^d are ρ-Lipschitz functions, then the composite g ∘ f : R^n → R^d defined by (g ∘ f)(x) = g(f(x)) is ρ^2-Lipschitz.
ii. Given a differentiable function f : R^n → R whose gradient is β-Lipschitz, prove that for all x, y ∈ R^n we have

    f(y) − f(x) ≤ ∇f(x)^⊤(y − x) + (β/2) ||y − x||_2^2.
Hint: You are not required to follow these hints, but consider them if you have no idea how to start the proof. (1) Define the tool function φ(t) = f((1 − t)x + ty); then f(y) − f(x) = φ(1) − φ(0) = ∫_0^1 ... (figure it out). (2) If you get stuck at the final steps, taking a look at the Cauchy–Schwarz inequality may be helpful.
Solution.