Description
Problem 1 (10 points) Different class conditional probabilities. Consider a classification problem with features in $\mathbb{R}^d$ and labels in $\{-1, +1\}$. Consider the class of linear classifiers of the form $(\bar{w}, 0)$, namely all the classifiers (hyperplanes) that pass through the origin (i.e., $t = 0$). Instead of logistic regression, suppose the class probabilities are given by the following function, where $\bar{X} \in \mathbb{R}^d$ is the feature vector:
\[
P\left(y = +1 \mid \bar{X}, \bar{w}\right) = \frac{1}{2}\left(1 + \frac{\bar{w} \cdot \bar{X}}{\sqrt{1 + (\bar{w} \cdot \bar{X})^2}}\right), \tag{1}
\]
where $\bar{w} \cdot \bar{X}$ is the dot product between $\bar{w}$ and $\bar{X}$.
Suppose we obtain $n$ examples $(\bar{X}_i, y_i)$ for $i = 1, \ldots, n$.
1. Show that the log-likelihood function is
\[
J(\bar{w}) = -n \log 2 + \sum_{i=1}^{n} \log\left(1 + \frac{y_i (\bar{w} \cdot \bar{X}_i)}{\sqrt{1 + (\bar{w} \cdot \bar{X}_i)^2}}\right). \tag{2}
\]
2. Compute the gradient of $J(\bar{w})$ and write one step of gradient ascent. Namely, fill in the blank:
\[
\bar{w}_{j+1} = \bar{w}_j + \eta \cdot \underline{\hspace{3cm}}
\]
Hint: use the chain rule and $\nabla_{\bar{w}} (\bar{w} \cdot \bar{X}) = \bar{X}$. (A small numerical sketch of this model appears right after this problem.)
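As a quick sanity check (not something you need to submit), here is a minimal NumPy sketch that evaluates the class probability in (1) and the log-likelihood in (2) on made-up toy data; the function names and the data are illustrative assumptions, not part of the assignment.
\begin{verbatim}
import numpy as np

def prob_plus(X, w):
    # P(y = +1 | X_i, w) from equation (1); X has shape (n, d), w has shape (d,)
    s = X @ w
    return 0.5 * (1.0 + s / np.sqrt(1.0 + s ** 2))

def log_likelihood(X, y, w):
    # J(w) from equation (2); y holds labels in {-1, +1}
    s = X @ w
    return -len(y) * np.log(2.0) + np.sum(np.log(1.0 + y * s / np.sqrt(1.0 + s ** 2)))

# Toy data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([+1, -1, +1, +1, -1])
print(prob_plus(X, np.ones(3)))            # per-example P(y = +1 | X_i, w)
print(log_likelihood(X, y, np.zeros(3)))   # equals -5 * log(2) when w = 0
\end{verbatim}
Comparing a finite-difference approximation of $\nabla J$ against your hand-derived gradient is one way to check your answer before plugging it into the gradient-ascent step.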
In Problem 2 and Problem 3 we will study linear regression. We will assume in both problems that $w^0 = 0$. (This can be arranged by translating the features and labels to have mean zero, but we will not worry about it.) For $\bar{w} = (w^1, \ldots, w^d)$ and $\bar{X} = (\bar{X}^1, \ldots, \bar{X}^d)$, the regression we want is:
\[
y = w^1 \bar{X}^1 + \ldots + w^d \bar{X}^d = \bar{w} \cdot \bar{X}. \tag{3}
\]
We consider the following regularized least squares objective, which is called Ridge Regression. For the dataset $S = \{(\bar{X}_1, y_1), \ldots, (\bar{X}_n, y_n)\}$,
\[
J(\bar{w}, \lambda) = \sum_{i=1}^{n} \left(y_i - \bar{w} \cdot \bar{X}_i\right)^2 + \lambda \cdot \|\bar{w}\|_2^2.
\]
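For concreteness, a minimal sketch of evaluating this objective, under the assumption that the data are stored as NumPy arrays X of shape (n, d) and y of shape (n,); the function name is made up for illustration.
\begin{verbatim}
import numpy as np

def ridge_objective(X, y, w, lam):
    # J(w, lambda) = sum_i (y_i - w . X_i)^2 + lam * ||w||_2^2
    residuals = y - X @ w
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)

# Example: at w = 0 the objective is simply the sum of squared labels,
# e.g. ridge_objective(X, y, np.zeros(X.shape[1]), 1.0)
\end{verbatim}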
Problem 2 (10 points) Gradient Descent for regression.
1. Instead of using the closed-form expression we mentioned in class, suppose we want to perform gradient descent to find the optimal solution for $J(\bar{w}, \lambda)$. Please compute the gradient of $J$, and write one step of gradient descent with step size $\eta$. (A finite-difference sanity check is sketched right after this problem.)
2. Suppose we get a new point $\bar{X}_{n+1}$. What will the predicted $y_{n+1}$ be when $\lambda \to \infty$?
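As mentioned in part 1, here is a hedged sketch of a numerical sanity check: it runs gradient-descent steps on $J(\bar{w}, \lambda)$ using a central finite-difference approximation of the gradient, so you can compare the trajectory against your hand-derived update. The toy data, step size, and regularizer value are arbitrary assumptions.
\begin{verbatim}
import numpy as np

def ridge_objective(X, y, w, lam):
    # Same J(w, lambda) as in the sketch above.
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

def numerical_gradient(f, w, eps=1e-6):
    # Central finite differences -- only a check, not the closed-form gradient you derive.
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# Toy data and hyper-parameters, chosen arbitrarily for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
w, lam, eta = np.zeros(3), 1.0, 0.005

for _ in range(200):
    w = w - eta * numerical_gradient(lambda v: ridge_objective(X, y, v, lam), w)

print(w, ridge_objective(X, y, w, lam))
\end{verbatim}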
Problem 3 (15 points) Regularization increases training error. In class we said that when we regularize, we expect to get weight vectors with smaller norms, but we never proved it. We also displayed a plot showing that the training error increases as we regularize more (larger $\lambda$). In this assignment, we will formalize these intuitions rigorously (an empirical illustration appears after this problem).
Let $0 < \lambda_1 < \lambda_2$ be two regularizer values. Let $\bar{w}_1$ and $\bar{w}_2$ be the minimizers of $J(\bar{w}, \lambda_1)$ and $J(\bar{w}, \lambda_2)$, respectively.
1. Show that $\|\bar{w}_1\|_2^2 \ge \|\bar{w}_2\|_2^2$. Therefore more regularization implies a smaller norm of the solution!
Hint: Observe that $J(\bar{w}_1, \lambda_1) \le J(\bar{w}_2, \lambda_1)$ and $J(\bar{w}_2, \lambda_2) \le J(\bar{w}_1, \lambda_2)$ (why?).
2. Show that the training error for $\bar{w}_1$ is at most that of $\bar{w}_2$. In other words, show that
\[
\sum_{i=1}^{n} \left(y_i - \bar{w}_1 \cdot \bar{X}_i\right)^2 \le \sum_{i=1}^{n} \left(y_i - \bar{w}_2 \cdot \bar{X}_i\right)^2.
\]
Hint: Use the first part of the problem.
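This is not a substitute for the proofs, but you can see both claims numerically. The sketch below (toy data and $\lambda$ values are made-up assumptions) computes the ridge minimizer for increasing $\lambda$ via the closed-form expression mentioned in class and prints $\|\bar{w}\|_2^2$ and the training error, which should be non-increasing and non-decreasing, respectively.
\begin{verbatim}
import numpy as np

# Toy regression data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.2 * rng.normal(size=50)

for lam in [0.1, 1.0, 10.0, 100.0]:
    # Closed-form ridge minimizer: solve (X^T X + lam * I) w = X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    train_err = np.sum((y - X @ w) ** 2)
    print(f"lambda={lam:7.1f}  ||w||^2={np.sum(w**2):9.4f}  train error={train_err:9.4f}")
\end{verbatim}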
Problem 4 (25 points) Linear and Quadratic Regression. Please refer to the Jupyter Notebook in the assignment and complete the coding part in it! You can use the sklearn regression package:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
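For reference, a minimal usage sketch of the linked Ridge estimator (the data and the alpha value are made-up assumptions; sklearn's alpha plays the role of $\lambda$, and fit_intercept=False matches the $w^0 = 0$ convention above):
\begin{verbatim}
import numpy as np
from sklearn.linear_model import Ridge

# Made-up data just to show the API.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# alpha is the regularizer lambda; fit_intercept=False keeps the hyperplane through the origin.
model = Ridge(alpha=1.0, fit_intercept=False)
model.fit(X, y)
print(model.coef_)            # learned weight vector
print(model.predict(X[:5]))   # predictions for the first five points
\end{verbatim}
If the notebook's quadratic part does not already expand the features for you, sklearn.preprocessing.PolynomialFeatures is one common way to do it.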