Description
1. Logistic Regression (20 pts)
In this question, we will study logistic regression, which is a popular method in machine
learning. We let $w \in \mathbb{R}^p$ be our decision variable; $w$ represents weights for the columns of the
$n \times p$ data matrix $X$, where the $i$-th row of $X$ is $x_i^T$. Let
\[
\sigma(a) \triangleq \frac{1}{1 + e^{-a}} = \frac{e^a}{1 + e^a}
\]
be the sigmoid/logistic function. This function is non-convex. However, $-\log(\sigma(a))$ is convex,
which arises in maximum likelihood estimation as we describe next.
We assume that we collect data $x_i$ and a binary response variable $y_i$, which is the label,
where
\[
y_i = \begin{cases} +1 & \text{with probability } \sigma(w^T x_i) \\ -1 & \text{with probability } 1 - \sigma(w^T x_i) \end{cases} \tag{1}
\]
We can write the above in the compact form $p(y_i \mid w, x_i) = \sigma(y_i w^T x_i)$ since $\sigma(a) = 1 - \sigma(-a)$.
If we collect the observations $\{y_i\}_{i=1}^n$, then the probability of observing this outcome
is $p(y \mid w, X) = \prod_{i=1}^n p(y_i \mid w, x_i)$, and so the negative log-likelihood, which we will use as our
objective function, is
\[
\ell(w) \triangleq -\log(p(y \mid w, X)) = \sum_{i=1}^n \log\left(1 + e^{-y_i w^T x_i}\right).
\]
Note that minimizing the negative log-likelihood is the same as maximizing the likelihood; this
corresponds to maximum likelihood estimation of the parameter $w$. Once $w$ is identified, we
can use (1) to infer the label of a test data point $x$.
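For reference, a minimal numpy sketch of evaluating this objective numerically is shown below; the arrays X, y, w and the synthetic values are placeholders, not part of the assignment:

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    # ell(w) = sum_i log(1 + exp(-y_i w^T x_i)), computed stably
    # via np.logaddexp(0, z) = log(1 + e^z)
    z = -y * (X @ w)
    return np.sum(np.logaddexp(0.0, z))

# tiny synthetic example (placeholder data)
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # n = 5 samples, p = 3 features
y = np.array([1, -1, 1, -1, 1])   # labels in {+1, -1}
w = np.zeros(3)
print(neg_log_likelihood(w, X, y))  # equals 5 * log(2) at w = 0
```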
(a) Derive the gradient $\nabla \ell(w)$.
(b) Derive the Hessian $\nabla^2 \ell(w)$.
(c) Is the cost function $\ell(w)$ convex?
2. Logistic Regression for Spam E-mail Classification (20 pts)
Download the email spam data set, which is available at https://github.com/probml/pmtk3/tree/master/data/spamData
in Matlab and plain text formats. This set consists of
4601 email messages, from which 57 features have been extracted. These are as follows:
• 48 features, in [0, 100], giving the percentage of words in a given message which match a
given word on the list. The list contains words such as “business”, “free”, “george”, etc.
• 6 features, in [0, 100], giving the percentage of characters in the email that match a given
character on the list. The characters are ; ( [ ! $ #
• Feature 55: The average length of an uninterrupted sequence of capital letters (max is
40.3, mean is 4.9)
• Feature 56: The length of the longest uninterrupted sequence of capital letters ( max is
45, mean is 52.6)
• Feature 57: The sum of the lengths of uninterrupted sequence of capital letters (max is
25.6, mean is 282.2)
Load the data set using the provided link. In the Matlab version, there is a training set
of size 3065 and a test set of size 1536. In the plain text version, there is no predefined
training/testing split; you can randomly shuffle the data to pick a training set of size 3065 and
use the rest for testing (or you can use the Matlab data along with scipy.io.loadmat).
There are different methods to pre-process the data, e.g., standardizing the columns of $X$.
For this problem, transform the features using $\log(x_{ij} + 0.1)$. One could also add some
regularization to the loss function, which can help generalization, but this is not necessary.
Also note that you need to transform the labels from $\{0, 1\}$ to $\{+1, -1\}$.
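A rough loading and preprocessing sketch in Python is given below; the field names Xtrain, ytrain, Xtest, ytest are an assumption about how the .mat file is organized, so inspect data.keys() on your copy:

```python
import numpy as np
from scipy.io import loadmat

# Field names are assumed; check data.keys() if your file differs.
data = loadmat("spamData.mat")
Xtrain, ytrain = data["Xtrain"], data["ytrain"].ravel().astype(float)
Xtest, ytest = data["Xtest"], data["ytest"].ravel().astype(float)

# Feature transform suggested above: log(x + 0.1)
Xtrain = np.log(Xtrain + 0.1)
Xtest = np.log(Xtest + 0.1)

# Map labels from {0, 1} to {-1, +1}
ytrain = 2.0 * ytrain - 1.0
ytest = 2.0 * ytest - 1.0
```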
(a) Run gradient descent with a fixed step-size. Plot the value of the cost function at each
iteration and find a reasonable step-size for fast convergence. (A sample implementation
covering parts (a)–(c) is sketched after part (c).)
(b) Repeat the previous part using gradient descent with momentum.
(c) Implement gradient descent with Armijo line search. This procedure is as follows: assume
that we are at the point $x_k$ and have a search direction $p_k$ (for gradient descent,
$p_k = -\nabla f(x_k)$). Then, the Armijo line search procedure is:
• Pick an initial step-size $t$.
• Initialize the parameters $0 < \rho < 1$ and $0 < c < 1$ (typical values are $c = 10^{-4}$ and $\rho = 0.9$).
• While $f(x_k + t p_k) > f(x_k) + c\,t\,\nabla f(x_k)^T p_k$, set $t \leftarrow \rho t$.
• Terminate the procedure once $f(x_k + t p_k) \le f(x_k) + c\,t\,\nabla f(x_k)^T p_k$.
The test statement in the while loop is the Armijo condition. If $p_k = -\nabla f(x_k)$, then the
test is accepted when $f(x_k + t p_k) \le f(x_k) - c\,t\,\|\nabla f(x_k)\|_2^2$. In general, the second term
is negative as long as $p_k$ is a descent direction. One can prove that this line search procedure
will terminate.
Find a good estimate for the initial step-size by trial and error. A simple idea is to use
the final step-size from the previous iteration, but this can be unnecessarily small. You may
want to do this, but increase the step-size by a factor of 2.
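A rough Python sketch of parts (a)–(c) follows. It assumes the gradient derived in Problem 1(a) takes the usual form $\nabla \ell(w) = -\sum_i \sigma(-y_i w^T x_i)\, y_i x_i$; the step-size, momentum, and iteration values are placeholders to be tuned:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, X, y):
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def grad(w, X, y):
    # assumes grad ell(w) = - sum_i sigmoid(-y_i w^T x_i) y_i x_i
    return -X.T @ (y * sigmoid(-y * (X @ w)))

def gd(X, y, t=1e-3, beta=0.0, iters=500):
    """Parts (a)-(b): fixed step-size t; beta > 0 adds momentum."""
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    history = []
    for _ in range(iters):
        v = beta * v - t * grad(w, X, y)
        w = w + v
        history.append(loss(w, X, y))
    return w, history

def gd_armijo(X, y, t0=1.0, rho=0.9, c=1e-4, iters=200):
    """Part (c): gradient descent with Armijo backtracking line search."""
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        g = grad(w, X, y)
        p = -g                                  # descent direction
        t = t0
        fw = loss(w, X, y)
        while loss(w + t * p, X, y) > fw + c * t * (g @ p):
            t = rho * t                         # backtrack until Armijo holds
        w = w + t * p
        history.append(loss(w, X, y))
    return w, history
```

The returned history lists can be passed directly to matplotlib to produce the requested cost-versus-iteration plots.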
3. Newton’s Method is Affine Invariant (20 pts)
In this question, we will prove the affine invariance of Newton’s method. Let $f : \mathbb{R}^n \to \mathbb{R}$ be
a convex function. Consider an affine transform $y \mapsto Ay + b$, where $A \in \mathbb{R}^{n \times n}$ is invertible
and $b \in \mathbb{R}^n$. Define the function $g : \mathbb{R}^n \to \mathbb{R}$ by $g(y) = f(Ay + b)$. Denote by $x^{(k)}$ the $k$-th
iterate of Newton’s method performed on $f$, and by $y^{(k)}$ the $k$-th iterate of Newton’s method
performed on $g$.
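As a hint (a standard chain-rule computation, stated here without proof), the gradient and Hessian of $g$ are related to those of $f$ by
\[
\nabla g(y) = A^T \nabla f(Ay + b), \qquad \nabla^2 g(y) = A^T \nabla^2 f(Ay + b)\, A.
\]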
(a) Show that if $x^{(k)} = Ay^{(k)} + b$, then $x^{(k+1)} = Ay^{(k+1)} + b$.
(b) Show that Newton’s decrement does not depend on the coordinates, i.e., show that
$\lambda(x^{(k)}) = \lambda(y^{(k)})$, where $\lambda(x) = \left(\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\right)^{1/2}$ (with the analogous
definition, in terms of $g$, for $\lambda(y^{(k)})$).
Together, these results imply that Newton’s method is affine invariant. As an important consequence,
Newton’s method cannot be improved by a change of coordinates, unlike gradient descent.
4. Newton’s Method for Convex Optimization (20 pts)
(a) Implement Newton’s method for the logistic regression problem in Problem 1. Plot
the value of the cost function at each iteration and find a reasonable step-size for fast
convergence.
(b) Implement randomized Newton’s method with a uniform sampling sketch, i.e., sample
rows of $H^{1/2}$ uniformly at random, where $H$ and $H^{1/2}$ denote the Hessian and its square
root, respectively. Plot the value of the cost function at each iteration and find a reasonable
step-size and sketch-size for fast convergence. (A sample implementation of both parts is
outlined below.)
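The following rough sketch covers both variants. It assumes the logistic-regression Hessian has the standard form $H = X^T \mathrm{diag}(s_i(1 - s_i)) X$ with $s_i = \sigma(w^T x_i)$, so that one valid square root is $H^{1/2} = \mathrm{diag}(s_i(1 - s_i))^{1/2} X$; the step-size, sketch-size, and iteration counts are placeholders:

```python
import numpy as np

# helpers repeated from the Problem 2 sketch
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, X, y):
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def grad(w, X, y):
    return -X.T @ (y * sigmoid(-y * (X @ w)))

def newton(X, y, t=1.0, iters=20):
    """Part (a): damped Newton's method with step-size t."""
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        s = sigmoid(X @ w)
        d = s * (1.0 - s)
        H = X.T @ (d[:, None] * X)          # H = X^T diag(d) X
        w = w - t * np.linalg.solve(H, grad(w, X, y))
        history.append(loss(w, X, y))
    return w, history

def randomized_newton(X, y, m=200, t=1.0, iters=20, seed=0):
    """Part (b): Newton step with a uniform row-sampling sketch of H^{1/2}."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        s = sigmoid(X @ w)
        d = s * (1.0 - s)
        Hs = np.sqrt(d)[:, None] * X        # square root: Hs^T Hs = H
        idx = rng.integers(0, n, size=m)    # uniform sampling sketch (size m)
        SHs = np.sqrt(n / m) * Hs[idx]      # sampled rows, rescaled
        H_approx = SHs.T @ SHs              # sketched Hessian
        w = w - t * np.linalg.solve(H_approx, grad(w, X, y))
        history.append(loss(w, X, y))
    return w, history
```

If the sketched Hessian is near-singular for small sketch sizes, adding a small multiple of the identity before solving is a common safeguard.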
5. Fast Johnson-Lindenstrauss Transform (FJLT) using Hadamard Matrices (20 pts)
(a) Construct a $128 \times 1024$ FJLT matrix as follows:
• Set $m = 128$ and $n = 1024$.
• Define $H_1 := \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$.
• Construct $H_{10} \in \mathbb{R}^{1024 \times 1024}$ recursively via $H_{k+1} = \begin{bmatrix} H_k & H_k \\ H_k & -H_k \end{bmatrix}$.
• Generate $D$ as an $n \times n$ diagonal matrix of uniformly random $\pm 1$ variables (Rademacher
distribution).
• Generate an $m \times n$ uniform sub-sampling matrix $P$ scaled by $\sqrt{n}/\sqrt{m}$ (uniform
sampling sketch).
• Form the FJLT matrix $S = \frac{1}{\sqrt{n}} P H_{10} D$.
(b) Verify that $S S^T$ is a multiple of the identity. Scale $S$ appropriately if needed to obtain
$S S^T = I$. (One possible numpy construction and check is outlined below.)
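A possible numpy construction of $S$, together with the part (b) check; sampling without replacement and the random seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 128, 1024

# Hadamard matrix H_10 built by the doubling recursion from H_1
H = np.array([[1.0, 1.0], [1.0, -1.0]])        # H_1
while H.shape[0] < n:
    H = np.block([[H, H], [H, -H]])            # H_{k+1} from H_k

# Rademacher diagonal D and scaled uniform sub-sampling matrix P
D = np.diag(rng.choice([-1.0, 1.0], size=n))
rows = rng.choice(n, size=m, replace=False)
P = np.sqrt(n / m) * np.eye(n)[rows]

# FJLT matrix
S = (1.0 / np.sqrt(n)) * P @ H @ D

# Part (b): S S^T equals (n/m) * I for this construction;
# rescaling S by sqrt(m/n) would make it exactly the identity.
print(np.allclose(S @ S.T, (n / m) * np.eye(m)))   # True
```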
(c) Generate a data matrix A of size 1024 × 10 using i.i.d. standard Gaussian variables.
Plot the singular values of $A$ and the singular values of $SA$.
(d) In part (c), $SA$ is a Johnson-Lindenstrauss embedding of 10 vectors (the 1024-dimensional
column vectors of $A$) into dimension 128. Verify that the pairwise distances are approximately
preserved, i.e., that there exists an $\epsilon > 0$ such that
\[
(1 - \epsilon)\|A_i - A_j\|_2^2 \le \|S(A_i - A_j)\|_2^2 \le (1 + \epsilon)\|A_i - A_j\|_2^2 \quad \forall i, j, \tag{2}
\]
where $A_i$ is the $i$-th column of $A$. Find $\epsilon^*$, the smallest value of $\epsilon$ that satisfies (2) for
a single realization of the random construction. Note that the matrices $D$ and $P$ are
constructed randomly, while $H$ is deterministic. What are the minimum, maximum, and
mean values of $\epsilon^*$ over 100 random realizations of the construction?
Hint: The JL embedding property in (2) specifies $2d^2$ linear inequalities of the form
$\epsilon \ge c_{ij}$, hence the smallest $\epsilon$ is $\epsilon^* = \max_{ij} c_{ij}$.
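A sketch of the $\epsilon^*$ computation for a single realization, reusing the matrix S constructed in the sketch above (A is the Gaussian data matrix from part (c)):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
A = rng.standard_normal((1024, 10))   # data matrix from part (c)
SA = S @ A                            # assumes S from the construction sketched above

ratios = []
for i, j in combinations(range(A.shape[1]), 2):
    orig = np.sum((A[:, i] - A[:, j]) ** 2)
    sketched = np.sum((SA[:, i] - SA[:, j]) ** 2)
    ratios.append(sketched / orig)
ratios = np.array(ratios)

# smallest epsilon satisfying (2) is the worst relative distortion over all pairs
eps_star = np.max(np.abs(ratios - 1.0))
print(eps_star)
```

Repeating this computation over 100 fresh draws of $D$ and $P$ gives the requested minimum, maximum, and mean of $\epsilon^*$.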
(e) Generate a data matrix $A$ of size $1024 \times 10$ and a vector $b$ of size $1024 \times 1$ using i.i.d.
standard Gaussian variables. Solve the least squares problem
\[
x^* = \arg\min_x \|Ax - b\|_2^2 .
\]
Apply the FJLT to $A$ and $b$ to obtain $SA$ and $Sb$. Solve the sketched least squares problem
\[
\tilde{x} := \arg\min_x \|SAx - Sb\|_2^2 .
\]
Find the Euclidean distance between the solutions, i.e., $\|\tilde{x} - x^*\|_2$, the distance between
their predictions, i.e., $\|A\tilde{x} - Ax^*\|_2$, and the approximation ratio $\|A\tilde{x} - b\|_2^2 / \|Ax^* - b\|_2^2$
of the objective value.
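A minimal sketch of this comparison, again reusing the matrix S from the construction above:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((1024, 10))
b = rng.standard_normal(1024)

# Exact and sketched least squares solutions
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
x_tilde = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]

print(np.linalg.norm(x_tilde - x_star))            # distance between solutions
print(np.linalg.norm(A @ x_tilde - A @ x_star))    # distance between predictions
print(np.sum((A @ x_tilde - b) ** 2) /
      np.sum((A @ x_star - b) ** 2))               # approximation ratio (>= 1)
```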