Description
Problem 1 (Linear regression; 20 points). In this problem, we’ll consider an optimization formulation for linear regression. This is a supervised learning problem where the output space is Y = R
(rather than a discrete space like {0, 1}). Instead of classification error, the goal here is to learn a
function f : R
d → R so as to minimize the expected squared error with respect to a distribution P
over R
d × R: E[(f(X) − Y )
2
] (the expectation is taken with respect to the random couple (X, Y )
which has distribution P). We consider the case where f is a linear function represented by a weight
vector w ∈ R
d
. Let S be an i.i.d. sample from P; consider the following optimization problem:
min
w∈Rd
λ
2
kwk
2
2 +
1
|S|
X
(x,y)∈S
hw, xi − y
2
.
(Above, λ > 0 is assumed to be a positive scalar.)
(a) Prove that the Hessian of the objective function, at any point w ∈ R, is positive definite.
Explain why this establishes that the optimization problem is convex.
(b) Derive a gradient descent algorithm for solving the optimization problem (following the recipe
from lecture). Give concise and unambiguous pseudocode for the algorithm, and be explicit
about how to compute gradients. You may use vector addition, scaling, and inner products
as primitive operations (in addition to usual arithmetic operations). Furthermore, assume
the initial solution, step sizes, and number of iterations are provided as inputs.
(c) Suppose we add the following constraints to the optimization problem:
w
2
i ≤ 1 for all i = 1, 2, . . . , d .
Is the optimization problem still convex? Briefly explain why or why not.
(d) Same as (c), but with the following constraints instead (assuming d is even):
w2i−1 = 1 − w2i for all i = 1, 2, . . . , d/2 .
(Hint: can you write an equality constraint as a pair of inequality constraints?)
(e) Same as (c), but with the following constraints instead:
w
2
i = 1 for all i = 1, 2, . . . , d .
Problem 2 (Logistic regression; 30 points). Let S be a training data set from R
d × {0, 1} for a
binary classification problem. Consider the following optimization problem for computing the MLE
of the logistic regression parameters:
min
β0∈R,β∈Rd
1
n
Xn
i=1
ln(1 + exp(β0 + hβ, xii)) − yi(β0 + hβ, xii)
.
(a) Derive a gradient descent algorithm for solving the optimization problem. Give concise and
unambiguous pseudocode for your algorithm, and be explicit about how to compute gradients.
You may use vector addition, scaling, and inner products as primitive operations (in addition
to usual arithmetic operations); and natural logarithm (ln) and exponential (exp) functions
as subroutines. Furthermore, assume the initial solution, step sizes, and number of iterations
are provided as inputs.
(b) Implement the gradient descent algorithm from part (a), except use β0 = 0 and β = 0 as the
initial solution, choose the step sizes using a line search, and use as many iterations as are
required to achieve a prescribed objective value. Run your gradient descent code on the data
set hw4data.mat from Courseworks (which has training features vectors and labels stored
as data and labels, respectively). Roughly how many iterations are needed to achieve an
objective value that is at most 0.65064?1
(c) The feature vectors in the data set from hw4data.mat are three-dimensional, so they are
relatively easy to inspect. Investigate the data by plotting it and/or computing some statistics
about the features. Do you notice anything peculiar about the features? Use what you
discover to design a simple invertible linear transformation of the feature vectors xi
7→ Axi
such that running your gradient descent code with this transformed data set {(Axi
, yi)}
n
i=1
reaches an objective value of 0.65064 in (many) fewer iterations.
Describe the steps you took in this investigation, and your reasoning behind them, as well as
your chosen linear transformation (represented as a 3×3 matrix). State roughly how many
iterations were required to achieve this stated objective value.
(d) Modify your gradient descent code in the following way.
• Use only the first b0.8nc examples to define the objective function; keep the remaining
n − b0.8nc examples as a hold-out set.2
• The stopping condition is as follows. After every power-of-two iterations of gradient
descent, record the hold-out error rate for the linear classifier based on the current
(β0, β). If this hold-out error rate is more than 0.99 times that of the best hold-out error
rate previously computed, and the number of iterations executed is at least 32 (which is
somewhat of an arbitrary number), then stop.
Run this modified gradient descent code on the original hw4data.mat data, as well as the
linearly transformed data (from part (c)). In each case, report: (1) the number of iterations
executed, (2) the final objective value, and (3) the final hold-out error rate.
Please also submit your gradient descent source code (both versions) in separate files.
1The actual minimum value is less than 0.65064.
2Normally you would not simply select the first b0.8nc examples, but rather pick a random subset of b0.8nc
examples. But here, the order of the examples has already been randomized.
2