Description
1 Fundamentals
The purpose of this question is to give you practice using the mathematical and coding notation that we will
adopt in the course.
1.1 Matrix Notation
For this question we’ll use the following Householder-like notation:
1. α is a scalar.
2. w, a, and b are d by 1 column-vectors.
1
3. y and v are n by 1 column-vectors (with elements y
i and vi).
4. A is a d by d matrix, not necessarily symmetric (with elements aij ).
5. V is a diagonal matrix with v along the diagonal.
6. B is a diagonal matrix with b along the diagonal.
7. X is a n by d matrix (with rows (x
i
)
T
).
Express the gradient ∇f(w) and Hessian ∇2f(w) of the following functions in matrix notation, simplifying
as much as possible.
1. The linear function
f(w) = w
T
a + α +
X
d
j=1
wjaj .
2. The linear function
f(w) = a
T w + a
T Aw + w
T A
T
b.
3. The quadratic function
f(w) = w
T w + w
T XT Xw +
X
d
i=1
X
d
j=1
wiwjaij .
4. L2-regularized weighted least squares,
f(w) = 1
2
Xn
i=1
vi(w
T x
i − y
i
)
2 +
λ
2
kwk
2
.
5. Weighted L2-regularized probit regression,
f(w) = −
Xn
i=1
log p(y
i
|x
iw) + 1
2
X
d
j=1
bjw
2
j
.
where y
i ∈ {−1, +1} and the likelihood of a single example i is given by
p(y
i
|x
i
, w) = Φ(y
iw
T x
i
).
where Φ is the cumulative distribution function (CDF) of the standard normal distribution.
Hint: You can use the results we showed in class to simplify the derivations. You can use 0 to represent
the zero vector or a matrix of zeroes and I to denote the identity matrix. It will help to convert the fourth
example to matrix notation first. For the fifth example, it is useful to define a vector c containing the CDF
Φ(y
iw
T x
i
) as element ci and a vector p containing the corresponding PDF as element pi
. For the fifth one
you’ll need to define new vectors to express the gradient and Hessian in matrix notation (and remember
the relationship between the PDF and CDF). As a sanity check, make sure that your results have the right
dimension.
1.2 Regularization and Cross-Validation
Download a1.zip from the course webpage, and start Matlab in a directory containing the extracted files. If
you run the script example nonLinear, it will:
2
1. Load a one-dimensional regression dataset.
2. Fit a least-squares linear regression model.
3. Report the test error.
4. Draw a figure showing the training/testing data and what the model looks like.
Unfortunately, this is not a great model of the data, and the figure shows that a linear model is probably
not suitable.
1. Write a function called leastSquaresRBFL2 that implements least squares using Gaussian radial basis
functions (RBFs) and L2-regularization.
You should start from the leastSquares function and use the same conventions: n refers to the number
of training examples, d refers to the number of features, X refers to the data matrix, y refers to the
targets, Z refers to the data matrix after the change of basis, and so on. Note that you’ll have to
add two additional input arguments (λ for the regularization parameter and σ for the Gaussian RBF
variance) compared to the leastSquares function. To make your code easier to understand/debug, you
may want to define a new function rbfBasis which computes the Gaussian RBFs for a given training
set, testing set, and σ value. Hand in your function and the plot generated with λ = 1 and σ = 1.
2. When dealing with larger datasets, an important issue is the dependence of the computational cost on
the number of training examples n and the number of features d. What is the cost in big-O notation of
training the model on n training examples with d features under (a) the linear basis, and (b) Gaussian
RBFs (for a fixed σ)? What is the cost of classifying t new examples under these two bases? Assume
that multiplication by an n by d matrix costs O(nd) and that inverting a d by d linear system costs
O(d
3
).
3. Modify the training/validation procedure to use 10-fold cross-validation on the training set to select λ
and σ. Hand in your cross-validation procedure and the plot you obtain with the best values of λ and
σ
Note: If you find that calculating the Euclidean distances between all pairs of points takes too long, the
following code will form a matrix containing the squared Euclidean distances between all training and test
points:
[n,d] = size(X);
[t,d] = size(Xtest);
D = X.^2*ones(d,t) + ones(n,d)*(Xtest’).^2 – 2*X*Xtest’;
Element D(i, j) gives the squared Euclidean distance between training point i and testing point j.
1.3 MAP Estimation
In class, we showed that under the assumptions
y
i ∼ N (w
T x
i
, 1), wj ∼ N
0,
1
λ
,
the MAP estimate is equivalent to solving the L2-regularized least squares problem
f(w) = 1
2
kXw − yk
2 +
λ
2
kwk
2
,
in the “loss plus regularizer” framework. For each of the alternate assumptions below, write it in the “loss
plus regularizer” framework (simplifying as much as possible, including converting to matrix notation):
3
1. Laplace distribution likelihoods and priors,
y
i ∼ L(w
T x
i
, 1), wj ∼ L
0,
1
λ
.
2. Gaussians with separate variance for each training example and variable,
y
i ∼ N (w
T x
i
, σ2
i
), wj ∼ N
0,
1
λj
.
3. Poisson-distributed likelihood (for the case where y
i
represents discrete counts) and Gaussian prior,
y
i ∼ P(exp(w
T x
i
)), wj ∼ N
0,
1
λ
,
2 Convex Functions
2.1 Minimizing Strictly-Convex Quadratic Functions
Solve for the minimizer of the below strictly-convex quadratic functions:
1. f(w) = 1
2
kw − vk
2
(projection of v onto real space).
2. f(w) = 1
2
kXw − yk
2 +
1
2w
TΛw (least squares with quadratic-norm regularization).
3. f(w) = 1
2
Pn
i=1 vi(w
T xi − yi)
2 +
λ
2
kw − w
0k
2
(weighted least squares shrunk towards non-zero w
0
).
Above we assume that v and w
0 are d by 1 vectors, and Λ is a d by d positive-definite matrix. You can use
V as a diagonal matrix containing the vi values along the diagonal.
2.2 Proving Convexity
Show that the following functions are convex using one of the definitions of convexity (without using the
“operations that preserve convexity” or using convexity results stated in class):
1. Negative logarithm f(w) = − log(aw) w > 0
2. Quadratic with positive semi-definite A f(w) = 1
2w
T Aw + b
T w + γ w ∈ R
d
, A 0
3. Any norm f(w) = kwkp w ∈ R
d
4. Logistic regression f(w) = Pn
i=1 log(1 + exp(−yiw
T xi)) w ∈ R
d
Hint: Norms are not differentiable in general, so you cannot use the Hessian for the third one. For the last
one, you can use the Hessian structure we derived in class.
Use the results above and from class, along with the operations that preserve convexity, to show that the
following functions are convex (with λ ≥ 0):
5. f(w) = kXw − ykp + λkAwkq (regularized regression with arbitrary p-norm and weighted q-norm)
6. f(w) = PN
i=1 max{0, |w
T xi − yi
| − } +
λ
2
kwk
2
2
(support vector regression)
7. f(x) = maxijk{|xi
| + |xj | + |xk|} (3 largest-magnitude elements)
4
2.3 Robust Regression
The script example outliers loads a one-dimensional regression dataset that has a non-trivial number of
‘outlier’ data points. These points do not fit the general trend of the rest of the data, and pull the least
squares model away from the main cluster of points. One way to improve the performance in this setting is
simply to remove or downweight the outliers. However, in high-dimensions it may be difficult to determine
whether points are indeed outliers (or the errors might simply be heavy-tailed). In such cases, it is preferable
to replace the squared error with an error that is more robust to outliers.
1. Write a new function, robustRegression(X,y), that adds a bias variable and fits a linear regression model
by minimizing the absolute error instead of the square error,
f(w) = kXw − yk1.
You should turn this into a linear program as shown in class, and you can solve this linear program
using Matlab’s linprog function. Hand in the new function and report the minimum absolute training
error that is possible on this dataset.
2. There have been several attempts to adapt SVMs to the regression problem. The most common method
for support vector regression uses what iscalled the -insensitive loss,
f(w) = Xn
i=1
max{0, |w
T x
i − y
i
| − }.
Here, is a parameter and notice that the model only penalizes errors larger than . Show how to
write minimizing this objective function as a linear program.
3. Write a new function, svRegression(X,y,epsilon), that minimizes the -insensitive objective. Hand in
the new function and report the absolute training error achieved with = 1..
3 Numerical Optimization
3.1 Gradient Descent and Newton’s Method
The function example gradient loads a simple binary classification dataset, and fits an `2-regularized logistic
regression model to it using the findMin function which implements a simple gradient descent algorithm.
The findMin function is generic in the sense that it only needs an anonymous function which computes
the objective value and gradient given a parameter vector. On each iteration findMin uses a backtracking
line-search to find a step-size α that satisfies the Armijo “sufficient decrease” condition. It always tries α = 1
first, and whenever the condition is not satisfied it uses “cubic Hermite polynomial” interpolation to find a
smaller value of α to try. It continues running the algorithm until a pre-specified number of iterations are
reached or until the norm of the gradient is sufficiently small. On this dataset, the method only requires
9 iterations before it satisfies its optimality condition, although it backtracks 13 times and evaluates the
function and gradient a total of 23 times.
Report the effect on performance (in terms of number of backtracking iterations and total number of iterations) of making the following changes to findMin:
1. When backtracking, replacing the cubic-Hermite interpolation with the simpler strategy of dividing α
in half (as suggested in the Boyd & Vandenberghe book).
2. Instead of resetting α to one after the line-search, set it using the Barzilai-Borwein step-size, given by
α ← −α
v
T ∇f(w)
v
T v
,
5
where v is the new gradient value minus the old gradient value.
3. Fix the step-size α to 1/L, where L is given by
L =
1
4
max{eig(XT X)} + λ,
which is the Lipschitz constant of the gradient.
4. Instead of using the gradient direction, set d to the Newton direction which is given by
d = [∇2
f(w)]−1∇f(w).
For the Newton direction, you’ll need to make a new objective function that returns the Hessian in addition
to the function and gradient, and modify findMin to use the Hessian.
3.2 Hessian-Free Newton
The dataset used in the previous question leads to an easy optimization problem. If you apply the variations
in the previous question to the larger dataset contained in rcv1 train binary.mat then the performance is
very different:
1. Without modifications, findMin doesn’t work since the objective overflows.1
2. Step-size halving avoids the overflow, but the method doesn’t reaches a reasonable accuracy even after
500 iterations.
3. With the Barzilai-Borwein step-size, it reaches a solution in a reasonable number of iterations.
4. Computing L is slower than solving the original problem because d is so large.2
5. Computing the Hessian is slower than solving the original problem because d is so large.
In Hessian-free Newton methods, we compute an approximate Newton direction d without ever forming
the Hessian. The standard way to do this is to use conjugate gradient to solve for d, which only requires
Hessian-vector products of the form ∇2f(w)v for a vector v. Note that Hessian-vector products can always
be computed at a similar cost to computing the gradient. For example, for logistic regression we can use
∇2
f(w)v = XT DXv = XT
(D(Xv)),
and the order of operations leads to a cost of O(nd). This is cheaper than the O(nd2
) cost of forming
the Hessian. Use Matlab’s pcg function to implement a “Hessian-free” Newton’s method, where you use
conjugate gradient to solve the Newton system. Report the output of findMin on rcv1 train binary.mat
when using this strategy and using optTol as the tolerance for pcg.
Hint: In logisticL2, define a function that computes the Hessian-vector product and has the following header:
function [Hv] = Hvfunc(w,v,X,y,lambda)
To define the AFUN argument needed by pcg (which must be a function of v for a fixed w) you can use an
anonymous function:
Hv = @(v)Hvfunc(w,v,varargin{:});
1There are two standard solutions to this problem: we could detect the overflow and backtrack out of regions where it
overflows, or we could use the log-sum-exp trick to evaluate the objective without overflowing.
2An often-faster way to compute the largest eigenvalue is the “power method”.
6
3.3 Multi-Class Logistic Regression
The function example multiClass loads a multi-class classification dataset and fits a “one-vs-all” logistic
regression classifier, then reports the validation error and shows a plot of the data/classifier. The performance
on the validation set is ok, but could be much better. For example, this classifier never even predicts some
of the classes.
Using a one-vs-all classifier hurts performance because the classifiers are fit independently, so there is no
attempt to calibrate the columns of the matrix W. An alternative to this independent model is to use the
softmax probability,
p(y
i
|W, xi
) =
exp(w
T
yix
i
)
Pk
c=1 exp(wT
c x
i)
.
Here c is a possible label and wc is column c of W. Similarly, y
i
is the training label, wyi is column y
i of
W. The loss function corresponding to the negative logarithm of the softmax probability is given by
f(W) = Xn
i=1 ”
−w
T
yix
i + log X
k
c
0=1
exp(w
T
c
0x
i
)
!# .
Make a new function, softmaxClassifier, which fits W using the softmax loss from the previous section instead
of fitting k independent classifiers. Hand in the code and report the validation error.
Hint: you can use the derivativeCheck function to help you debug the gradient of the softmax loss. It can
also help you numerically check your answer to several more of this assignment’s questions.
7