## Description

1 Fundamentals

The purpose of this question is to give you practice using the mathematical and coding notation that we will

adopt in the course.

1.1 Matrix Notation

For this question we’ll use the following Householder-like notation:

1. α is a scalar.

2. w, a, and b are d by 1 column-vectors.

1

3. y and v are n by 1 column-vectors (with elements y

i and vi).

4. A is a d by d matrix, not necessarily symmetric (with elements aij ).

5. V is a diagonal matrix with v along the diagonal.

6. B is a diagonal matrix with b along the diagonal.

7. X is a n by d matrix (with rows (x

i

)

T

).

Express the gradient ∇f(w) and Hessian ∇2f(w) of the following functions in matrix notation, simplifying

as much as possible.

1. The linear function

f(w) = w

T

a + α +

X

d

j=1

wjaj .

2. The linear function

f(w) = a

T w + a

T Aw + w

T A

T

b.

3. The quadratic function

f(w) = w

T w + w

T XT Xw +

X

d

i=1

X

d

j=1

wiwjaij .

4. L2-regularized weighted least squares,

f(w) = 1

2

Xn

i=1

vi(w

T x

i − y

i

)

2 +

λ

2

kwk

2

.

5. Weighted L2-regularized probit regression,

f(w) = −

Xn

i=1

log p(y

i

|x

iw) + 1

2

X

d

j=1

bjw

2

j

.

where y

i ∈ {−1, +1} and the likelihood of a single example i is given by

p(y

i

|x

i

, w) = Φ(y

iw

T x

i

).

where Φ is the cumulative distribution function (CDF) of the standard normal distribution.

Hint: You can use the results we showed in class to simplify the derivations. You can use 0 to represent

the zero vector or a matrix of zeroes and I to denote the identity matrix. It will help to convert the fourth

example to matrix notation first. For the fifth example, it is useful to define a vector c containing the CDF

Φ(y

iw

T x

i

) as element ci and a vector p containing the corresponding PDF as element pi

. For the fifth one

you’ll need to define new vectors to express the gradient and Hessian in matrix notation (and remember

the relationship between the PDF and CDF). As a sanity check, make sure that your results have the right

dimension.

1.2 Regularization and Cross-Validation

Download a1.zip from the course webpage, and start Matlab in a directory containing the extracted files. If

you run the script example nonLinear, it will:

2

1. Load a one-dimensional regression dataset.

2. Fit a least-squares linear regression model.

3. Report the test error.

4. Draw a figure showing the training/testing data and what the model looks like.

Unfortunately, this is not a great model of the data, and the figure shows that a linear model is probably

not suitable.

1. Write a function called leastSquaresRBFL2 that implements least squares using Gaussian radial basis

functions (RBFs) and L2-regularization.

You should start from the leastSquares function and use the same conventions: n refers to the number

of training examples, d refers to the number of features, X refers to the data matrix, y refers to the

targets, Z refers to the data matrix after the change of basis, and so on. Note that you’ll have to

add two additional input arguments (λ for the regularization parameter and σ for the Gaussian RBF

variance) compared to the leastSquares function. To make your code easier to understand/debug, you

may want to define a new function rbfBasis which computes the Gaussian RBFs for a given training

set, testing set, and σ value. Hand in your function and the plot generated with λ = 1 and σ = 1.

2. When dealing with larger datasets, an important issue is the dependence of the computational cost on

the number of training examples n and the number of features d. What is the cost in big-O notation of

training the model on n training examples with d features under (a) the linear basis, and (b) Gaussian

RBFs (for a fixed σ)? What is the cost of classifying t new examples under these two bases? Assume

that multiplication by an n by d matrix costs O(nd) and that inverting a d by d linear system costs

O(d

3

).

3. Modify the training/validation procedure to use 10-fold cross-validation on the training set to select λ

and σ. Hand in your cross-validation procedure and the plot you obtain with the best values of λ and

σ

Note: If you find that calculating the Euclidean distances between all pairs of points takes too long, the

following code will form a matrix containing the squared Euclidean distances between all training and test

points:

[n,d] = size(X);

[t,d] = size(Xtest);

D = X.^2*ones(d,t) + ones(n,d)*(Xtest’).^2 – 2*X*Xtest’;

Element D(i, j) gives the squared Euclidean distance between training point i and testing point j.

1.3 MAP Estimation

In class, we showed that under the assumptions

y

i ∼ N (w

T x

i

, 1), wj ∼ N

0,

1

λ

,

the MAP estimate is equivalent to solving the L2-regularized least squares problem

f(w) = 1

2

kXw − yk

2 +

λ

2

kwk

2

,

in the “loss plus regularizer” framework. For each of the alternate assumptions below, write it in the “loss

plus regularizer” framework (simplifying as much as possible, including converting to matrix notation):

3

1. Laplace distribution likelihoods and priors,

y

i ∼ L(w

T x

i

, 1), wj ∼ L

0,

1

λ

.

2. Gaussians with separate variance for each training example and variable,

y

i ∼ N (w

T x

i

, σ2

i

), wj ∼ N

0,

1

λj

.

3. Poisson-distributed likelihood (for the case where y

i

represents discrete counts) and Gaussian prior,

y

i ∼ P(exp(w

T x

i

)), wj ∼ N

0,

1

λ

,

2 Convex Functions

2.1 Minimizing Strictly-Convex Quadratic Functions

Solve for the minimizer of the below strictly-convex quadratic functions:

1. f(w) = 1

2

kw − vk

2

(projection of v onto real space).

2. f(w) = 1

2

kXw − yk

2 +

1

2w

TΛw (least squares with quadratic-norm regularization).

3. f(w) = 1

2

Pn

i=1 vi(w

T xi − yi)

2 +

λ

2

kw − w

0k

2

(weighted least squares shrunk towards non-zero w

0

).

Above we assume that v and w

0 are d by 1 vectors, and Λ is a d by d positive-definite matrix. You can use

V as a diagonal matrix containing the vi values along the diagonal.

2.2 Proving Convexity

Show that the following functions are convex using one of the definitions of convexity (without using the

“operations that preserve convexity” or using convexity results stated in class):

1. Negative logarithm f(w) = − log(aw) w > 0

2. Quadratic with positive semi-definite A f(w) = 1

2w

T Aw + b

T w + γ w ∈ R

d

, A 0

3. Any norm f(w) = kwkp w ∈ R

d

4. Logistic regression f(w) = Pn

i=1 log(1 + exp(−yiw

T xi)) w ∈ R

d

Hint: Norms are not differentiable in general, so you cannot use the Hessian for the third one. For the last

one, you can use the Hessian structure we derived in class.

Use the results above and from class, along with the operations that preserve convexity, to show that the

following functions are convex (with λ ≥ 0):

5. f(w) = kXw − ykp + λkAwkq (regularized regression with arbitrary p-norm and weighted q-norm)

6. f(w) = PN

i=1 max{0, |w

T xi − yi

| − } +

λ

2

kwk

2

2

(support vector regression)

7. f(x) = maxijk{|xi

| + |xj | + |xk|} (3 largest-magnitude elements)

4

2.3 Robust Regression

The script example outliers loads a one-dimensional regression dataset that has a non-trivial number of

‘outlier’ data points. These points do not fit the general trend of the rest of the data, and pull the least

squares model away from the main cluster of points. One way to improve the performance in this setting is

simply to remove or downweight the outliers. However, in high-dimensions it may be difficult to determine

whether points are indeed outliers (or the errors might simply be heavy-tailed). In such cases, it is preferable

to replace the squared error with an error that is more robust to outliers.

1. Write a new function, robustRegression(X,y), that adds a bias variable and fits a linear regression model

by minimizing the absolute error instead of the square error,

f(w) = kXw − yk1.

You should turn this into a linear program as shown in class, and you can solve this linear program

using Matlab’s linprog function. Hand in the new function and report the minimum absolute training

error that is possible on this dataset.

2. There have been several attempts to adapt SVMs to the regression problem. The most common method

for support vector regression uses what iscalled the -insensitive loss,

f(w) = Xn

i=1

max{0, |w

T x

i − y

i

| − }.

Here, is a parameter and notice that the model only penalizes errors larger than . Show how to

write minimizing this objective function as a linear program.

3. Write a new function, svRegression(X,y,epsilon), that minimizes the -insensitive objective. Hand in

the new function and report the absolute training error achieved with = 1..

3 Numerical Optimization

3.1 Gradient Descent and Newton’s Method

The function example gradient loads a simple binary classification dataset, and fits an `2-regularized logistic

regression model to it using the findMin function which implements a simple gradient descent algorithm.

The findMin function is generic in the sense that it only needs an anonymous function which computes

the objective value and gradient given a parameter vector. On each iteration findMin uses a backtracking

line-search to find a step-size α that satisfies the Armijo “sufficient decrease” condition. It always tries α = 1

first, and whenever the condition is not satisfied it uses “cubic Hermite polynomial” interpolation to find a

smaller value of α to try. It continues running the algorithm until a pre-specified number of iterations are

reached or until the norm of the gradient is sufficiently small. On this dataset, the method only requires

9 iterations before it satisfies its optimality condition, although it backtracks 13 times and evaluates the

function and gradient a total of 23 times.

Report the effect on performance (in terms of number of backtracking iterations and total number of iterations) of making the following changes to findMin:

1. When backtracking, replacing the cubic-Hermite interpolation with the simpler strategy of dividing α

in half (as suggested in the Boyd & Vandenberghe book).

2. Instead of resetting α to one after the line-search, set it using the Barzilai-Borwein step-size, given by

α ← −α

v

T ∇f(w)

v

T v

,

5

where v is the new gradient value minus the old gradient value.

3. Fix the step-size α to 1/L, where L is given by

L =

1

4

max{eig(XT X)} + λ,

which is the Lipschitz constant of the gradient.

4. Instead of using the gradient direction, set d to the Newton direction which is given by

d = [∇2

f(w)]−1∇f(w).

For the Newton direction, you’ll need to make a new objective function that returns the Hessian in addition

to the function and gradient, and modify findMin to use the Hessian.

3.2 Hessian-Free Newton

The dataset used in the previous question leads to an easy optimization problem. If you apply the variations

in the previous question to the larger dataset contained in rcv1 train binary.mat then the performance is

very different:

1. Without modifications, findMin doesn’t work since the objective overflows.1

2. Step-size halving avoids the overflow, but the method doesn’t reaches a reasonable accuracy even after

500 iterations.

3. With the Barzilai-Borwein step-size, it reaches a solution in a reasonable number of iterations.

4. Computing L is slower than solving the original problem because d is so large.2

5. Computing the Hessian is slower than solving the original problem because d is so large.

In Hessian-free Newton methods, we compute an approximate Newton direction d without ever forming

the Hessian. The standard way to do this is to use conjugate gradient to solve for d, which only requires

Hessian-vector products of the form ∇2f(w)v for a vector v. Note that Hessian-vector products can always

be computed at a similar cost to computing the gradient. For example, for logistic regression we can use

∇2

f(w)v = XT DXv = XT

(D(Xv)),

and the order of operations leads to a cost of O(nd). This is cheaper than the O(nd2

) cost of forming

the Hessian. Use Matlab’s pcg function to implement a “Hessian-free” Newton’s method, where you use

conjugate gradient to solve the Newton system. Report the output of findMin on rcv1 train binary.mat

when using this strategy and using optTol as the tolerance for pcg.

Hint: In logisticL2, define a function that computes the Hessian-vector product and has the following header:

function [Hv] = Hvfunc(w,v,X,y,lambda)

To define the AFUN argument needed by pcg (which must be a function of v for a fixed w) you can use an

anonymous function:

Hv = @(v)Hvfunc(w,v,varargin{:});

1There are two standard solutions to this problem: we could detect the overflow and backtrack out of regions where it

overflows, or we could use the log-sum-exp trick to evaluate the objective without overflowing.

2An often-faster way to compute the largest eigenvalue is the “power method”.

6

3.3 Multi-Class Logistic Regression

The function example multiClass loads a multi-class classification dataset and fits a “one-vs-all” logistic

regression classifier, then reports the validation error and shows a plot of the data/classifier. The performance

on the validation set is ok, but could be much better. For example, this classifier never even predicts some

of the classes.

Using a one-vs-all classifier hurts performance because the classifiers are fit independently, so there is no

attempt to calibrate the columns of the matrix W. An alternative to this independent model is to use the

softmax probability,

p(y

i

|W, xi

) =

exp(w

T

yix

i

)

Pk

c=1 exp(wT

c x

i)

.

Here c is a possible label and wc is column c of W. Similarly, y

i

is the training label, wyi is column y

i of

W. The loss function corresponding to the negative logarithm of the softmax probability is given by

f(W) = Xn

i=1 ”

−w

T

yix

i + log X

k

c

0=1

exp(w

T

c

0x

i

)

!# .

Make a new function, softmaxClassifier, which fits W using the softmax loss from the previous section instead

of fitting k independent classifiers. Hand in the code and report the validation error.

Hint: you can use the derivativeCheck function to help you debug the gradient of the softmax loss. It can

also help you numerically check your answer to several more of this assignment’s questions.

7