## Description

## 1 Convergence Rates

### 1.1 Gradient Descent

For minimizing a function $f$, in class we showed that if $\nabla f$ is Lipschitz continuous and $f$ is strongly-convex, then the gradient descent iterations,

$$x_{t+1} = x_t - \alpha_t \nabla f(x_t),$$

with a step-size of $\alpha_t = 1/L$ satisfy

$$f(x_t) - f(x^*) = O(\rho^t),$$

for some $\rho < 1$ (we call this a linear convergence rate). In this question you'll show some related properties.
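To make the iteration concrete, here is a minimal Python sketch of gradient descent with the fixed step-size $\alpha_t = 1/L$ on a small strongly-convex quadratic (the quadratic and all names here are illustrative assumptions, not part of the assignment code):

```python
import numpy as np

# Illustrative strongly-convex quadratic: f(x) = 0.5 x^T A x - b^T x,
# with gradient A x - b. For a quadratic, L is the largest eigenvalue
# of A and mu is the smallest.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

L = np.linalg.eigvalsh(A).max()      # Lipschitz constant of the gradient

x = np.zeros(2)
for t in range(100):
    x = x - (1.0 / L) * (A @ x - b)  # x_{t+1} = x_t - (1/L) grad f(x_t)

x_star = np.linalg.solve(A, b)       # exact minimizer satisfies A x* = b
```

For a quadratic, the iterate contraction $\|x_t - x^*\| \le (1 - \mu/L)^t\,\|x_0 - x^*\|$ can be verified directly, since $x_{t+1} - x^* = (I - A/L)(x_t - x^*)$.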

1. The rate above is in terms of the function value $f(x_t)$, but we might also be interested in the convergence of the iterates $x_t$ to $x^*$. Show that if $f$ is differentiable and strongly-convex, then a convergence rate of $O(\rho^t)$ in terms of the function values implies that the iterates have a convergence rate of $\|x_t - x^*\| = O(\rho^{t/2})$.

2. Consider using a constant step-size $\alpha_t = \alpha$ for some positive constant $\alpha < 2/L$. Show that gradient descent converges linearly under this alternate step-size (you can use the descent lemma).

3. In practice we typically don't know $L$. A common strategy in this setting is to start with some small guess $L_0$ that we know is smaller than the true $L$ (usually we take $L_0 = 1$). On each iteration $t$, we initialize with $L_t = L_{t-1}$ and we check the inequality
$$f\left(x_t - \tfrac{1}{L_t}\nabla f(x_t)\right) \le f(x_t) - \tfrac{1}{2L_t}\|\nabla f(x_t)\|^2.$$
If this is not satisfied, we double $L_t$ and test it again. This continues until we have an $L_t$ satisfying the inequality. Show that gradient descent with $\alpha_t = 1/L_t$ defined in this way has a linear convergence rate of
$$f(x_t) - f(x^*) \le \left(1 - \frac{\mu}{2L}\right)^t [f(x_0) - f(x^*)].$$
Hint: if $\nabla f$ is $L$-Lipschitz continuous, then it is also $L'$-Lipschitz continuous for any $L' \ge L$.

4. Describe a condition under which the step-sizes in the previous question would give a faster rate than $\rho = (1 - \mu/L)$.

### 1.2 Sign-Based Gradient Descent

In some situations it might be hard to accurately compute the elements of the gradient, but we might have access to the sign of the gradient. For this setting, consider a sign-based gradient descent algorithm of the form
$$x_{t+1} = x_t - \frac{\|\nabla f(x_t)\|_1}{L_\infty}\,\mathrm{sign}(\nabla f(x_t)),$$
where we define the sign function element-wise as
$$\mathrm{sign}(x_j) = \begin{cases} +1 & x_j > 0 \\ 0 & x_j = 0 \\ -1 & x_j < 0 \end{cases}.$$

Consider an $f$ that is strongly-convex and is Lipschitz continuous in the $\infty$-norm, meaning that
$$f(y) \le f(x) + \nabla f(x)^T(y - x) + \frac{L_\infty}{2}\|y - x\|_\infty^2,$$
for all $y$ and $x$ and some $L_\infty$.

1. Show that the sign-based gradient descent method satisfies
$$f(x_{t+1}) - f(x^*) \le \left(1 - \frac{\mu}{L_\infty}\right)[f(x_t) - f(x^*)].$$

2. To compare this rate to the rate of gradient descent, we need to know the relationship between the usual Lipschitz constant $L$ (in the 2-norm) and $L_\infty$. Show that the relationship between these constants is $L_\infty \le L \le dL_\infty$.

### 1.3 Block Coordinate Descent

Consider a problem where it makes sense to partition our variables into $k$ disjoint 'blocks',
$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_k \end{bmatrix},$$

each of size $d/k$. Assume that $f$ is strongly-convex and blockwise strongly-smooth,
$$\nabla^2 f(w) \succeq \mu I, \qquad \nabla^2_{bb} f(w) \preceq L I,$$
for all $w$ and all blocks $b$. Consider a block coordinate descent algorithm where we use the iteration
$$x_{t+1} = x_t - \frac{1}{L}\left(\nabla f(x_t) \circ e_{b_t}\right),$$
where $b_t$ is the block we choose on iteration $t$, $e_b$ is a vector of zeros with ones at the locations of block $b$, and $\circ$ means element-wise multiplication of the two vectors. (It's like coordinate descent, except we're updating $d/k$ variables instead of just one.)

1. Assume that we pick a random block on each iteration, $p(b_t = b) = 1/k$. Show that this method satisfies
$$\mathbb{E}[f(x_{t+1})] - f(x^*) \le \left(1 - \frac{\mu}{Lk}\right)[f(x_t) - f(x^*)].$$

2. Assume that each block $b$ has its own strong-smoothness constant $L_b$,
$$\nabla^2_{bb} f(w) \preceq L_b I,$$
so that the strong-smoothness constant from part 1 is given by $L = \max_b\{L_b\}$. Show that if we sample the blocks proportional to $L_b$, $p(b_t = b) = \frac{L_b}{\sum_{b'} L_{b'}}$, and we use a larger step-size of $1/L_{b_t}$, then we obtain a faster convergence rate provided that some $L_b \ne L$.
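The random-block iteration from part 1 can be sketched as follows on a small synthetic quadratic; the problem, block partition, and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic strongly-convex quadratic f(x) = 0.5 x^T A x - b^T x with
# d = 6 variables split into k = 3 blocks of size d/k = 2.
d, k = 6, 3
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)                    # positive definite
b = rng.standard_normal(d)
blocks = np.split(np.arange(d), k)

# Blockwise strong-smoothness constant: the largest eigenvalue over the
# diagonal blocks A[b, b].
L = max(np.linalg.eigvalsh(A[np.ix_(blk, blk)]).max() for blk in blocks)

x = np.zeros(d)
for t in range(5000):
    blk = blocks[rng.integers(k)]          # p(b_t = b) = 1/k
    g = A @ x - b                          # full gradient
    x[blk] -= (1.0 / L) * g[blk]           # update only the chosen block

x_star = np.linalg.solve(A, b)
```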

## 2 Large-Scale Algorithms

### 2.1 Coordinate Optimization

The function example logistic loads a dataset and tries to fit an L2-regularized logistic regression model using coordinate optimization. Unfortunately, if we use $L_f$ as the Lipschitz constant of $\nabla f$, the runtime of this procedure is $O(d^3 + nd^2\frac{L_f}{\mu}\log(1/\epsilon))$. This comes from spending $O(d^3)$ computing $L_f$, having an iteration cost of $O(nd)$, and requiring $O(d\frac{L_f}{\mu}\log(1/\epsilon))$ iterations. This non-ideal runtime is also reflected in practice: the algorithm's iterations are relatively slow, and even after 500 "passes" through the data it isn't particularly close to the optimal function value.

1. Modify this code so that the runtime of the algorithm is $O(nd\frac{L_c}{\mu}\log(1/\epsilon))$, where $L_c$ is the Lipschitz constant of all partial derivatives $\nabla_i f$. You can do this by modifying the iterations so they have a cost of $O(n)$ instead of $O(nd)$, and instead of using a step-size of $1/L_f$ they use a step-size of $1/L_c$ (which is given by $\frac{1}{4}\max_j\{\|x_j\|^2\} + \lambda$). Hand in your code and report the final function value and total time.

2. To further improve the performance, make a new version of the code which samples the variable to update $j_t$ proportional to the individual Lipschitz constants $L_j$ of the coordinates, and uses a step-size of $1/L_{j_t}$. You can use the function sampleDiscrete to sample from a discrete distribution given the probability mass function. Hand in your code, and report the final function value as well as the number of passes.

3. Report the number of passes the algorithm takes, as well as the final function value, if you use uniform sampling but use a step-size of $1/L_{j_t}$.

4. Suppose that when we use a step-size of $1/L_{j_t}$, we see that uniform sampling outperforms Lipschitz sampling. Why would this be consistent with the bounds we stated in class?
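One way to see the $O(n)$-per-iteration idea from part 1 is to cache the product $Xw$ and update it whenever a single coordinate of $w$ changes. This Python sketch uses synthetic data; the data and all names are assumptions for illustration, not the assignment's Matlab code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Coordinate optimization for L2-regularized logistic regression,
# f(w) = sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||^2,
# keeping Xw cached so each coordinate update costs O(n).
n, d, lam = 50, 5, 1.0
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Step-size 1/Lc with Lc = 0.25 * max_j ||x_j||^2 + lam (x_j = column j).
Lc = 0.25 * max(np.sum(X[:, j] ** 2) for j in range(d)) + lam

w = np.zeros(d)
Xw = X @ w                       # cached; maintained incrementally below
for t in range(5000):
    j = rng.integers(d)          # uniform coordinate selection
    s = -y / (1.0 + np.exp(y * Xw))         # -y_i * sigma(-y_i w^T x_i)
    g_j = X[:, j] @ s + lam * w[j]          # partial derivative, O(n)
    delta = -(1.0 / Lc) * g_j
    w[j] += delta
    Xw += delta * X[:, j]                   # O(n) cache update
```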

### 2.2 Proximal-Gradient

If you run the demo example group, it will load a dataset and fit a multi-class logistic regression (softmax) classifier. This dataset is actually linearly-separable, so there exists a set of weights $W$ that can perfectly classify the training data (though it may be difficult to find a $W$ that perfectly classifies the validation data). However, 90% of the columns of $X$ are irrelevant. Because of this issue, when you run the demo you find that the training error is 0 while the test error is something like 0.2980.

1. Write a new function, softmaxClassifierL2, that fits a multi-class logistic regression model with L2-regularization (this only involves modifying the objective function). Hand in the modified loss function and report the best validation error achievable with $\lambda = 10$ (which is the best value among powers of 10). Also report the number of non-zero parameters in the model and the number of original features that the model uses.

2. While L2-regularization reduces overfitting a bit, it still uses all the variables even though 90% of them are irrelevant. In situations like this, L1-regularization may be more suitable. Write a new function, softmaxClassifierL1, that fits a multi-class logistic regression model with L1-regularization. You can use the function proxGradL1, which minimizes the sum of a differentiable function and an L1-regularization term. Report the number of non-zero parameters in the model and the number of original features that the model uses.

3. L1-regularization achieves sparsity in the model parameters, but in this dataset it's actually the original features that are irrelevant. We can encourage sparsity in the original features by using group L1-regularization. Write a new function, proxGradGroupL1, to allow (disjoint) group L1-regularization. Use this within a new function, softmaxClassifierGL1, to fit a group L1-regularized multi-class logistic regression model (where rows of $W$ are grouped together and we use the L2-norm of the groups). Hand in both modified functions (softmaxClassifierGL1 and proxGradGroupL1) and report the validation error achieved with $\lambda = 10$. Also report the number of non-zero parameters in the model and the number of original features that the model uses.
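The two proximal operators involved above have simple closed forms: soft-thresholding for L1, and a blockwise shrinkage for (disjoint) group L1 with rows of $W$ as the groups. Here is a sketch (function names are illustrative, not the assignment's required interfaces):

```python
import numpy as np

def prox_l1(W, t):
    """prox of t * ||W||_1: shrink each entry toward zero by t."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def prox_group_l1(W, t):
    """prox of t * (sum of row-wise L2-norms): shrink each row's norm by t,
    zeroing rows whose norm is below the threshold."""
    W = W.copy()
    for i in range(W.shape[0]):
        nrm = np.linalg.norm(W[i])
        W[i] = 0.0 if nrm <= t else (1.0 - t / nrm) * W[i]
    return W
```

Within proximal-gradient, these would be applied after each gradient step with threshold $t = \alpha\lambda$, where $\alpha$ is the step-size.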

### 2.3 Stochastic Gradient

If you run the demo example stochastic, it will load a dataset and try to fit an L2-regularized logistic regression model using 10 "passes" of stochastic gradient with the step-size of $\alpha_t = 1/\lambda t$ that is suggested in many theory papers. Note that the demo is quite slow, as Matlab doesn't do well with 'for' loops, but if you implemented this in C it would be very fast even though there are 50,000 training examples.

Unfortunately, even if we ignore the Matlab slowness, the performance of this stochastic gradient method is atrocious. It often goes to areas of the parameter space where the objective function overflows, and the final value is usually in the range of something like $6.5$–$7.5 \times 10^4$. This is quite far from the solution of $2.7068 \times 10^4$, and is even worse than just choosing $w = 0$, which gives $3.5 \times 10^4$. (This is unlike gradient descent and coordinate optimization, which never increase the objective function.)

1. Although $\alpha_t = 1/\lambda t$ gives the best possible convergence rate in the worst case, in practice it's typically horrible (as we're not usually optimizing the hardest possible $\lambda$-strongly convex function). Experiment with different choices of step-size to see if you can get better performance. Report the step-size that you found gave the best performance, and the objective function value obtained by this strategy for one run.

2. Besides tuning the step-size, another strategy that often improves the performance is using a (possibly-weighted) average of the iterates $w_t$. Explore whether this strategy can improve performance. Report the performance with an averaging strategy, and the objective function value obtained by this strategy for one run. (Note that the best step-size sequence with averaging might be different than without averaging.)

3. A popular variation on stochastic gradient is AdaGrad, which uses the iteration
$$w_{t+1} = w_t - \alpha_t D_t \nabla f(x_t),$$
where the element in position $(i,i)$ of the diagonal matrix $D_t$ is given by $1/\sqrt{\delta + \sum_{k=0}^{t} (\nabla_i f(x_k))^2}$ (and we don't average the steps). Implement this algorithm and experiment with the tuning parameters $\alpha_t$ and $\delta$. Hand in your code as well as the best step-size sequence you found, and again report the performance for one run.

4. Implement the SAG algorithm with a step-size of $1/L$, where $L$ is the maximum Lipschitz constant across the training examples ($L = 0.25\max_i\{\|x_i\|^2\} + \lambda$). Hand in your code and again report the performance for one run.
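A compact sketch of the SAG update from part 4, on synthetic data (the data, names, and number of passes are illustrative assumptions; the actual exercise modifies the demo's Matlab code):

```python
import numpy as np

rng = np.random.default_rng(0)

# SAG for L2-regularized logistic regression: maintain a memory of the
# last gradient seen for each example and step in the direction of the
# average of the stored gradients.
n, d, lam = 100, 5, 1.0
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

L = 0.25 * max(np.sum(X[i] ** 2) for i in range(n)) + lam  # max over examples

w = np.zeros(d)
G = np.zeros((n, d))      # gradient memory, one slot per example
g_sum = np.zeros(d)       # running sum of the stored gradients

for t in range(20 * n):   # roughly 20 "passes"
    i = rng.integers(n)
    # gradient of f_i(w) = log(1 + exp(-y_i w^T x_i)) + (lam/2)||w||^2
    g_i = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w))) + lam * w
    g_sum += g_i - G[i]   # replace example i's slot in the running sum
    G[i] = g_i
    w -= (1.0 / L) * (g_sum / n)
```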

## 3 Kernels and Duality

### 3.1 Fenchel Duality

Recall that the Fenchel dual for the primal problem
$$P(w) = f(Xw) + g(w),$$
is the dual problem
$$D(z) = -f^*(-z) - g^*(X^T z),$$
or, if we re-parameterize in terms of $-z$:
$$D(z) = -f^*(z) - g^*(-X^T z), \qquad (1)$$
where $f^*$ and $g^*$ are the convex conjugates. Convex conjugates are discussed in Section 3.3 of Boyd and Vandenberghe (http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf). Read this, then derive the Fenchel dual for the following problems:

1. $P(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2$ (L2-regularized least squares)

2. $P(w) = \|Xw - y\|_1 + \lambda\|w\|_1$ (robust regression with L1-regularization)

3. $P(w) = \sum_{i=1}^N \log(1 + \exp(-y_i w^T x_i)) + \frac{\lambda}{2}\|w\|^2$ (regularized maximum entropy)

Hint: don't try to take the Lagrangian dual; a generic strategy to compute the special case of Fenchel duals is as follows:

- Determine $X$, $f$, and $g$ to put the problem into the primal format.
- Determine the form of $f^*$ and $g^*$ (note that $A$ here is not relevant).
- Evaluate $f^*$ at $-z$ and $g^*$ at $X^T z$ to get the final form.

For a differentiable $f$, you can often solve for the value achieving the sup inside of $f^*(v)$ by taking the gradient of $(x^T v - f(x))$ and setting it to zero (keeping in mind whether there are values of $v$ where the sup might be infinity). Example 3.26 in the book gives the convex conjugate in the case where $f$ is a norm. Section 3.3.2 of the book shows how the convex conjugate changes if you scale a function and/or compose a function with an affine transformation. For parts 1 and 2, the $X$ in the primal will just be the data matrix $X$. But for part 3, it will be easier if you define $X$ as a matrix whose row $i$ is given by $y_i x_i$. For part 3 you'll want to use $f(v) = \sum_{i=1}^n \log(1 + \exp(v_i))$, which is a separable function (meaning that you can optimize each $z_i$ independently).
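As a worked example of the recipe (this particular computation is standard, and is not one of the graded parts): for the scaled quadratic $g(w) = \frac{\lambda}{2}\|w\|^2$, the conjugate is

```latex
g^*(v) = \sup_w \left\{ v^T w - \frac{\lambda}{2}\|w\|^2 \right\}
       = \frac{1}{\lambda}\|v\|^2 - \frac{\lambda}{2}\cdot\frac{\|v\|^2}{\lambda^2}
       = \frac{1}{2\lambda}\|v\|^2,
```

where the supremum is attained at $w = v/\lambda$ (set the gradient $v - \lambda w$ to zero).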

### 3.2 Stochastic Dual Coordinate Ascent

The dual of the SVM problem,
$$P(w) = \sum_{i=1}^{N} \max\{0, 1 - y_i w^T x_i\} + \frac{\lambda}{2}\|w\|^2,$$
is
$$D(z) = e^T z - \frac{1}{2\lambda} z^T Y X X^T Y z, \quad \text{s.t. } 0 \le z_i \le 1, \; \forall i,$$
where $e$ is a vector of ones, $Y$ is a diagonal matrix with the $y_i$ values along the diagonal, and we have $w^* = \frac{1}{\lambda} X^T Y z^*$. Starting from example dual.m, implement a dual coordinate optimization strategy to optimize the SVM objective. Hand in your code, report the optimal value of the primal and dual objectives with $\lambda = 1$, and report the number of support vectors.

Hint: the objective function is a quadratic and the constraints are just lower and upper bounds. This lets you derive the optimal update for one variable with the others held fixed: solve for the value of $z_i$ with a partial derivative of zero, and if this violates the constraints then the solution must be either $z_i = 0$ or $z_i = 1$ (depending on which constraint is violated).
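The single-variable update in the hint can be sketched as follows on synthetic data (the data and all names are illustrative assumptions, not example dual.m):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dual coordinate ascent for D(z) = e^T z - (1/(2*lam)) z^T M z,
# with M = Y X X^T Y and box constraints 0 <= z_i <= 1.
n, d, lam = 20, 3, 1.0
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
M = np.outer(y, y) * (X @ X.T)          # Y X X^T Y

z = np.zeros(n)
for t in range(200 * n):
    i = rng.integers(n)
    # dD/dz_i = 1 - (1/lam) (M z)_i; set to zero with the others fixed.
    r = M[i] @ z - M[i, i] * z[i]       # contribution of the other variables
    if M[i, i] > 0:
        z[i] = (lam - r) / M[i, i]
    z[i] = min(max(z[i], 0.0), 1.0)     # project onto the box constraints

w = (X.T @ (y * z)) / lam               # recover the primal w* = (1/lam) X^T Y z*
```

Since $D$ is concave and quadratic in $z_i$, clipping the unconstrained solution to $[0,1]$ gives the exact coordinate-wise maximizer.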

### 3.3 Large-Scale Kernel Methods

The function kernelRegression.m implements kernel regression with the squared error, an L2-regularizer, and a Gaussian kernel. If you run your cross-validation code from Assignment 1 Question 1.2, you'll find that it achieves similar performance to using Gaussian RBFs.

1. Report the $\lambda$ and $\sigma$ found using cross-validation on this previous assignment question. What are the (approximate) relationships between $\lambda$ and $\sigma$ between the two models (the one with Gaussian RBFs and the one with Gaussian kernels)?

2. Implement the subset of regressors model for large-scale kernel methods we discussed in class. Hand in your code and report the qualitative performance (describe how well the model fits the data visually) for small and large values of the number of regressors $m$.

3. Implement the random kitchen sink model for large-scale kernel methods we discussed in class. Hand in your code and contrast the performance of this method with the subset of regressors model, for both large and small $m$.
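A minimal sketch of the random kitchen sink idea for the Gaussian kernel, using random Fourier features on synthetic 1-d data (the data, bandwidth, and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random features z(x) = sqrt(2/m) * cos(W x + b), with the rows of W drawn
# from N(0, sigma^{-2}) and b uniform on [0, 2*pi], so that
# z(x)^T z(x') approximates the Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2)).
# Ridge regression on these features then approximates kernel regression.
n, m, sigma, lam = 200, 100, 1.0, 0.1
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.sin(2 * x[:, 0]) + 0.1 * rng.standard_normal(n)   # synthetic targets

W = rng.standard_normal((m, 1)) / sigma
b = rng.uniform(0, 2 * np.pi, m)

def features(x):
    return np.sqrt(2.0 / m) * np.cos(x @ W.T + b)

Z = features(x)                                          # n-by-m feature matrix
v = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)  # ridge solve
yhat = Z @ v
```

Unlike the subset of regressors model, the cost here is controlled by the number of random features $m$ rather than by a chosen subset of the training examples.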
