## Description

1. An exponential loss function f(w) is defined as

f(w) = (

e

−2(w−1), w < 1

e

w−1

, w ≥ 1

a) Is f(w) convex? Why? Hint: Graph the function.

b) Is f(w) differentiable everywhere? If not, where not?

c) The “differential set” ∂f(w) is the set of subgradients v ∈ ∂f(w) for which

f(u) ≥ f(w) + (u − w)

T v. Find the differential set for f(w) as a function of w.

2. We are trying to predict whether a certain chemical reaction will take place as a function

of our experimental conditions: temperature, pressure, concentration of catalyst, and

several other factors.

For each experiment i = 1, . . . , m we record the experimental

conditions in the vector xi ∈ R

n and the outcome in the scalar bi ∈ {−1, 1} (+1 if the

reaction occurred and −1 if it did not). We will train our linear classifier to minimize

hinge loss.

Namely, we solve:

minimize

w

Xm

i=1

(1 − bix

T

i w)+ where (u)+ = max(0, u) is the hinge loss operator

a) Derive a gradient descent method for solving this problem. Explicitly give the

computations required at each step. Note: you may ignore points where the

function is non-differentiable.

b) Explain what happens to the algorithm if you land at a wk

that classifies all the

points perfectly, and by a substantial margin.

3. You have four training samples y1 = 1, x1 =

”

1

−1

#

, y2 = 2, x2 =

”

1

−2

#

, y3 =

−1, x3 =

”

−1

0

#

, and y4 = −2, x4 =

”

−2

1

#

. Use cyclic stochastic gradient descent

to find the first two updates for the LASSO problem

min

w

||y − Xw||2

2 + 2||w||1

assuming a step size of τ = 1 and w(0) = 0. Also indicate the data used for the first

six updates.