Description
1 Introduction
The goal of this problem set is to get more comfortable with the multiclass hinge loss and multiclass SVM. In several problems below, you are asked to justify that certain functions are convex. For these problems, you may use any of the rules about convex functions described in our
notes on Convex Optimization (https://davidrosenberg.github.io/mlcourse/Notes/
convex-optimization.pdf) or in the Boyd and Vandenberghe book. In particular, you will
need to make frequent use of the following result: If f1, . . . , fm : Rn → R are convex, then their
pointwise maximum
f(x) = max {f1(x), . . . , fm(x)}
is also convex.
2 Convex Surrogate Loss Functions
It’s common in machine learning that the loss functions we really care about lead to optimization
problems that are not computationally tractable. The 0/1 loss for binary classification is one such
example1
. Since we have better machinery for minimizing convex functions, a standard approach is
to find a convex surrogate loss function. A convex surrogate loss function is a convex function
that is an upper bound for the loss function of interest2
. If we can make the upper bound small,
1
Interestingly, if our hypothesis space is linear classifiers and we are in the “realizable” case, which means that
there is some hypothesis that achieves 0 loss (with the 0/1 loss), then we can efficiently find a good hypothesis using
linear programming. This is not difficult to see: each data point gives a single linear constraint, and we are looking
for a vector that satisfies the constraints for each data point.
2At this level of generality, you might be wondering: “A convex function of WHAT?”. For binary classification, we
usually are talking about a convex function of the margin. But to solve our machine learning optimization problems,
we will eventually need our loss function to be a convex function of some w ∈ Rd that parameterizes our hypothesis
space. It’ll be clear in what follows what we’re talking about.
1
then the loss we care about will also be small3
. Below we will show that the multiclass hinge loss
based on a class-sensitive loss ∆ is a convex surrogate for the multiclass loss function ∆, when
we have a linear hypothesis space. We’ll start with a special case, that the hinge loss is a convex
surrogate for the 0/1 loss.
2.1 Hinge loss is a convex surrogate for 0/1 loss
1. Let f : X → R be a classification score function for binary classification.
(a) For any example (x, y) ∈ X × {−1, 1}, show that
1(y 6= sign(f(x)) ≤ max {0, 1 − yf(x)} ,
where sign (x) =
1 x > 0
0 x = 0
−1 x < 0
.
(b) Show that the hinge loss max {0, 1 − m} is a convex function of the margin m.
(c) Suppose our prediction score functions are given by fw(x) = w
T x. The hinge loss of fw
on any example (x, y) is then max
0, 1 − ywT x
. Show that this is a convex function
of w.
2.2 Generalized Hinge Loss
Consider the multiclass output space Y = {1, . . . , k}. Suppose we have a base hypothesis space H =
{h : X × Y → R} from which we select a compatibility score function. Then our final multiclass
hypothesis space is F =
f(x) = arg maxy∈Y h(x, y) | h ∈ H
. Since functions in F map into Y,
our action space A and output space Y are the same. Nevertheless, we will write our class-sensitive
loss function as ∆ : Y × A → R, even though Y = A. We do this to indicate that the true class
goes in the first slot of the function, while the prediction (i.e. the action) goes in the second slot.
This is important because we do not assume that ∆(y, y0
) = ∆(y
0
, y). It would not be unusual to
have this asymmetry in practice. For example, false alarms may be much less costly than no alarm
when indeed something is going wrong.
Our ultimate goal would be to find f ∈ F minimizing the empirical cost-sensitive loss:
min
f∈F
Xn
i=1
∆ (yi
, f(xi)).
Since binary classification with 0/1 loss is both intractable and a special case of this formulation,
we know that this more general formulation must also be computationally intractable. Thus we are
looking for a convex surrogate loss function.
1. Suppose we have chosen an h ∈ H, from which we get the decision function f(x) = arg maxy∈Y h(x, y).
Justify that for any x ∈ X and y ∈ Y, we have
h(x, y) ≤ h(x, f(x)).
3This is actually fairly weak motivation for a convex surrogate. Much better motivation comes from the more
advanced theory of classification calibrated loss functions. See Bartlett et al’s paper “Convexity, Classification,
and Risk Bounds.” http://www.eecs.berkeley.edu/~wainwrig/stat241b/bartlettetal.pdf
2
2. Justify the following two inequalities:
∆ (y, f(x)) ≤ ∆ (y, f(x)) + h(x, f(x)) − h(x, y)
≤ max
y0∈Y
[∆ (y, y0
) + h(x, y0
) − h(x, y)]
The RHS of the last expression is called the generalized hinge loss:
` (h,(x, y)) = max
y0∈Y
[∆ (y, y0
) + h(x, y0
) − h(x, y)] .
We have shown that for any x ∈ X , y ∈ Y, h ∈ H we have
` (h,(x, y)) ≥ ∆(y, f(x)),
where, as usual, f(x) = arg maxy∈Y h(x, y). [You should think about why we cannot write
the generalized hinge loss as ` (f,(x, y)).]
3. We now introduce a specific base hypothesis space H of linear functions. Consider a classsensitive feature mapping Ψ : X × Y → Rd
, and H =
hw (x, y) = hw, Ψ(x, y)i | w ∈ Rd
.
Show that we can write the generalized hinge loss for hw(x, y) on example (xi
, yi) as
` (hw,(xi
, yi)) = max
y∈Y
[∆ (yi
, y) + hw, Ψ(xi
, y) − Ψ(xi
, yi)i] .
4. We will now show that the generalized hinge loss ` (hw,(xi
, yi)) is a convex function of w.
Justify each of the following steps.
(a) The expression ∆(yi
, y) + hw, Ψ(xi
, y) − Ψ(xi
, yi)i is an affine function of w.
(b) The expression maxy∈Y [∆ (yi
, y)) + hw, Ψ(xi
, y) − Ψ(xi
, yi)i] is a convex function of w.
5. Conclude that ` (hw,(xi
, yi)) is a convex surrogate for ∆(yi
, fw(xi)).
3 SGD for Multiclass SVM
Suppose our output space and our action space are given as follows: Y = A = {1, . . . , k}. Given a
non-negative class-sensitive loss function ∆ : Y × A → R≥0 and a class-sensitive feature mapping
Ψ : X × Y → Rd
. Our prediction function f : X → Y is given by
fw(x) = arg max
y∈Y
hw, Ψ(x, y)i
1. For a training set (x1, y1), . . .(xn, yn), let J(w) be the `2-regularized empirical risk function
for the multiclass hinge loss. We can write this as
J(w) = λkwk
2 +
1
n
Xn
i=1
max
y∈Y
[∆ (yi
, y) + hw, Ψ(xi
, y) − Ψ(xi
, yi)i] ,
for some λ > 0. We will now show that J(w) is a convex function of w. Justify each of
the following steps. As we’ve shown it in a previous problem, you may use the fact that
w 7→ maxy∈Y [∆ (yi
, y) + hw, Ψ(xi
, y) − Ψ(xi
, yi)i] is a convex function.
3
(a) 1
n
Pn
i=1 maxy∈Y [∆ (yi
, y) + hw, Ψ(xi
, y) − Ψ(xi
, yi)i] is a convex function of w.
(b) kwk
2
is a convex function of w.
(c) J(w) is a convex function of w.
2. Since J(w) is convex, it has a subgradient at every point. Give an expression for a subgradient
of J(w). You may use any standard results about subgradients, including the result from an
earlier homework about subgradients of the pointwise maxima of functions. (Hint: It may be
helpful to refer to yˆi = arg maxy∈Y [∆ (yi
, y) + hw, Ψ(xi
, y) − Ψ(xi
, yi)i].)
3. Give an expression the stochastic subgradient based on the point (xi
, yi).
4. Give an expression for a minibatch subgradient, based on the points (xi
, yi), . . . ,(xi+m−1, yi+m−1).
4 [OPTIONAL] Another Formulation of Generalized Hinge
Loss
In lecture we defined the margin of the compatibility score function h on the ith example (xi
, yi)
for class y as
mi,y(h) = h(xi
, yi) − h(xi
, y),
and the loss on an individual example (xi
, yi) to be:
max
y
[∆(yi
, y) − mi,y(h)]+
.
Here we investigate whether this is just an instance of the generalized hinge loss ` (h,(x, y)) defined
above.
1. Show that ` (h,(xi
, yi)) = maxy0∈Y [∆ (yi
, y0
) − mi,y0 (h)]. (In other words, it looks just like
loss above, but without the positive part.)
2. Suppose ∆ (y, y0
) ≥ 0 for all y, y0 ∈ Y. Show that for any example (xi
, yi) and any score
function h, the multiclass hinge loss we gave in lecture and the generalized hinge loss presented
above are equivalent, in the sense that
max
y∈Y
[∆(yi
, y) − mi,y(h)]+
= max
y∈Y
(∆(yi
, y) − mi,y(h)).
(Hint: This is easy by piecing together other results we have already attained regarding the
relationship between ` and ∆.)
3. In the context of the generalized hinge loss, ∆(y, y0
) is like the “target margin” between the
score for true class y and the score for class y
0
. Suppose that our prediction function f gets
the correct class on xi
. That is, f(xi) = arg maxy0∈Y h(xi
, y0
) = yi
. Furthermore, assume
that all of our target margins are reached or exceeded. That is
mi,y(h) = h(xi
, yi) − h(xi
, y) ≥ ∆(yi
, y),
for all y 6= yi
. It seems like in this case, we should have 0 loss. This is almost the case. Show
that ` (h,(xi
, yi)) = 0 if we assume that ∆ (y, y) = 0 for all y ∈ Y.
4
5 [OPTIONAL] Hinge Loss is a Special Case of Generalized
Hinge Loss
Let Y = {−1, 1}. Let ∆(y, yˆ) = 1(y 6= ˆy). If g(x) is the score function in our binary classification
setting, then define our compatibility function as
h(x, 1) = g(x)/2
h(x, −1) = −g(x)/2.
Show that for this choice of h, the multiclass hinge loss reduces to hinge loss:
` (h,(x, y)) = max
y0∈Y
[∆ (y, y0
)) + h(x, y0
) − h(x, y)] = max {0, 1 − yg(x)}
6 Multiclass Classification – Implementation
In this problem we will work on a simple three-class classification example, similar to the one given
in lecture. The data is generated and plotted for you in the skeleton code.
6.1 One-vs-All (also known as One-vs-Rest)
In this problem we will implement one-vs-all multiclass classification. Our approach will assume we
have a binary base classifier that returns a score, and we will predict the class that has the highest
score.
1. Complete the class OneVsAllClassifier in the skeleton code. Following the OneVsAllClassifier
code is a cell that extracts the results of the fit and plots the decision region. Include these
results in your submission.
2. [Optional] Normalize the vectors corresponding to each of the linear SVM classifiers so that
they have unit norm. Evaluate the results and plot the decision regions for these normalized
vectors.
3. [Optional] You may notice that every time you run the cell that fits the models and plots
the decision regions, you get slightly different results. This is more pronounced when C is
larger, e.g. C = 200. Investigate and propose an explanation for this. You may use any
means necessary, including google searching and asking other people (just like in real life).
[It’s generally good to investigate things that look odd – you’ll almost always learn something,
and sometimes you’ll uncover a more serious underlying problem.]
6.2 Multiclass SVM
In this question, we will implement stochastic subgradient descent for the linear multiclass SVM
described in lecture and in this problem set. We will use the class-sensitive feature mapping approach with the “multivector construction”, as described in our multiclass classification lecture and
in SSBD Section 17.2.1.
1. Complete the skeleton code for multiclass SVM. Following the multiclass SVM implementation, we have included another block of test code. Make sure to include the results from these
tests in your assignment, along with your code.
5
7 [Optional] Audio Classification
In this problem, we will work on the urban sound dataset URBANSOUND8K from the Center for
Urban Science and Progress (CUSP) at NYU. We will first extract features from raw audio data
using the LibROSA package, and then we will train multiclass classifiers to classify the sounds
into 10 sound classes. URBANSOUND8K dataset contains 8732 labeled sound excerpts. For this
problem, you may use the file UrbanSound8K.csv to randomly sample 2000 examples for training
and 2000 examples for validation.
1. In LibROSA, there are many functions for visualizing audio waves and spectra, such as display.waveplot() and display.specshow(). Load a random audio file from each class as a floating
point time series with librosa.load(), and plot their waves and linear-frequency power spectrogram. If you are interested, you can also play the audio in the notebook with functions
display() and Audio() in IPython.display.
2. Mel-frequency cepstral coefficients (MFCC) are a commonly used feature for sound processing.
We will use MFCC and its first and second differences (like discrete derivatives) as our features
for classfication. First, use function feature.mfcc() from LibROSA to extract MFCC features
from each audio sample. (The first MFCC coefficient is typically discarded in sound analysis,
but you do not need to. You can test whether this helps in the optional problem below.)
Next, use function feature.delta() to calculate the first and second differences of MFCC.
Finally, combine these features and normalize each feature to zero mean and unit variance.
3. Train a linear multiclass SVM on your 2000 example training set. Evaluate your results on
the validation set in terms of 0/1 error and generate a confusion table. Compare the results
to a one-vs-all classifier using a binary linear SVM as the base classifier. For each model, may
use your code from the previous problem, or you may use another implementation (e.g. from
sklearn).
4. [More Optional] Compare results to any other multiclass classification methods of your choice.
5. [More Optional] Try different feature sets and see how they affect performance.
6