COMS 4771 HW4
1 From distances to embeddings
Your friend from overseas is visiting you and asks you for the geographical locations of popular US cities on a map. Not having access to a US map, you realize that you cannot provide your friend with accurate information.
You recall that you have access to the relative distances between nine popular
US cities, given by the following distance matrix D:
Distances (D) BOS NYC DC MIA CHI SEA SF LA DEN
BOS 0 206 429 1504 963 2976 3095 2979 1949
NYC 206 0 233 1308 802 2815 2934 2786 1771
DC 429 233 0 1075 671 2684 2799 2631 1616
MIA 1504 1308 1075 0 1329 3273 3053 2687 2037
CHI 963 802 671 1329 0 2013 2142 2054 996
SEA 2976 2815 2684 3273 2013 0 808 1131 1307
SF 3095 2934 2799 3053 2142 808 0 379 1235
LA 2979 2786 2631 2687 2054 1131 379 0 1059
DEN 1949 1771 1616 2037 996 1307 1235 1059 0
Being a machine learning student, you believe that it may be possible to infer the locations of
these cities from the distance data. To find an embedding of these nine cities on a two-dimensional map, you decide to set this up as an optimization problem, as follows.
You associate a two-dimensional variable $x_i$ with the unknown latitude and longitude values for each of the nine cities (that is, $x_1$ is the lat/lon value for BOS, $x_2$ is the lat/lon value for NYC, etc.).
You write down the (unconstrained) optimization problem
$$\min_{x_1, \ldots, x_9} \; \sum_{i,j} \big(\|x_i - x_j\| - D_{ij}\big)^2,$$
where $\sum_{i,j} \big(\|x_i - x_j\| - D_{ij}\big)^2$ denotes the embedding discrepancy function.
(i) What is the derivative of the discrepancy function with respect to a location $x_i$?
(ii) Write a program in your preferred language to find an optimal setting of the locations $x_1, \ldots, x_9$. You must submit your code to receive full credit.
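For part (ii), a minimal sketch of one possible approach in Python is given below; it minimizes the discrepancy function with a generic optimizer from SciPy, and the choice of optimizer, initialization scale, and number of restarts are illustrative assumptions rather than requirements.

import numpy as np
from scipy.optimize import minimize

# Pairwise distance matrix D from the table above.
D = np.array([
    [   0,  206,  429, 1504,  963, 2976, 3095, 2979, 1949],  # BOS
    [ 206,    0,  233, 1308,  802, 2815, 2934, 2786, 1771],  # NYC
    [ 429,  233,    0, 1075,  671, 2684, 2799, 2631, 1616],  # DC
    [1504, 1308, 1075,    0, 1329, 3273, 3053, 2687, 2037],  # MIA
    [ 963,  802,  671, 1329,    0, 2013, 2142, 2054,  996],  # CHI
    [2976, 2815, 2684, 3273, 2013,    0,  808, 1131, 1307],  # SEA
    [3095, 2934, 2799, 3053, 2142,  808,    0,  379, 1235],  # SF
    [2979, 2786, 2631, 2687, 2054, 1131,  379,    0, 1059],  # LA
    [1949, 1771, 1616, 2037,  996, 1307, 1235, 1059,    0],  # DEN
], dtype=float)

def discrepancy(flat_locs):
    # Embedding discrepancy: sum over pairs (i, j) of (||x_i - x_j|| - D_ij)^2.
    X = flat_locs.reshape(9, 2)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return np.sum((dists - D) ** 2)

# The objective is non-convex, so several random restarts may help.
rng = np.random.default_rng(0)
best = min(
    (minimize(discrepancy, rng.normal(scale=1000.0, size=18)) for _ in range(10)),
    key=lambda res: res.fun,
)
locs = best.x.reshape(9, 2)  # estimated two-dimensional locations of the nine cities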
(iii) Plot the result of the optimization showing the estimated locations of the nine cities. (Here is sample code to plot the city locations in Matlab.)
>> cities={'BOS','NYC','DC','MIA','CHI','SEA','SF','LA','DEN'};
>> locs = [x1;x2;x3;x4;x5;x6;x7;x8;x9];
>> figure; text(locs(:,1), locs(:,2), cities);
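If you prefer Python over Matlab, a matplotlib sketch along the same lines (a hedged example that assumes locs is the 9-by-2 NumPy array of estimated locations produced by your optimizer, as in the sketch above):

import matplotlib.pyplot as plt

cities = ['BOS', 'NYC', 'DC', 'MIA', 'CHI', 'SEA', 'SF', 'LA', 'DEN']
plt.figure()
plt.scatter(locs[:, 0], locs[:, 1])   # estimated city positions
for (px, py), name in zip(locs, cities):
    plt.annotate(name, (px, py))      # label each point with its city name
plt.show()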
What can you say about the estimated locations compared to the actual geographical locations of these cities?
2 Kernelized SVM vs Neural Networks: Theory and Empirics
In this question, you will get a glimpse into some of the similarities and differences of kernelized
SVMs (KSVM) and neural networks (NN) using both theory and empirics.
Understanding KSVMs for Regression: In lecture we covered kernelized SVMs for classification,
but here we will consider a formulation for regression. For classification, we aimed to find the
maximum margin decision boundary between our two classes.
For regression, we can first study the linear case with one-dimensional inputs and outputs. We aim to learn a predictor of the form $f(x) = wx + b$ given a set of observations $(x_i, y_i)$ for $i \in [N]$. We now want to constrain the possible choices of our parameters, namely we want $\forall i \in [N] : |y_i - (wx_i + b)| \le \epsilon$, where $\epsilon$ is a hyperparameter. This ensures that all our residuals are within distance $\epsilon$.
(a) Assuming there exists such a predictor, we can formulate our optimization problem as follows:
$$\min_{w,b} \; \frac{1}{2} w^2 \qquad \text{such that: } \forall i \in [N] : |y_i - (wx_i + b)| \le \epsilon.$$
Notice that the constraints are making sure that our prediction is within distance $\epsilon$. What
role does the objective play in this optimization?
(Hint: consider the formulation of ridge regression)
(b) Of course, it is not guaranteed that all predictions will be within distance $\epsilon$. Thus, we can add in slack variables as follows:
$$\min_{w,b,\xi,\xi^*} \; \frac{1}{2} w^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$$
$$\text{such that: } \quad y_i - (wx_i + b) \le \epsilon + \xi_i \quad \forall i \in [N]$$
$$(wx_i + b) - y_i \le \epsilon + \xi_i^*$$
$$\xi_i \ge 0$$
$$\xi_i^* \ge 0.$$
Describe in words the role of $\xi_i$ and $\xi_i^*$.
(c) Show that the learned predictor will take the form $f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, x_i x + b$ by setting up the Lagrangian and examining the stationary points with respect to $w$.
So far our SVM formulation was for linear regression, where the prediction is done by
$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, x_i x + b.$$
Note that by taking $\beta_i = \alpha_i - \alpha_i^*$, and replacing the (inner) product between $x_i$ and $x$ by any non-linear kernel, we can now have a kernelized version as
$$f_{\mathrm{kern}}(x) = \sum_{i=1}^{N} \beta_i K(x_i, x) + b.$$
For the remainder of the question we will be focusing on the RBF kernel.
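As an empirical sanity check of this form (a sketch only, not required for the assignment; the toy data is purely illustrative), scikit-learn's SVR exposes the fitted coefficients playing the role of the $\beta_i$, the support vectors, and $b$ as dual_coef_, support_vectors_, and intercept_, so its predictions can be reproduced as $\sum_i \beta_i K(x_i, x) + b$:

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

# Toy one-dimensional regression data (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * x).ravel()

svr = SVR(kernel="rbf", gamma=1.0).fit(x, y)

# Reconstruct predictions as sum_i beta_i * K(x_i, x) + b over the support vectors.
K = rbf_kernel(x, svr.support_vectors_, gamma=1.0)  # shape (n_points, n_support)
manual = K @ svr.dual_coef_.ravel() + svr.intercept_
print(np.max(np.abs(manual - svr.predict(x))))      # should be numerically ~0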
Approximation Power of RBF KSVMs: RBF kernels are known to be able to fit arbitrarily complex functions. Here, we will show specifically that our KSVM model can approximate any continuous function over the interval $[-1, 1]$ with arbitrary precision¹.
Let $Z$ be the set of all possible RBF KSVM regressors of the type $[-1, 1] \to \mathbb{R}$ on a non-empty dataset, that is, $Z := \{f_{\mathrm{kern}}\}$.
(d) Prove that $Z$ is an algebra, that is, it is closed under: (i) addition ($\forall f_1, f_2 \in Z : (f_1 + f_2) \in Z$), (ii) multiplication ($\forall f_1, f_2 \in Z : f_1 f_2 \in Z$, i.e. element-wise multiplication), and (iii) scalar multiplication ($\forall f_1 \in Z : \forall c \in \mathbb{R} : c f_1 \in Z$).
For multiplication, you may assume a scaling parameter of 1 for the RBF kernel to simplify calculations.
(e) Prove that $Z$ can “isolate” points in $[-1, 1]$, that is, for a fixed dataset and any distinct $x, y \in [-1, 1]$, there exists $f \in Z$ such that $f(x) \ne f(y)$.
(f) Prove that $\forall x \in [-1, 1] : \exists f \in Z : f(x) \ne 0$.
Parts (d)-(f) collectively imply that RBF KSVMs can approximate any continuous function on the interval $[-1, 1]$.²
Approximation Power of 2-Layer Neural Networks: A 2-layer neural network (with one-dimensional input, one-dimensional output, and an $N$-width hidden layer) is defined as
$$f_{\mathrm{NN}}(x) := \sum_{i=1}^{N} \alpha_i\, \sigma(w_i x + b_i),$$
where $w_i$, $b_i$, and $\alpha_i$ are the network weights, and $\sigma$ is the “activation” function (usually a sigmoid, ReLU, or tanh).
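For concreteness, a network of exactly this form (width $N$, ReLU activation, no output bias) can be written in PyTorch as below; this is a hedged sketch for intuition only, and the class name and width are illustrative.

import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    # f_NN(x) = sum_{i=1}^N alpha_i * sigma(w_i x + b_i), with sigma = ReLU.
    def __init__(self, width: int):
        super().__init__()
        self.hidden = nn.Linear(1, width)              # holds the w_i and b_i
        self.output = nn.Linear(width, 1, bias=False)  # holds the alpha_i
        self.act = nn.ReLU()

    def forward(self, x):
        return self.output(self.act(self.hidden(x)))

f_nn = TwoLayerNet(width=32)
x = torch.linspace(-1, 1, 5).reshape(-1, 1)
print(f_nn(x))  # values of a randomly initialized f_NN at five points in [-1, 1]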
¹The set of continuous functions on the interval $[-1, 1]$ is denoted $C([-1, 1])$.
²One can combine these results, see e.g. the Stone–Weierstrass theorem, to show universal approximability.
Here we will show that ReLU-activated 2-layer neural networks, i.e. $f_{\mathrm{NN}}$, can approximate any function from $[-1, 1] \to \mathbb{R}$, by showing that ReLU activations are “discriminatory”.³
An activation function $\sigma$ is called discriminatory if, for a given $\mu \in C([-1, 1])$,
$$\Big( \forall w, b \in \mathbb{R} : \int_{-1}^{1} \sigma(wx + b)\,\mu(x)\,dx = 0 \Big) \implies \mu(x) = 0.$$
(g) Prove that the linear activation function, that is, $\sigma(x) = x$, is not discriminatory.
(Hint: think of a discrete case first.)
(h) Assuming that all continuous functions $\sigma$ of the form
$$\sigma(x) \to \begin{cases} 1 & \text{as } x \to \infty \\ 0 & \text{as } x \to -\infty \end{cases}$$
are discriminatory, prove that ReLU is discriminatory.
(Hint: try to construct a function of this type out of ReLUs and prove by contradiction.)
Comparing Empirical Performance of KSVMs and NNs: If both KSVMs and NNs have universal approximation properties (as seen in the previous parts), then why are NNs more widely used in practice?
While both KSVMs and NNs are powerful models that can work well on arbitrarily complex datasets, on simpler datasets we want models that require fewer training samples to yield good predictions. Here we will compare the relative performance of KSVMs and NNs on increasingly complex datasets and study which model class adapts better.⁴
Download the dataset provided. It contains train and test samples of various sizes of (x, y) pairs of
functions with increasing complexity. For this question, you may use any library you want.
(i) To get a feel for the data, create scatter plots (x vs. y) using 50 training samples and 1000
training samples for each function complexity. (There are a total of 10 plots.)
(j) Train an RBF KSVM regressor and a NN for each function complexity with varying number
of training samples.
Suggestions: For the SVM, it is advised that you use scikit-learn's built-in Support Vector Regression (SVR) class. The default settings should work fine; just ensure that you specify the right kernel. For the NN, it is advised that you use PyTorch. Using a small neural network with 2
or 3 hidden layers and a dozen or a few dozen neurons per layer with the Adam optimizer and
ReLU activation function should work well. To squeeze out good performance, try changing
the number of epochs and the batch size first. Additionally, see this reference for more training
tips: http://karpathy.github.io/2019/04/25/recipe/. You must use the MSE loss. For training
sample sizes, consider sizes 50, 100, 300, 500, 750, and 1000.
You must submit your code to receive credit.
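One possible skeleton for this experiment is sketched below. It assumes each train/test split has already been loaded into one-dimensional NumPy arrays; the network width, epoch count, and learning rate are illustrative defaults rather than requirements.

import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVR

def fit_ksvm(x_train, y_train):
    # RBF kernel with scikit-learn's default hyperparameters, as suggested above.
    return SVR(kernel="rbf").fit(x_train.reshape(-1, 1), y_train)

def fit_nn(x_train, y_train, width=32, epochs=2000, lr=1e-3):
    # Small fully connected network trained with Adam and the MSE loss (full batch).
    model = nn.Sequential(
        nn.Linear(1, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    x = torch.tensor(x_train, dtype=torch.float32).reshape(-1, 1)
    y = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

def mse(y_pred, y_true):
    return float(np.mean((np.asarray(y_pred).ravel() - np.asarray(y_true).ravel()) ** 2))

# Example usage for one function complexity and one training size, assuming
# x_train, y_train, x_test, y_test are one-dimensional NumPy arrays from the dataset:
#   svr = fit_ksvm(x_train, y_train)
#   svr_mse = mse(svr.predict(x_test.reshape(-1, 1)), y_test)
#   net = fit_nn(x_train, y_train)
#   with torch.no_grad():
#       nn_pred = net(torch.tensor(x_test, dtype=torch.float32).reshape(-1, 1)).numpy()
#   nn_mse = mse(nn_pred, y_test)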
(k) For both KSVM and NN predictors, and each function complexity, plot the test MSE for varying training sizes.
³See e.g. “Approximation by Superpositions of a Sigmoidal Function” by Cybenko for a proof of why discriminatory activation functions imply universal approximation.
⁴For a more theoretical analysis of this topic, please see https://francisbach.com/quest-for-adaptivity/ and references therein.
(l) What can you conclude about the adaptability of KSVMs vs. NNs? Is one model better than
the other? Analyze how the prediction quality varies with function complexity and training
size.