1 Hard-Coding Networks
The reading on multilayer perceptrons located at https://csc413-2020.github.io/assets/readings/L02a.pdf may be useful for this question.
1.1 Verify Sort [1pt]
In this problem, you need to find a set of weights and biases for a multilayer perceptron which
determines if a list of length 4 is in sorted order. More specifically, you receive four inputs x1, . . . , x4,
where xi ∈ R, and the network must output 1 if x1 ≤ x2 ≤ x3 ≤ x4, and 0 otherwise. You will use
the following architecture:
All of the hidden units and the output unit use a hard threshold activation function:
φ(z) = I(z ≥ 0), i.e., φ(z) = 1 if z ≥ 0 and φ(z) = 0 if z < 0.
Please give a set of weights and biases for the network which correctly implements this function
(including cases where some of the inputs are equal). Your answer should include:
• A 3 × 4 weight matrix W^(1) for the hidden layer
• A 3-dimensional vector of biases b^(1) for the hidden layer
• A 3-dimensional weight vector w^(2) for the output layer
• A scalar bias b^(2) for the output layer
You do not need to show your work.
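Although you do not need to show your work, it can help to sanity-check a candidate answer by coding the forward pass directly. The sketch below is ours, not part of the required submission: it assumes NumPy, and W1, b1, w2, b2 stand for whatever values you propose for W^(1), b^(1), w^(2), b^(2).

```python
import numpy as np

def hard_threshold(z):
    # phi(z) = I(z >= 0), applied elementwise
    return (z >= 0).astype(float)

def verify_sort_net(x, W1, b1, w2, b2):
    # x: (4,), W1: (3, 4), b1: (3,), w2: (3,), b2: scalar
    h = hard_threshold(W1 @ x + b1)        # hidden activations
    return hard_threshold(w2 @ h + b2)     # 1.0 if the net says "sorted", else 0.0

# Example usage with placeholder (not yet correct) parameters:
W1, b1 = np.zeros((3, 4)), np.zeros(3)
w2, b2 = np.zeros(3), 0.0
for x in [[1, 2, 3, 4], [1, 3, 2, 4], [2, 2, 2, 2]]:
    print(x, "->", verify_sort_net(np.array(x, dtype=float), W1, b1, w2, b2))
```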
1.2 Perform Sort [1pt]
Describe how to implement a sorting function f̂ : R^4 → R^4 where f̂(x1, x2, x3, x4) = (x̂1, x̂2, x̂3, x̂4) is (x1, x2, x3, x4) in sorted order. In other words, x̂1 ≤ x̂2 ≤ x̂3 ≤ x̂4, and each x̂i is a distinct xj. Implement f̂ using a feedforward or recurrent neural network with elementwise activations. Do not explicitly give the weight matrices or biases for the entire function. Instead, describe how to compose smaller, modular networks. You may multiply the outputs of two nodes if you wish, e.g., multiply some hidden unit activation by the output of a thresholding function.
Hint: There are multiple solutions. You could brute-force the answer by using copies of Verify Sort on permutations of the input, or you could implement a more scalable sorting algorithm where each hidden layer i is the algorithm's state at step i.
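As an illustration of the "smaller, modular networks" idea (not a complete solution), the minimum and maximum of two numbers can each be computed with one linear layer and a ReLU, assuming ReLU counts as an allowed elementwise activation; the helper names below are ours.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def compare_exchange(a, b):
    """Return (min(a, b), max(a, b)) using only linear maps and ReLU:
       min(a, b) = a - ReLU(a - b),  max(a, b) = b + ReLU(a - b)."""
    d = relu(a - b)
    return a - d, b + d

print(compare_exchange(3.0, 1.0))   # (1.0, 3.0)
print(compare_exchange(2.0, 2.0))   # (2.0, 2.0)
```

Stacking such compare-exchange modules layer by layer, in the pattern of a bubble or odd-even sorting network, is one way to compose modular pieces into a full sorting function.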
1.3 Universal Approximation Theorem [1pt]
We are going to build intuition for a simple universal approximation theorem, which shows that some class of function approximators can approximate a particular class of functions arbitrarily well.
In the reading we saw that neural networks can be universal approximators on binary functions, because with fixed input dimension there is a finite number of potential inputs and a network can memorize a different output for each input. But what can we do if we have an uncountably infinite set of potential inputs, like R? Here, our class of function approximators will be neural networks with a single hidden layer whose activation is a threshold function, and the function to be approximated will be a fixed choice of some concave f.
Suppose f : I → R, where I = [a, b] ⊂ R with a ≤ b is a closed interval. Also, let f̂τ : I → R be some function approximator from our network, where τ is a description of our network's architecture and weights. Here, τ is a tuple (n, W0 ∈ R^{n×1}, b0 ∈ R^n, W1 ∈ R^{1×n}, b1 ∈ R), where n is the hidden layer size, W0 and b0 describe the input-to-hidden parameters, and W1 and b1 describe the hidden-to-output parameters. This is visualized in Figure 1.
The difference between our functions is defined as ‖f − f̂τ‖ = ∫_I |f(x) − f̂τ(x)| dx. Our activation is an indicator function a(y) = I(y ≥ 0), where I(s) is 1 when the boolean value s is true and 0 otherwise. The output is computed as f̂τ(x) = W1 a(W0 x + b0) + b1. Here, applying a to a vector means a(x) = [a(x1), a(x2), . . . , a(xn)].
We want to show that there exists a sequence of neural networks {τ1, τ2, . . .} such that:

∀ε > 0, ∃M : ∀m > M, ‖f − f̂_{τm}‖ < ε.    (1)
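The definitions above translate directly into code. The sketch below uses our own helper names and assumes NumPy; it evaluates f̂τ(x) for a given τ and approximates ‖f − f̂τ‖ with a Riemann sum, which can be useful for checking your answers to the parts that follow.

```python
import numpy as np

def indicator(y):
    # a(y) = I(y >= 0), applied elementwise
    return (y >= 0).astype(float)

def f_hat(x, W0, b0, W1, b1):
    # tau = (n, W0 in R^{n x 1}, b0 in R^n, W1 in R^{1 x n}, b1 in R); x is a scalar
    hidden = indicator(W0[:, 0] * x + b0)          # shape (n,)
    return float(W1[0] @ hidden) + b1

def l1_distance(f, f_hat_fn, a=-1.0, b=1.0, num=10_000):
    # Riemann-sum approximation of the integral of |f(x) - f_hat(x)| over [a, b]
    xs = np.linspace(a, b, num)
    vals = np.abs(np.array([f(x) - f_hat_fn(x) for x in xs]))
    return float(np.mean(vals) * (b - a))

# Example: f(x) = -x^2 + 1 on I = [-1, 1], compared against the zero function
f = lambda x: -x**2 + 1.0
print(l1_distance(f, lambda x: 0.0))               # about 4/3
```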
1.3.1
Consider a bump function g(h, a, b, x) = h · I(a ≤ x ≤ b), visualized in Figure 2. Given some (h, a, b), show ∃τ : f̂τ(x) = g(h, a, b, x). Your answer should be a specific choice of n, W0, b0, W1, and b1, which will be functions of the selected (h, a, b), where h ∈ R, a ∈ R, and b ∈ R.
1.3.2
Given f(x) = −x^2 + 1 on I = [−1, 1] and some initial function f̂0(x) = 0 which is identically 0, construct a new function f̂1(x) = f̂0(x) + g(h1, a1, b1, x) such that ‖f − f̂1‖ < ‖f − f̂0‖, with the g defined in 1.3.1. Note that h1, a1, and b1 are going to depend on our choice of f, f̂0, and I. Plot f and f̂1, write down h1, a1, and b1, and justify why ‖f − f̂1‖ < ‖f − f̂0‖.

Figure 1: A neural network with a single input x ∈ R, one hidden layer of n units (parameters W0, b0), and a single output f̂τ(x) ∈ R (parameters W1, b1). The network description is τ = (n, W0 ∈ R^{n×1}, b0 ∈ R^n, W1 ∈ R^{1×n}, b1 ∈ R).

Figure 2: A bump function g(h, a, b, x) shown as a function of x for some choice of (h, a, b): it equals h for a ≤ x ≤ b and 0 elsewhere.
1.3.3
Describe a procedure which starts with f̂0(x) = 0 and a fixed N, and constructs a sequence {f̂i}_{i=0}^N where f̂_{i+1}(x) = f̂i(x) + g(h_{i+1}, a_{i+1}, b_{i+1}, x) and which satisfies ‖f − f̂_{i+1}‖ < ‖f − f̂i‖. Use the definition of g from 1.3.1 and the choice of f from 1.3.2. Plot f, f̂1, f̂2, and f̂3, write down how to generate h_{i+1}, a_{i+1}, b_{i+1}, and justify why ‖f − f̂_{i+1}‖ < ‖f − f̂i‖.
1.3.4
Not for marks – do not submit. Describe how to recover τi from your f̂i. Generalize your sequence to work for {f̂i}_{i=0}^∞, and show that lim_{m→∞} ‖f − f̂_{τm}‖ = c. We need c = 0 for arbitrarily good approximations – does your solution for {f̂i}_{i=0}^∞ force c = 0? Can you generalize your procedure to work for any concave f? What about an f which isn't unimodal, like a piecewise continuous f?
2 Backprop
The reading on backpropagation located at https://csc413-2020.github.io/assets/readings/L02b.pdf may be useful for this question.
2.1 Computational Graph [1pt]
Consider a neural network with N input units, N output units, and K hidden units. The activations
are computed as follows:
z = W^(1)x + b^(1)
h = ReLU(z)
y = x + W^(2)h + b^(2)
y′ = softmax(y),

where ReLU(z) = max(z, 0) denotes the ReLU activation function, applied elementwise, and softmax(y) = exp(y) / Σ_{i=1}^M exp(yi).
The cost will involve both h and y′:

J = R − S
R = r^⊤h
S = Σ_{k=1}^N I(t = k) y′_k

for given vectors r and class label t with input x.
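To make the setup concrete, here is a sketch of the forward pass in NumPy; the shapes and helper names are our own assumptions, and the class label t is taken to be 0-indexed.

```python
import numpy as np

def forward(x, t, r, W1, b1, W2, b2):
    # x, r, b2: (N,); t: integer class label (0-indexed); W1: (K, N); b1: (K,); W2: (N, K)
    z = W1 @ x + b1                           # pre-activations
    h = np.maximum(z, 0.0)                    # ReLU, elementwise
    y = x + W2 @ h + b2                       # note the skip connection from x
    y_prime = np.exp(y) / np.sum(np.exp(y))   # softmax
    R = r @ h
    S = y_prime[t]                            # equals sum_k I(t = k) y'_k
    return R - S                              # J = R - S
```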
2.1.1
Draw the computation graph relating x, t, z, h, y, y′, r, R, S, and J.
2.1.2
Derive the backprop equations for computing x̄ = ∂J/∂x. You may use softmax′ to denote the derivative of the softmax function (so you don't need to write it out explicitly).
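Once you have derived x̄, a central-difference check is a quick way to validate it numerically; the helper below is a generic sketch of ours and works with any scalar-valued J as a function of x.

```python
import numpy as np

def numerical_grad(J_fn, x, eps=1e-6):
    # Central differences: grad_i ~ (J(x + eps*e_i) - J(x - eps*e_i)) / (2*eps)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (J_fn(x + e) - J_fn(x - e)) / (2 * eps)
    return grad

# Usage: with the forward pass sketched in 2.1, fix t, r, and the weights, then compare
# numerical_grad(lambda x: forward(x, t, r, W1, b1, W2, b2), x) against your backprop x_bar.
```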
2.2 Vector-Jacobian Products (VJPs) [1pt]
Consider the function f : R^n → R^n where f(x) = v v^⊤ x, with v ∈ R^{n×1} and x ∈ R^{n×1}. Here, we will explore the relative costs of evaluating Jacobians and vector-Jacobian products. We denote the Jacobian of f with respect to x as J ∈ R^{n×n}.
2.2.1
Compute J as defined in 2.2 for n = 3 and v^⊤ = [1, 2, 3] – i.e., write down every entry j_{i,k} of the n × n matrix J = [j_{i,k}], for i, k = 1, . . . , n.
2.2.2
What is the time and memory cost of evaluating the Jacobian of function f in terms of n?
2.2.3
Describe how to evaluate J^⊤y, where y ∈ R^n, with a time and memory cost that is linear in n, where J is defined as in 2.2. Then, compute z = J^⊤y for v^⊤ = [1, 2, 3] and y^⊤ = [1, 1, 1] – i.e., write down the entries of z^⊤ = [z1, . . . , zn].
3 Linear Regression
The reading on linear regression located at https://csc413-2020.github.io/assets/readings/L01a.pdf may be useful for this question.
Given n pairs of input data with d features and scalar labels, (xi, ti) ∈ R^d × R, we wish to find a linear model f(x) = ŵ^⊤x with ŵ ∈ R^d that minimizes the squared error of prediction on the training samples, defined below. This is known as an empirical risk minimizer. For concise notation, denote the data matrix X ∈ R^{n×d} and the corresponding label vector t ∈ R^n. The training objective is to minimize the following loss:
min_ŵ (1/n) Σ_{i=1}^n (ŵ^⊤xi − ti)^2  =  min_ŵ (1/n) ‖Xŵ − t‖_2^2.
We assume X is full rank: X^⊤X is invertible when n > d, and XX^⊤ is invertible otherwise. Note that when d > n, the problem is underdetermined, i.e. there are fewer training samples than parameters to be learned. This is analogous to learning an overparameterized model, which is common when training deep neural networks.
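As a sketch (our helper names, assuming NumPy), the objective can be written in a few lines, and a central-difference approximation gives an easy numerical check for the gradient you derive in 3.1:

```python
import numpy as np

def loss(w_hat, X, t):
    # (1/n) * ||X w_hat - t||_2^2
    r = X @ w_hat - t
    return float(r @ r) / X.shape[0]

def numerical_grad(w_hat, X, t, eps=1e-6):
    # Central-difference approximation, one coordinate at a time
    g = np.zeros_like(w_hat)
    for i in range(w_hat.size):
        e = np.zeros_like(w_hat)
        e[i] = eps
        g[i] = (loss(w_hat + e, X, t) - loss(w_hat - e, X, t)) / (2 * eps)
    return g

# Example with n = 5 samples and d = 3 features:
X, t = np.random.randn(5, 3), np.random.randn(5)
w_hat = np.random.randn(3)
print(numerical_grad(w_hat, X, t))   # compare against the gradient you derive
```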
3.1 Deriving the Gradient [1pt]
Write down the gradient of the loss w.r.t. the learned parameter vector ŵ.
3.2 Underparameterized Model [1pt]
3.2.1
First consider the underparameterized d < n case. Write down the solution obtained by gradient
descent assuming training converges. Show your work. Is the solution unique?
Hint: Set the gradient derived in part 3.1 to 0 and manipulate the equality.
3.2.2
Assume that ground truth labels are generated by a linear target: ti = w*^⊤xi. Show that the solution in part 3.2.1 achieves perfect generalization when d < n, i.e. ∀x ∈ R^d, (w*^⊤x − ŵ^⊤x)^2 = 0.
3.3 Overparameterized Model: 2D Example [1pt]
3.3.1
Now consider the overparameterized d > n case. We first illustrate that there exist multiple empirical risk minimizers. For simplicity we let n = 1 and d = 2. Choose x1 = [2, 1] and t1 = 2, i.e. the one data point and all possible ŵ lie on a 2D plane. Show that there exist infinitely many ŵ satisfying ŵ^⊤x1 = t1, and that they lie on a line. Write down the equation of the line.
3.3.2
We know that multiple empirical risk minimizers exist in overparameterized linear regression, yet there is only one true solution. Thus, it seems unlikely that gradient descent will generalize if it returns an arbitrary minimizer. However, we will show that gradient descent tends to find a particular solution with good properties. This phenomenon, known as implicit regularization, helps explain the success of gradient-based methods in training overparameterized models, like deep neural networks.
First consider the 2-dimensional example in the previous part: starting from zero initialization, i.e. ŵ^(0) = 0, what is the direction of the gradient? You should write down a unit-norm vector. Does the direction change along the trajectory? Based on this geometric intuition, which solution – along the line of solutions – does gradient descent find? Provide a pictorial sketch or a short description of your reasoning.
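If you want to build the geometric intuition empirically, you can run gradient descent on this n = 1, d = 2 example and watch the iterates; the step size and iteration count below are arbitrary choices of ours, and the update uses the standard least-squares gradient.

```python
import numpy as np

X = np.array([[2.0, 1.0]])     # the single data point x1 = [2, 1]
t = np.array([2.0])            # label t1 = 2
w = np.zeros(2)                # zero initialization w_hat^(0) = 0
lr = 0.05                      # arbitrary small step size
for step in range(200):
    grad = 2.0 * X.T @ (X @ w - t) / X.shape[0]   # gradient of (1/n)||Xw - t||^2
    w = w - lr * grad
print("gradient descent converged near", w)
# Compare the direction of w with the direction of x1 = [2, 1].
```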
3.3.3
Give a geometric argument that among all the solutions on the line, the gradient descent solution
from the previous part has the smallest Euclidean norm.
Hint: Use the Pythagorean Theorem.
3.4 Overparameterized Model: General Case [1pt]
3.4.1
Now we generalize the geometric insight developed in the previous part to general d > n. Show that gradient descent from zero initialization, i.e. ŵ^(0) = 0, finds a unique minimizer if it converges. Write down the solution and show your work.
Hint: Recall the result on the direction of the gradient in the previous 2D example, which suggests that the gradient vector always lies in the span of the rows of X. What does this tell you about the gradient descent solution?
3.4.2
Given the gradient descent solution ŵ from the previous part and another zero-loss solution ŵ1, evaluate (ŵ − ŵ1)^⊤ŵ. Use this quantity to show that, among all the empirical risk minimizers for d > n, the gradient descent solution has the smallest Euclidean norm.
3.5 Benefit of Overparameterization
3.5.1
Visualize and compare underparameterized with overparameterized polynomial regression: https://colab.research.google.com/drive/1NCSHp-gkh1kGhcySImbJo6AqeKoz-HcR. Include your code snippets for the fit_poly function in the write-up. Does overparameterization (a higher-degree polynomial) always lead to overfitting, i.e. larger test error?
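For reference, a generic least-squares polynomial fit reduces to building a polynomial feature matrix and solving with np.linalg.lstsq; the sketch below is only illustrative, and the notebook's actual fit_poly signature and requirements may differ.

```python
import numpy as np

def fit_poly_sketch(x, t, degree):
    # x, t: 1-D arrays; returns coefficients c_0, ..., c_degree (illustrative only)
    Phi = np.vander(x, degree + 1, increasing=True)   # features [1, x, x^2, ...]
    # lstsq returns the minimum-norm solution when the system is underdetermined
    coeffs, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return coeffs

def predict_poly(coeffs, x):
    return np.vander(x, len(coeffs), increasing=True) @ coeffs
```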
3.5.2
Not for marks. What are some potential benefits of the minimum-norm solution?
Hint: readings on SVMs might be helpful: https://www.cs.toronto.edu/~urtasun/courses/CSC411_Fall16/15_svm.pdf.