CSC413/2516 Homework 1 Hard-Coding Networks


1 Hard-Coding Networks
Can we use neural networks to tackle coding problems? Yes! In this question, you will build a
neural network to find the k-th smallest number from a list using two different approaches: sorting
and counting (Optional). You will start by constructing a two-layer perceptron “Sort 2” to sort
two numbers and then use it as a building block to perform your favorite sorting algorithm (e.g.,
Bubble Sort, Merge Sort). Finally, you will output the k-th element from the sorted list as the final
answer.
Note: Before doing this problem, you need to have a basic understanding of the key components of
neural networks (e.g., weights, activation functions). The reading on multilayer perceptrons located
at https://uoft-csc413.github.io/2022/assets/readings/L02a.pdf may be useful.
1.1 Sort two numbers [1pt]
In this problem, you need to find a set of weights and biases for a two-layer perceptron “Sort 2” that
sorts two numbers. The network takes a pair of numbers (x1, x2) as input and outputs a sorted pair
(y1, y2), where y1 ≤ y2. You may assume the two numbers are distinct and positive for simplicity.
You will use the following architecture:
Please specify the weights and activation functions for your network. Your answer should include:
• Two weight matrices: $W^{(1)}, W^{(2)} \in \mathbb{R}^{2 \times 2}$
• Two bias vectors: $b^{(1)}, b^{(2)} \in \mathbb{R}^{2}$
• Two activation functions: $\phi^{(1)}(z), \phi^{(2)}(z)$
You do not need to show your work.
Hints: Sorting two numbers is equivalent to finding the min and max of two numbers.
$$
\max(x_1, x_2) = \frac{1}{2}(x_1 + x_2) + \frac{1}{2}|x_1 - x_2|, \qquad
\min(x_1, x_2) = \frac{1}{2}(x_1 + x_2) - \frac{1}{2}|x_1 - x_2|
$$
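As a sanity check, here is a minimal NumPy sketch of one possible “Sort 2” parameterization built directly from this hint. The specific weights and activations below are an assumption for illustration, not necessarily the intended answer.

```python
import numpy as np

# Candidate "Sort 2" weights (assumed, based on the min/max hint).
W1 = np.array([[1.0, -1.0],   # first hidden unit:  x1 - x2
               [1.0,  1.0]])  # second hidden unit: x1 + x2
b1 = np.zeros(2)
phi1 = np.abs                 # phi^(1)(z) = |z|, so h = (|x1 - x2|, x1 + x2)

W2 = np.array([[-0.5, 0.5],   # y1 = (h2 - h1)/2 = min(x1, x2)
               [ 0.5, 0.5]])  # y2 = (h1 + h2)/2 = max(x1, x2)
b2 = np.zeros(2)
phi2 = lambda z: z            # identity activation on the output layer

def sort2(x):
    """Two-layer 'Sort 2' block: returns (min(x1, x2), max(x1, x2))."""
    h = phi1(W1 @ x + b1)
    return phi2(W2 @ h + b2)

print(sort2(np.array([5.0, 2.0])))  # [2. 5.]
print(sort2(np.array([1.0, 7.0])))  # [1. 7.]
```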
1.2 Perform Sort [1.5pt]
Draw a computation graph to show how to implement a sorting function $\hat{f} : \mathbb{R}^4 \to \mathbb{R}^4$, where
$\hat{f}(x_1, x_2, x_3, x_4) = (\hat{x}_1, \hat{x}_2, \hat{x}_3, \hat{x}_4)$ is $(x_1, x_2, x_3, x_4)$ in sorted order. Let us
assume $\hat{x}_1 \le \hat{x}_2 \le \hat{x}_3 \le \hat{x}_4$ and that $x_1, x_2, x_3, x_4$ are positive and distinct. Implement $\hat{f}$ using your
favourite sorting algorithm (e.g., Bubble Sort, Merge Sort). Let us denote the “Sort 2” module as
$S$; please complete the following computation graph. Your answer does not need to give labels
for the intermediate nodes, but make sure to index the “Sort 2” modules.
Hints: Bubble Sort needs 6 “Sort 2” blocks, while Merge Sort needs 5 “Sort 2” blocks.
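For concreteness, a minimal sketch of the six-block bubble-sort wiring. The “Sort 2” weights reuse the candidate parameterization sketched in Section 1.1, and the block schedule S1 to S6 below is one valid choice, not the required graph.

```python
import numpy as np

def sort2(x1, x2):
    # Candidate "Sort 2" block from Section 1.1 (assumed parameterization).
    h = np.abs(np.array([[1.0, -1.0], [1.0, 1.0]]) @ np.array([x1, x2]))
    y = np.array([[-0.5, 0.5], [0.5, 0.5]]) @ h
    return y[0], y[1]                 # (min, max)

def sort4_bubble(x1, x2, x3, x4):
    x1, x2 = sort2(x1, x2)            # S1: first pass pushes the max rightward
    x2, x3 = sort2(x2, x3)            # S2
    x3, x4 = sort2(x3, x4)            # S3: largest value now in position 4
    x1, x2 = sort2(x1, x2)            # S4: second pass
    x2, x3 = sort2(x2, x3)            # S5: second largest now in position 3
    x1, x2 = sort2(x1, x2)            # S6: final comparison
    return x1, x2, x3, x4

print(sort4_bubble(4.0, 1.0, 9.0, 2.0))   # (1.0, 2.0, 4.0, 9.0)
```

A Merge Sort wiring would instead sort the pairs (x1, x2) and (x3, x4) first and then merge the two sorted pairs with three more blocks, for five in total.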
1.3 Find the k-th smallest number [0pt]
Based on your sorting network, you may want to add a new layer to output your final result (the k-th
smallest number). Please give the weight $W^{(3)}$ for this output layer when k = 3.
Hints: $W^{(3)} \in \mathbb{R}^{1 \times 4}$.
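For illustration only, a tiny sketch of how such an output layer can act on the sorted vector. The particular one-hot row below is an assumption, not the official answer.

```python
import numpy as np

# Assumed output layer: a 1x4 weight row applied to the sorted vector
# (xhat1, ..., xhat4) picks out one entry; this choice selects xhat3.
W3 = np.array([[0.0, 0.0, 1.0, 0.0]])     # W3 in R^{1x4}

x_sorted = np.array([1.0, 2.0, 4.0, 9.0])
print(W3 @ x_sorted)                      # [4.]
```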
1.4 Counting Network [0pt]
The idea of using a counting network to find the k-th smallest number is to build a neural network
that can determine the rank of each number and output the number with the correct rank. Specifically,
the counting network counts how many elements in a list are less than a value of interest.
You will then apply the counting network to all numbers in the given list to determine their ranks.
Finally, you will use another layer to output the number with the correct rank.
The counting network has the following architecture, where y is the rank of x1 in a list containing
x1, x2, x3, x4.
Please specify the weights and activation functions for your counting network. Draw a diagram to
show how you will use the counting network, and give a set of weights and biases for the final layer
to find the k-th smallest number. In other words, repeat the process of Sections 1.1, 1.2, and 1.3 using
the counting idea.
Hints: You may find the following two activation functions useful.
1) Hard threshold activation function:
$$
\phi(z) = \mathbb{I}(z \ge 0) =
\begin{cases}
1 & \text{if } z \ge 0 \\
0 & \text{if } z < 0
\end{cases}
$$
2) Indicator activation function:
$$
\phi(z) = \mathbb{I}(z \in [-1, 1]) =
\begin{cases}
1 & \text{if } z \in [-1, 1] \\
0 & \text{otherwise}
\end{cases}
$$
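A minimal sketch of the counting idea using the hard-threshold activation. The wiring below only computes the rank of x1; the weights and the rank convention (1 = smallest) are assumptions for illustration, not the official network.

```python
import numpy as np

def hard_threshold(z):
    # phi(z) = I(z >= 0), applied elementwise.
    return (z >= 0).astype(float)

def rank_of_first(x):
    """Rank (1 = smallest) of x[0] among the four distinct entries of x."""
    # Hidden layer: each row computes x[0] - x[j] for one other entry j.
    W = np.array([[1.0, -1.0,  0.0,  0.0],
                  [1.0,  0.0, -1.0,  0.0],
                  [1.0,  0.0,  0.0, -1.0]])
    h = hard_threshold(W @ x)     # h_j = 1 iff x[0] > x[j]
    return 1.0 + h.sum()          # output layer: all-ones weights, bias 1

x = np.array([4.0, 1.0, 9.0, 2.0])
print(rank_of_first(x))           # 3.0, since 4 is the 3rd smallest
```

Applying the same block with the roles of the inputs permuted gives the rank of every xi; the indicator activation is then useful for turning “rank equals k” into a 0/1 gate in the final layer.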
2 Backpropagation
This question helps you to understand the underlying mechanism of back-propagation. You need
to have a clear understanding of what happens during the forward pass and backward pass and be
able to reason about the time complexity and space complexity of your neural network. Moreover,
you will learn a commonly used trick to compute the gradient norm efficiently without explicitly
writing down the whole Jacobian matrix.
Note: The reading on backpropagation located at https://uoft-csc413.github.io/2022/assets/readings/L02b.pdf may be useful for this question.
2.1 Automatic Differentiation
Consider a neural network defined with the following procedure:
$$
\begin{aligned}
z_1 &= W^{(1)}x + b^{(1)} \\
h_1 &= \mathrm{ReLU}(z_1) \\
z_2 &= W^{(2)}x + b^{(2)} \\
h_2 &= \sigma(z_2) \\
g &= h_1 \circ h_2 \\
y &= W^{(3)}g + W^{(4)}x \\
y' &= \mathrm{softmax}(y) \\
S &= \sum_{k=1}^{N} \mathbb{I}(t = k) \log(y'_k) \\
J &= -S
\end{aligned}
$$
for input x with class label t, where $\mathrm{ReLU}(z) = \max(z, 0)$ denotes the ReLU activation function,
$\sigma(z) = \frac{1}{1 + e^{-z}}$ denotes the sigmoid activation function, both applied elementwise, and
$\mathrm{softmax}(y) = \frac{\exp(y)}{\sum_{i=1}^{N} \exp(y_i)}$. Here, $\circ$ denotes element-wise multiplication.
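To make the procedure concrete, here is a minimal NumPy sketch of the forward pass. The layer sizes (3 inputs, 3 hidden units, N = 3 classes) and the random parameter values are assumptions made only so the example runs.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 3                                           # number of classes (assumed)
W1, W2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
W3, W4 = rng.normal(size=(N, 3)), rng.normal(size=(N, 3))
b1, b2 = rng.normal(size=3), rng.normal(size=3)

def forward(x, t):
    z1 = W1 @ x + b1
    h1 = np.maximum(z1, 0.0)                    # ReLU
    z2 = W2 @ x + b2
    h2 = 1.0 / (1.0 + np.exp(-z2))              # sigmoid
    g = h1 * h2                                 # elementwise product
    y = W3 @ g + W4 @ x
    y_prime = np.exp(y) / np.exp(y).sum()       # softmax
    J = -np.log(y_prime[t])                     # J = -S for one-hot label t
    return J

x = np.array([1.0, -2.0, 0.5])
print(forward(x, t=1))                          # t is the class label (0-indexed here)
```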
2.1.1 Computational Graph [0pt]
Draw the computation graph relating x, t, z1, z2, h1, h2, g, y, y′, S and J.
2.1.2 Backward Pass [1pt]
Derive the backprop equations for computing $\bar{x} = \left(\frac{\partial J}{\partial x}\right)^\top$, one variable at a time, similar to the
vectorized backward pass derived in Lec 2.
Hints: Be careful about the transposes and shapes! Assume all vectors (including the error vectors) are column vectors and that all Jacobian matrices adopt numerator-layout notation (see https://en.wikipedia.org/wiki/Matrix_calculus#Numerator-layout_notation). You can use $\mathrm{softmax}'(y)$
for the Jacobian matrix of softmax.
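One optional way to sanity-check your hand-derived equations is to implement the same forward pass in PyTorch and compare your $\bar{x}$ against autograd. The sizes and parameter values below are arbitrary assumptions; the check is not a substitute for the derivation.

```python
import torch

torch.manual_seed(0)
N = 3                                    # number of classes (assumed)
W1, W2 = torch.randn(3, 3), torch.randn(3, 3)
W3, W4 = torch.randn(N, 3), torch.randn(N, 3)
b1, b2 = torch.randn(3), torch.randn(3)

x = torch.randn(3, requires_grad=True)
t = 1                                    # class label (assumed, 0-indexed here)

z1 = W1 @ x + b1
h1 = torch.relu(z1)
z2 = W2 @ x + b2
h2 = torch.sigmoid(z2)
g = h1 * h2
y = W3 @ g + W4 @ x
y_prime = torch.softmax(y, dim=0)
J = -torch.log(y_prime[t])

J.backward()
print(x.grad)    # autograd's x_bar; should match your manual chain of equations
```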
2.2 Gradient Norm Computation
Many deep learning algorithms require you to compute the $L^2$ norm of the gradient of a loss function with respect to the model parameters for every example in a minibatch. Unfortunately, the
differentiation functionality provided by most software frameworks (TensorFlow, PyTorch) does not
support computing gradients for individual samples in a minibatch. Instead, they only give one
gradient per minibatch that aggregates the individual gradients for you. A naive way to get the per-example gradient norm is to use a batch size of 1 and repeat the back-propagation N times, where
N is the minibatch size. After that, you can compute the $L^2$ norm of each gradient vector. As
you can imagine, this approach is very inefficient: it cannot exploit the parallelism of minibatch
operations provided by the framework.
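For concreteness, a minimal PyTorch sketch of this naive loop; the model, loss, and data below are placeholder assumptions.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 2)                              # placeholder model
loss_fn = torch.nn.CrossEntropyLoss()
X, targets = torch.randn(8, 3), torch.randint(0, 2, (8,))  # minibatch, N = 8

per_example_norms = []
for i in range(X.shape[0]):                  # N separate backward passes
    model.zero_grad()
    loss = loss_fn(model(X[i:i + 1]), targets[i:i + 1])    # batch size 1
    loss.backward()
    sq = sum((p.grad ** 2).sum() for p in model.parameters())
    per_example_norms.append(sq.sqrt())      # L2 norm of this example's gradient
print(torch.stack(per_example_norms))
```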
In this question, we will investigate a more efficient way to compute the per-example gradient
norm and reason about its complexity compared to the naive method. For simplicity, let us consider
the following two-layer neural network.
$$
\begin{aligned}
z &= W^{(1)}x \\
h &= \mathrm{ReLU}(z) \\
y &= W^{(2)}h,
\end{aligned}
$$
where
$$
W^{(1)} = \begin{bmatrix} 1 & 2 & 1 \\ -2 & 1 & 0 \\ 1 & -2 & -1 \end{bmatrix}
\quad \text{and} \quad
W^{(2)} = \begin{bmatrix} -2 & 4 & 1 \\ 1 & -2 & -3 \\ -3 & 4 & 6 \end{bmatrix}.
$$
2.2.1 Naive Computation [1pt]
Let us assume the input $x = \begin{bmatrix} 1 & 3 & 1 \end{bmatrix}^\top$ and the error vector
$\bar{y} = \left(\frac{\partial J}{\partial y}\right)^\top = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}^\top$. In this question, write down the Jacobian matrices (numerical values) $\frac{\partial J}{\partial W^{(1)}}$ and $\frac{\partial J}{\partial W^{(2)}}$ using back-propagation.
Then, compute the square of the Frobenius norm of the two Jacobian matrices, $\|A\|_F^2$. The square of the
Frobenius norm of a matrix $A$ is defined as follows:
$$
\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 = \mathrm{trace}\left( A^\top A \right)
$$
Hints: Be careful about the transpose. Show all your work for partial marks.
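A hedged way to check your hand computation numerically (not a substitute for showing your work) is to backpropagate the given error vector with autograd. Depending on the layout convention, your written Jacobian may be the transpose of what autograd stores, but the Frobenius norm is unaffected.

```python
import torch

W1 = torch.tensor([[1., 2., 1.], [-2., 1., 0.], [1., -2., -1.]], requires_grad=True)
W2 = torch.tensor([[-2., 4., 1.], [1., -2., -3.], [-3., 4., 6.]], requires_grad=True)
x = torch.tensor([1., 3., 1.])
y_bar = torch.tensor([1., 1., 1.])       # given error vector dJ/dy

z = W1 @ x
h = torch.relu(z)
y = W2 @ h
y.backward(gradient=y_bar)               # injects dJ/dy = y_bar into backprop

print(W1.grad, (W1.grad ** 2).sum())     # dJ/dW1 and its squared Frobenius norm
print(W2.grad, (W2.grad ** 2).sum())     # dJ/dW2 and its squared Frobenius norm
```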
2.2.2 Efficient Computation [0.5pt]
Notice that the weight Jacobians can be expressed as outer products of the error vectors and the activations,
$\frac{\partial J}{\partial W^{(1)}} = (\bar{z}x^\top)^\top$ and $\frac{\partial J}{\partial W^{(2)}} = (\bar{y}h^\top)^\top$. We can compute the Jacobian norm more efficiently using
the following trick:
$$
\begin{aligned}
\left\| \frac{\partial J}{\partial W^{(1)}} \right\|_F^2
&= \mathrm{trace}\left( \left(\frac{\partial J}{\partial W^{(1)}}\right)^{\!\top} \frac{\partial J}{\partial W^{(1)}} \right) && \text{(Definition)} \\
&= \mathrm{trace}\left( \bar{z} x^\top x \bar{z}^\top \right) \\
&= \mathrm{trace}\left( x^\top x \, \bar{z}^\top \bar{z} \right) && \text{(Cyclic Property of Trace)} \\
&= \left( x^\top x \right) \left( \bar{z}^\top \bar{z} \right) && \text{(Scalar Multiplication)} \\
&= \| x \|_2^2 \, \| \bar{z} \|_2^2
\end{aligned}
$$
Compute the square of the Frobenius norm of the two Jacobian matrices by plugging the values
into the above trick.
Hints: Verify that the result matches the naive computation. Show all your work for partial marks.
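A minimal NumPy sketch of the trick applied to this network, using the values from Section 2.2.1: only the error vectors are backpropagated and no weight Jacobian is ever formed.

```python
import numpy as np

W1 = np.array([[1., 2., 1.], [-2., 1., 0.], [1., -2., -1.]])
W2 = np.array([[-2., 4., 1.], [1., -2., -3.], [-3., 4., 6.]])
x = np.array([1., 3., 1.])
y_bar = np.array([1., 1., 1.])           # given error vector dJ/dy

# Forward pass (activations only).
z = W1 @ x
h = np.maximum(z, 0.0)

# Backward pass for the error vectors only.
h_bar = W2.T @ y_bar
z_bar = h_bar * (z > 0)

# Squared Frobenius norms via the trace trick.
print(np.dot(x, x) * np.dot(z_bar, z_bar))   # ||dJ/dW1||_F^2, should match 2.2.1
print(np.dot(h, h) * np.dot(y_bar, y_bar))   # ||dJ/dW2||_F^2, should match 2.2.1
```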