CSC321 Programming Assignment 1: Loss Functions and Backprop


Category: You will Instantly receive a download link for .zip solution file upon Payment


5/5 - (2 votes)

This assignment is meant to get your feet wet with computing the gradients for a model using
backprop, and then translating your mathematical expressions into vectorized Python code. It’s
also meant to give you practice reasoning about the behavior of different loss functions.
We will consider a multilayer perceptron model with one hidden layer, similar to the ones from
r = W(1)x + b
(1) (1)
h = φ
(1)(r) (2)
z = w(2)>h + b
(2) (3)
y = φ
(2)(z) (4)
But we’ll explore another activation function we haven’t discussed yet, called softsign:
softsign(z) = z
1 + |z|
. (5)
This is another example of a sigmoidal function, somewhat similar to the logistic and tanh functions;
its range is [−1, 1]. We’ll consider softsign as the activation function for both the hidden units and
the output unit of the network, and see how it compares against alternatives.
In order to use softsign as our output activation function, we need to transform the output so
that the predictions are in [0, 1]:
s = softsign(z) (6)
y =
s + 1
. (7)
The task will be our running example from class — handwritten digit classification using the
MNIST dataset. We’ll use a subset of the data, and consider the binary classification task of
distinguishing 4’s and 9’s. To make things more interesting, the labels are noisy: 10% of the
training labels have been flipped.
CSC321 Winter 2017 Programming Assignment 1
Starter Code
The starter code contains four files:
•, the Python module which has full or partial implementations of all the models
considered in this assignment. This is the only file you will need to modify.
•, a file contining some helpful utilities (e.g. gradient checking). You don’t need to
use anything here directly, but you might read through it if you’re interested.
•, a Pickle file containing the training, validation, and test data.
– This contains the inputs and targets for the training set, validation set, and test set.
(We only use the training and validation data in this assignment.) For the training
targets, you have both the clean targets data[’t_train_clean’] and the noisy ones
data[’t_train_noisy’] (i.e. where 10% of the labels were flipped).
•, a partially trained model which we use to test your derivative calculations.
Before doing this assignment, you should read through In particular, observe a few
• This file includes two sets of classes: those implementing models, and those implementing
cost functions. The cost functions know how to compute their values and derivatives given
the predictions and targets, but are agnostic about how the predictions are made. Similarly,
the model classes know how to compute their predictions and backpropagate derivatives given
the loss derivatives. This provides modularity, since we can easily swap in different models
and cost functions.
• In addition to the model described above, we’ve also given you implementations of two activation/loss pairs from lecture (logistic activation with squared error, and logistic activation
with cross-entropy). We’ve also given you a linear model. You may like to refer to these
classes as a guide when doing your implementation.
• Even though the activation function for the output unit is technically part of the model, for
the implementation it is often more convenient to group it together with the loss function,
since that makes the computations simpler and more numerically stable. This is why the
model classes produce z rather than y, and the loss function classes take in z rather than y.
Notice how short the LogisticCrossEntropy implementation is!
• We are generous enough to provide you with a finite differences gradient checker. The function
check_derivatives checks your derivative calculations for both the model and the loss function. You might like to check out the implementations of the checking routines in
• The print_derivatives function is only there for marking purposes; more on this below.
• The train_model class trains the model using stochastic gradient descent.
CSC321 Winter 2017 Programming Assignment 1
Part 1: Implementation (5 points)
Your first task is to finish the implementation of the gradient computations for the multilayer
perceptron and softsign-with-cross-entropy loss function. In particular, you will need to write three
1. softsign_prime: this function should compute the derivative of the softsign function with
respect to its argument. It takes an array z and operates elementwise.
2. SoftsignCrossEntropy.derivatives: This class represents the softsign activation function
followed by cross-entropy loss. The method derivatives should compute the loss derivatives
with respect to z; it takes an array and operates elementwise.
3. MultilayerPerceptron.cost_derivatives: This class implements the MLP model; you
need to implement the backprop calculations for the derivatives. The method cost_derivatives
takes in a batch of inputs X, a dict of activations (see the method compute_activations),
and a vector dLdz containing the loss derivative with respect to z for each training example.
In order for us to evaluate the correctness of your implementation, please provide the output
of the function models.print_derivatives(), which prints out certain entries of the derivative
matrices and vectors. You must also submit your completed
In order to make your life easier, we have provided you with the function models.check_derivatives,
which checks your derivatives using finite differences. Since we are giving you a gradient checker,
you have no excuse for not getting full marks on this part!
Part 2: Conceptual Questions (5 points)
1. [2 points] Recall our discussion from Lecture 4 about how the logistic activation function
with squared error loss has a weak gradient signal when the model’s predictions are very
wrong, whereas the logistic activation function with cross-entropy loss has a strong gradient
Plot all three activation-function-loss functions as a function of z for the target t = 1, with z
ranging from [−5, 5]. Based on this plot, which of the loss functions do you think would cause
the training loss to decrease the fastest if you start from a very bad initialization? Which
would cause it to decrease the slowest?
Now train MLPs with each of the three loss functions, using the function train_from_scratch,
but setting the initial standard deviation of the parameters to 10. Do the results support your
prediction? You may want to repeat each run several times, since there is some variability
between runs.
2. [2 points] Now let’s consider training with the noisy labels; you can enable this by passing
the flag noisy=True to train_from_scratch. Different loss functions have differing degrees
of robustness to mislabeled examples. Based on your plot of the cost functions, which one
do you think will get the lowest training error, and which one will get the highest training
error? (I.e., which loss function will try hardest to nail every training example, including the
mislabeled ones, and which one is most willing to give up?)
Which one do you think will get the lowest validation error? (Note: there isn’t a clear-cut
answer to this question, so you might discuss some factors which would cause certain ones to
get higher or lower validation error.)
CSC321 Winter 2017 Programming Assignment 1
Now train MLPs with each of the three cost functions and see what happens. Were your
predictions correct? (Use the default standard deviation of 0.1 for the initialization, rather
than 10.) You may want to repeat each run several times, since there is some variability
between runs.
3. [1 point] There’s a separate optimization-related issue: sometimes hidden units can saturate
when their activations are very close to 0 or 1. To see why this happens, look at the backprop
equations you derived for your implementation; they include the derivative of the activation
function for the hidden units. When the inputs rk to the hidden unit activation function are
large in magnitude, you land in the flat region of the activation function, so the derivatives
are small in magnitude.
Let’s say you have a choice between using the logistic or softsign nonlinearity for the hidden
unit activations. Which one would be preferable from the perspective of getting saturated
units unstuck? I.e., if the input rk is large in magnitude, which activation function will result
in a stronger gradient signal?
You do not need to run any experiments for this part.
What you need to submit
• A PDF file a1-writeup.pdf, typeset using LATEX, containing your answers to the conceptual
questions as well as the output of print_derivatives.
• Your completed implementation of