Description
1 Written Questions
Answer the following questions in the template provided. Then upload your solutions to Gradescope. You
may use LATEX or print the template and hand-write your answers then scan it in. Failure to use the template
may result in a penalty.
1.1 Reductions to Binary Classification
1. (1 point) Can we construct an ECOC matrix that yields the same algorithm as One vs. All classification? If yes, describe the matrix; if not, explain why not.
2. (1 point) Can we construct an ECOC matrix that yields the same algorithm as All vs. All classification?
If yes, describe the matrix; if not, explain why not.
3. (1 point) True or False: One-against-some classifier has logarithmic runtime in the number of examples.
True
False
4. (1 point) True or False: Extreme classification assumes a large label set.
True
False
5. (2 points) For a full binary classification tree of depth d, what is the probability of making a correct
classification on X~ . Assume that the probability of a binary classifier at any node making an error is .
(A full binary tree is one in which every node, except the leaves, has two children.)
6. Numerical answer: For k-way multi-classification with k = 16, how many classifiers need to be
constructed for…
(a) (1 point) …One vs. All classification?
2 of 16
Homework 1: Learning to Search 10-418 / 10-618
(b) (1 point) …All vs. All classification?
7. (1 point) For an arbitrary classification tree with k classes, what is the minimum number of binary
classifiers that could be needed?
8. (1 point) For an arbitrary classification tree with k classes, What is the maximum number of classifiers
that could be needed? (Assume that the sets of classes for each pair of siblings are mutually exclusive.)
3 of 16
Homework 1: Learning to Search 10-418 / 10-618
1.2 Learning to Search
In this section, we’ll consider a simple version of learning to search for the task of identifying names in a
text. In this setting our action space will be the set {+, −} where + denotes labeling a word as part of a
name and − denotes labeling a word as not being part of a name. Our state space will be represented by
Sz where z is a vector denoting the labeling of our sentence so far. For example, if we are in state S++−
it means we have labeled word 0 as +, word 1 as +, word 2 as −, and we are currently attempting to label
word 3.
Throughout this section, we will be referring exclusively to the following input sentence ~x and oracle labeling ~y:
In Figure 1.1 you can see a small part of the search space for this problem.
Figure 1.1
1. (1 point) How many nodes are in the search space for sentence ~x?
2. (1 point) How many nodes are in the search space for a sentence of length T?
4 of 16
Homework 1: Learning to Search 10-418 / 10-618
3. Suppose our loss function is Hamming loss.
(a) (1 point) What does the oracle policy return for S++−?
(b) (1 point) What does the oracle policy return for S−−−?
4. Suppose our loss function is Hamming loss with an additional error for including only part of a name,
e.g. omitting a person’s title. More precisely,
c(y, yˆ) = ”
T
X−1
t=0
I{yt 6= ˆyt}
#
+
”
T
X−1
t=1
I{yt−1 = +, yˆt−1 = −, yt = +, yˆt = +}
#
(a) (1 point) What does the oracle policy return for S−?
(b) (1 point) What does the oracle policy return for S+?
5. Suppose that our scoring function is of the form θ
T
f(xi
, yi) where
f(xi
, yi) =[I{xi starts with a capital letter and yi = +}
I{xi starts with a capital letter and yi = −},
I{ is in our gazetteer list and yi = +},
I{ is in our gazetteer list and yi = −},
I{yi−1 = −, yi = −}]
and θ is a vector of length 5. A gazetteer list is essentially a lookup table of entities that we know are
names. Our gazetteer list will include { Xavier, Mellon }.
Assume our initial weight vector θ = [−1, 1, −1, 1, 0].
5 of 16
Homework 1: Learning to Search 10-418 / 10-618
(a) (1 point) What score would our scoring function assign to the ground truth labeling?
(b) (1 point) What labeling would the greedy policy induced by this scoring function return? Break
any ties with +.
(c) (1 point) Suppose we use the Supervised Approach to Imitation Learning. Which (state, action)
pairs would be produced by the learning algorithm? Denote the action by either + or −. Denote
a state by the partial sequence it corresponds to (e.g. the state +− corresponds to a state that took
action + followed by action −).
(d) (1 point) Suppose we use DAgger, Which (state, action) pairs would be produced by the learning
algorithm? Use the same denotation of states and actions as in the previous question.
6 of 16
Homework 1: Learning to Search 10-418 / 10-618
We decide to train our linear model using the Perceptron update rule. That is, if the classifier (aka. greedy
policy) makes a positive mistake (i.e. a mistake where yi = +) in its action selection, then we add the
feature vector to the weights. If it makes a negative mistake (i.e. a mistake where yi = −), then we subtract
the feature vector from the weights. We treat the arrival of each (state, action) pair as a separate online
example for the classifier.
(e) (1 point) Using the (state, action) pairs produced by the Supervised Approach to Imitation Learning, what would the new value of θ be after completing the corresponding Perceptron updates?
(f) (1 point) Using the (state, action) pairs produced by DAgger, what would the new value of θ be
after completing the corresponding Perceptron updates?
7 of 16
Homework 1: Learning to Search 10-418 / 10-618
1.3 Recurrent Neural Network Language Models
In this section, we wish to use an Elman Network as a building block to design an RNN language model.
Below we define the Elman network recursively.
bt = relu(B
Tbt−1 + AT at + d)
ct = softmax(CTbt + e)
where at
, ∀t is given as input to the network; A, B, C, d, e are parameters of the network; and the initial
hidden units b0 are also treated as parameters. In this problem, we assume at
, bt
, ct ∈ R
2
for all t, i.e. all
vectors have length two, and that the parameters matrices and vectors are appropriately defined to preserve
those dimensions. Above, the function relu(·) is the Rectified Linear Unit function applied elementwise to
its vector-valued parameter. The function softmax(·) is likewise vector-valued and defined below. We use
[·]i
to explicitly denote the ith element of its argument.
[relu(v)]i , max(0, vi)
[softmax(v)]i ,
exp(vi)
PM
j=1 exp(vj )
Figure 1.2 depicts the Elman Network. The parameters are shown in red, but their connections into the
computation graph are not shown.
B
A
C
d b0 b1
a1
c1
b2
a2
c2
b3
a3
c3
…
e
Figure 1.2
Assume that we wish to build a language model p(y) of binary sequences y ∈ {+, −}L of fixed length L.
We have training data consisting of sequences D = {y
(1)
, . . . , y
(N)}, , where |y
(i)
| = L. Assume further
that we have pre-encoded the data as sequences of one-hot vectors D0 = {z
(1)
, . . . , z
(N)} such that:
if y
(i)
t = +, then z
(i)
t = [1, 0]T
,
if y
(i)
t = −, then z
(i)
t = [0, 1]T
.
For such a pair of vectors, we write that z
(i) = one-hot(y
(i)
)
8 of 16
Homework 1: Learning to Search 10-418 / 10-618
1. (1 point) Short answer: Since we wish to treat this Elman Network as an RNN-LM, how many inputs
at will we need for a single training example y ∈ D with |y| = L? Explain your answer.
2. (1 point) Select one: If we decide to train this RNN-LM with Teacher Forcing, we will need to compute
a loss function for an input example y ∈ D. Assuming so, how would we define at? Note: Be sure to
account for all t in your definition.
3. (1 point) Select one: If we decide to train this RNN-LM with Scheduled Sampling, we will need to
compute a loss function for an input example y ∈ D. Assuming we use a schedule that always selects
the model policy with probability 1, how would we define at? Note: Be sure to account for all t in your
definition.
4. (1 point) Write the cross entropy loss ` for a single training example z ∈ D0
in terms of the units and/or
parameters of the RNN-LM: at
, bt
, ct
, A, B, C, d, e.
Suppose we have parameter values as defined below:
A =
1 1
1 1
B =
0.5 0.5
0.5 0.5
C =
1 1
2 1
(1.1)
d =
0
0
e =
0
0
b0 =
0
0
(1.2)
5. Numerical answer: When computing the probability of the sequence y = [+, −, +], what is the value
of the following three quantities? Note: Round each numerical value to two significant figures.
(a) (1 point) b1,1 =
(b) (1 point) b2,1 =
9 of 16
Homework 1: Learning to Search 10-418 / 10-618
6. Numerical answer: When computing the probability of the sequence y = [+, −, +], what is the value
of the following three quantities? Note: Round each numerical value to two significant figures.
(a) (1 point) c1,1 =
(b) (1 point) c2,1 =
7. (1 point) Numerical answer: What is the probability of the sequence y = [+, −, +] according to this
RNNLM? Note: Round the numerical value to two significant figures.
p(y) =
8. (1 point) Numerical answer: What is the probability of the (length one!) sequence y
0 = [−] according
to this RNNLM? Note: Round the numerical value to two significant figures.
p(y
0
) =
10 of 16
Homework 1: Learning to Search 10-418 / 10-618
1.4 Empirical Questions
The following questions should be completed after you work through the programming portion of this
assignment (Section 2).
1. (10 points) Record your model’s performance on the test set after 10 epochs in terms of Cross Entropy
(CE) and Character Error Rate (CER) when trained with the following schemes. Note: Round each
numerical value to two significant figures.
Schedule CE CER
All Oracle
All Model
β = 0.75
Linear Decay
Exponential Decay
2. (10 points) Plot training and testing cross entropy curves for three different training procedures: All
Oracle, All Model, and the fixed β = 0.75 training procedure. Let the x-axis ranges over 10 epochs.
Note: Your plot must be machine generated.
11 of 16
Homework 1: Learning to Search 10-418 / 10-618
3. In class we saw that we can prove a no-regret bound for sequences of β such that
limn→∞
1
N
N
X−1
i=0
βi = 0
.
(a) (1 point) Show that a fixed β does not satisfy this condition.
(b) (1 point) Show that exponential decay does satisfy this condition.
(c) (1 point) Did this theoretical difference make a difference in practice? Briefly explain why or why
not with respect to this dataset and problem setting.
1.5 Wrap-up Questions
1. (1 point) Multiple Choice: Did you correctly submit your code to Autolab?
Yes
No
2. (1 point) Numerical answer: How many hours did you spend on this assignment?.
12 of 16
Homework 1: Learning to Search 10-418 / 10-618
2 Programming
Your goal in this assignment is to implement a deep learning model for acoustic speech recognition (ASR).
You will implement a function to encode speech data into a fixed length vector, a function to decode this
fixed length vector into a sequence of characters, and a variety of strategies to train the overall model.
Your solution for this section must be implemented in PyTorch using the data files we have provided to you.
This restriction is because we will be grading your code by hand to check your understanding as well as
your model’s performance.
2.1 Task Background
Acoustic speech recognition is the problem of taking in recordings of people talking and predicting what
was said. While many successful models try to predict linguistically meaningfully units like phonemes, we
will be directly predicting characters from waveforms for simplicity.
Though we will train our model with a standard classification loss (Cross-Entropy), what we really care
about is the accuracy of our transcription. For a given target sentence, we define the Character Error
Rate as the number of character deletions (d), character insertions (i), and character substitutions (s) need
to transform the output transcription to the goal transcription over the number of characters in the output
transcription (n).
CER =
(i + s + d)
n
(2.1)
Conveniently, this equation can be calculated with the following snippet of code, which runs a dynamic
programming algorithm to compute edit distance:
from nltk.metrics.distance import edit_distance
cer = edit_distance(our_string, goal_string) / len(our_string)
2.2 Data
In order to reduce your workload and computational requirements, we have preprocessed a subset of the
TIMIT dataset for you to use in this task.
The data is divided into two folders, train and test. In each folder is a sequence of pairs of numpy files (.npy)
and text files (.txt). If a numpy file is named ”xyz.npy”, it’s corresponding transcription can be found in
”xyz.txt”.
Given Input: Preprocessed audio files in numpy array of size 20 × L, where L is the number of timesteps
in the audio file.
Goal Output: Preprocessed text files of the transcribed audio.
For this section’s points, you will need to implement data loaders for PyTorch so that we can efficiently
train our model batch by batch. Note that because we are working with sequences, we will need all
sequences in a batch to have the same length.
(Hint: PyTorch allows you to pass a ”collate fn” to torch.utils.data.DataLoader and provides a function
called ”pad sequence” in torch.nn.utils.rnn)
13 of 16
Homework 1: Learning to Search 10-418 / 10-618
2.3 Seq2Seq Implementation
A sequence to sequence model (commonly abbreviated Seq2Seq) is a model that takes in a sequence of
input, like words in an English sentence or samples of a waveform, and outputs another sequence, like
words in Arabic or transcribed phonemes. Models of this type are frequently used in machine translation,
text to speech applications, and automatic speech recognition.
Seq2Seq models comprise of two parts, an encoder and a decoder. The encoder takes in a sequence of inputs
and outputs an encoding that represents all of the information contained in the input. The decode then takes
in this encoding and outputs a sequence in the new domain. This process can be seen in the figure below.
Figure 2.1: The encoder and decoder are trained jointly in a Seq2Seq model.
For this section, you must implement a working Seq2Seq model. Your implementation must have both
the encoder and decoder as single-layer LSTMs with 50% dropout applied to the input layer to the encoder.
Every hidden dimension in your neural networks should be 128 and your embedding size should be 256.
Set your optimizer to be Adam with the default PyTorch parameters. Your program should be able to run
quickly and easily on a laptop due to the simplicity of our model and the limited size of our data.
2.4 DAgger Implementation
As we’ve seen in class, DAgger is an algorithm for collecting training examples for our model by sometimes
following the actions performed by an expert and sometimes following our model’s decisions.
For this section’s points, you will need to implement DAgger as described in Algorithm 1. Note that this
algorithm allows βi
to vary with timestep i. This allows us to explore various schedules for the β parameter,
including some that don’t have the theoretical guarantees discussed in class.
14 of 16
Homework 1: Learning to Search 10-418 / 10-618
Please implement a general version of DAgger and then the code necessary to run DAgger with (1) a fixed β,
(2) a linearly decaying β where β is decreased by 0.05 after each epoch, and (3) an exponentially decaying
β, where β = exp −1 × i where i is the current epoch.
Algorithm 1 DAgger
Initialize D ← ∅.
Initialize πˆ1 to any policy in Π.
for i=1 to N do
Let πi = βiπ
∗ + (1 − βi)ˆπi
Sample T-step trajectories using πi
.
Get dataset Di = {s, π∗
(s)} of visited states by πi and actions given by the expert.
Aggregate datasets: D ← D ∪ Di
.
Train classifier πˆi+1 on D.
end for
return best πˆi on validation.
2.5 Autolab Submission
You must submit a .tar file named seq2seq.tar containing seq2seq.py, which contains all of your
code.
You can create that file by running:
tar -cvf seq2seq.tar seq2seq.py
from the directory containing your code.
Some additional tips: DO NOT compress your files; you are just creating a tarball. Do not use tar -czvf.
DO NOT put the above files in a folder and then tar the folder. Autolab is case sensitive, so observe that all
your files should be named in lowercase. You must submit this file to the corresponding homework link on
Autolab.
Your code will not be autograded on Autolab. Instead, we will grade your code by hand; that is, we will
read through your code in order to grade it. As such, please carefully identify major sections of the code via
comments.
15 of 16
Homework 1: Learning to Search 10-418 / 10-618
3 Collaboration Policy
After you have completed all other components of this assignment, report your answers to the collaboration
policy questions detailed in the Academic Integrity Policies for this course.
1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full
details including names of people who helped you and the exact nature of help you received.
2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details
including names of people you helped and the exact nature of help you offered.
3. Did you find or come across code that implements any part of this assignment? If so, include full
details including the source of the code and how you used it in the assignment.
16 of 16