Description
Introduction
The goal of this coding assignment is to get you familiar with TensorFlow and to walk you through
some practical Deep Learning techniques. You will be starting with code that is similar
to the one taught in the first NLP Deep Learning class. Recall that the code we taught in class
implemented a 3-layer neural network over the document vectors. The output layer classified
each document as (positive/negative) and (truthful/deceptive). You will utilize the dataset
from coding assignments 1 and 2. In this assignment, you will:
• Improve the Tokenization.
• Convert the first layer into an Embedding Layer, which makes the model somewhat
more interpretable. Many recent Machine Learning efforts strive to make models more
interpretable, but sometimes at the expense of prediction accuracy.
• Increase the generalization accuracy of the model by implementing sparse input dropout
(TensorFlow's dense dropout layer does not work out-of-the-box, as explained later).
• Visualize the Learned Embeddings using t-SNE.
In order to start the assignment, please download the starter-code from:
• http://sami.haija.org/cs544/DL1/starter.py
You can run this code as:
python starter.py path/to/coding1/and/2/data/
Note: This assignment will automatically be graded by a script, which verifies
the implementation of tasks one-by-one. It is important that you stick to these
guidelines: Only implement your code in places marked by ** TASK. Do not
change the signatures of the methods tagged ** TASK, or else the grading script
will fail to find them and you will get a zero for the corresponding parts.
Otherwise, feel free to create as many helper functions as you wish!
Finally, you might find the first NLP Deep Learning lecture slides useful.
• This assignment is due Thursday April 4. We are working on Vocareum integration.
Nonetheless, you are advised to start early (before we finish the Vocareum Integration).
You can submit until April 7, but all submissions after April 4 will receive penalties.
[10 points] Task 1: Improve Tokenization
The current tokenization:
# ** TASK 1.
def Tokenize(comment):
  """Receives a string (comment) and returns array of tokens."""
  words = comment.split()
  return words
is crude: it splits on whitespace only (spaces, tabs, new-lines) and leaves all other punctuation
(e.g. single- and double-quotes, exclamation marks, etc.) attached to the tokens; there should be
no reason to have both terms "house" and "house?" in the vocabulary. While a perfect tokenization
can be quite involved, let us only slightly improve the existing one. Specifically, you should split
on any non-letter. You might find the Python standard re package useful.
• Update the code of Tokenize to work as described; a correct implementation should reduce
the number of tokens by about half. (An illustrative sketch of the re-based splitting follows.)
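For illustration only, here is a minimal sketch of splitting on non-letters with the standard re
module (the function name tokenize_sketch is illustrative, and details such as lower-casing are
left to you):

import re

def tokenize_sketch(comment):
  """Illustrative only: splits `comment` on any run of non-letter characters
  and discards the empty strings produced at the boundaries."""
  return [tok for tok in re.split(r'[^a-zA-Z]+', comment) if tok]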
[20 + 6.5 points] Task 2: Convert the 1st layer into an embedding layer
Our goal here is to replace the first layer with something equivalent to tf.nn.embedding_lookup
followed by averaging, but without using the function tf.nn.embedding_lookup, as we aim
to understand the underlying mathematics behind embeddings and we do not (yet) want to
discuss variable-length representations in TensorFlow [1].
The end goal of this task is to make the output of this layer represent every comment
(document) by the average embedding of the words appearing in the comment. Specifically,
suppose we represent a document by the vector $x \in \mathbb{R}^{|V|}$, with $|V|$ being the
size of the vocabulary and entry $x_i$ being the number of times word $i$ appears in the
document. Then we would like the output of the embedding layer for document $x$ to be:

    $\sigma\left(\frac{x^\top Y}{\|x\|}\right)$    (1)

where $\sigma$ is an element-wise activation function. We wish to train the embedding matrix
$Y \in \mathbb{R}^{|V| \times d}$, which will embed each word in a $d$-dimensional space (each
word embedding lives in one row of the matrix $Y$). The denominator $\|x\|$ computes the
average; it can be the L1 or the L2 norm of the vector. In this exercise, use the L2 norm.
The above should make our model more interpretable. Note the following differences between
the above embedding layer and a traditional fully-connected (FC) layer, whose transformation
is $\sigma(x^\top W + b)$:
1. FC layers have an additional bias vector $b$. We do not want the bias vector: its presence
makes the embeddings trickier to visualize or to port to other applications.
Here, $W$ corresponds to the embedding dictionary $Y$.

2. As mentioned, the input vector $x$ in Equation 1 should be normalized. If $x$ is a matrix,
then the normalization should be row-wise. (Hint: you can use tf.nn.l2_normalize.)

3. Modern fully-connected layers have $\sigma = \mathrm{ReLU}$. Embeddings generally have either
(1) no activation or (2) a squashing activation (e.g. tanh, or L2-norm). We will opt to
use (2), specifically the tanh activation, as option (1) might force us to choose an adaptive
learning rate [2] for the embedding layer.

4. The parameter $W$ would be L2-regularized in a standard FC layer, i.e. by adding
$\lambda \|W\|_2^2$ to the overall minimization objective (where the scalar coefficient
$\lambda$ is generally set to a small value such as 0.0001 or 0.00001). When training
embeddings, we only want to regularize the words that appear in the document, rather than
*all* embeddings at every optimization update step. Specifically, we want to replace the
standard L2 regularization with $\lambda \left\| \frac{x^\top Y}{\|x\|} \right\|_2^2$.
[1] Variable-length representations will likely be on the next coding assignment.
[2] Adaptive learning rates are incorporated in training algorithms such as AdaGrad and ADAM.
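To make Equation 1 concrete, here is a minimal NumPy reference sketch of the forward
computation (for checking your understanding only; it is not the TensorFlow code you will
write in FirstLayer, and the function and argument names are illustrative):

import numpy as np

def embedding_layer_reference(x_batch, Y):
  """Computes Equation 1 row-wise: tanh(x^T Y / ||x||_2).

  Args:
    x_batch: float array of shape (batch_size, vocab_size) with word counts.
    Y: float array of shape (vocab_size, d), the embedding dictionary.

  Returns:
    float array of shape (batch_size, d).
  """
  norms = np.linalg.norm(x_batch, ord=2, axis=1, keepdims=True)
  x_normalized = x_batch / np.maximum(norms, 1e-12)   # avoid division by zero
  return np.tanh(x_normalized @ Y)                    # sigma = tanh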
In this task, you will represent the embedding transformation using the fully-connected functionality; you must edit the code of the function FirstLayer. Here are your sub-tasks:
3 points Replace the ReLU activation with tanh.

4 points Remove the bias vector.

7 points Replace the L2-regularization of fully_connected with manual regularization. Specifically,
use tf.add_loss on $R(Y)$, choosing the regularizer from Bonus Part i. Unlike the bonus
questions, here you will let TensorFlow determine the gradient and the update rule. Hint:
tf.add_loss and the collection tf.GraphKeys.REGULARIZATION_LOSSES (a generic sketch appears
below, after the bonus items).

4 points Preprocess the layer input by L2-normalizing it, i.e. $x := \sigma(x)$ where
$\sigma(x) = \frac{x}{\|x\|_2}$ (e.g. via tf.nn.l2_normalize).

2 points Add Batch Normalization.
6.5 points Bonus: Work out the analytical expression of the gradient of the regularization $R(Y)$
with respect to the parameters $Y$. Provide a TensorFlow operator that carries out the
update by hand (without using automatic differentiation). The update should act as
$Y := Y - \eta \frac{\partial R(Y)}{\partial Y}$, where $\eta \in \mathbb{R}_{+}$ is the
learning rate. Zero credit will be given to all solutions utilizing tf.gradients(). However,
you are allowed to "test it locally" by comparing your expression with the output of
tf.gradients(), so long as you don't call the function (in)directly from the
Embedding*Update functions below.
3 points Part i: if $R(Y) = \lambda \left\| \frac{x^\top Y}{\|x\|} \right\|_2^2$. Implement it in EmbeddingL2RegularizationUpdate.

3.5 points Part ii: if $R(Y) = \lambda \left\| \frac{x^\top Y}{\|x\|} \right\|_1$. Implement it in EmbeddingL1RegularizationUpdate.
– PLEASE do not discuss the bonus questions on Piazza, with the TAs, or amongst
yourselves. You must be the sole author of the implementation. However, you are
allowed to discuss them after you submit, but only with those who have also submitted.
– Note: The functions Embedding*Update are not invoked in the code. That is
okay! Our grading script will invoke them to check for correctness.
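Regarding the tf.add_loss hint in the 7-point sub-task above, the following is a minimal,
generic TF 1.x sketch of registering a manual regularization term in the
tf.GraphKeys.REGULARIZATION_LOSSES collection (the vocabulary size 1000, embedding width 40,
and variable name 'Y_sketch' are illustrative; this is not the FirstLayer code you must write):

import tensorflow as tf

# Illustrative shapes only; the real tensors come from the starter code.
x = tf.placeholder(tf.float32, shape=[None, 1000])   # (batch_size, |V|) word counts
Y = tf.get_variable('Y_sketch', shape=[1000, 40])    # (|V|, d) embedding dictionary
lam = 1e-4

x_normalized = tf.nn.l2_normalize(x, 1)              # row-wise L2 normalization
reg = lam * tf.reduce_sum(tf.square(tf.matmul(x_normalized, Y)))

# Register the scalar so it is summed with the other regularization losses.
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, reg)

Losses registered this way are typically folded into the total training objective together with
everything else in that collection.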
The completion of the above tasks should successfully convert the first layer into an embedding
layer (Equation 1). Especially for the last sub-task, you might find the documentation of
tf.contrib.layers.fully_connected useful (an illustrative sketch of its relevant arguments follows).
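Purely as an illustration of the documented keyword arguments of tf.contrib.layers.fully_connected
that map onto some of the sub-tasks above (the wrapper name, the width 40, and the is_training
flag are assumptions for this sketch; it is not the required FirstLayer implementation and it does
not cover the manual-regularization sub-task):

import tensorflow as tf

def embedding_like_layer(net, is_training, width=40):
  """Sketch: a fully_connected layer wired with tanh, no bias, and batch norm."""
  return tf.contrib.layers.fully_connected(
      net, width,
      activation_fn=tf.nn.tanh,                      # tanh instead of the default ReLU
      biases_initializer=None,                       # removes the bias vector
      normalizer_fn=tf.contrib.layers.batch_norm,    # adds Batch Normalization
      normalizer_params={'is_training': is_training})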
[25 points] Task 3: Sparse Input Dropout
At this point, after improving the tokenization and converting the first layer into an embedding
layer, the model accuracy might have dropped... Do not worry! In fact, the model's "train"
accuracy has improved at this point, but we do not care about that! We only ever care
about the model's generalization capability, i.e. its performance on unseen test examples,
as we do not want it to over-fit (i.e. memorize) the training data while simultaneously
performing badly on test data.
Thankfully, we have Deep Learning techniques to improve generalization. Specifically, we
will be using Dropout.
Dropping-out document terms helps generalization. For example, if a document contains
terms “A B C D”, then in one training batch the document could look like “A B D”, and in
another, it could look like "A C D", and so on. This will essentially prevent our 3-layer neural
network from "memorizing" what the document looks like, as the document appears different every
time (there are exponentially many different configurations a document can appear in under
dropout, and all configurations are equally likely).
The issue is that we cannot use the TensorFlow dropout layer out-of-the-box, as it is designed
for dense vectors and matrices. Specifically, if we perform tf.contrib.layers.dropout on
the input data using:

net = tf.contrib.layers.dropout(x, keep_prob=0.5, is_training=is_training)

then TensorFlow will drop half of the entries in x. But this is almost useless, because
most entries of x are already zero (most words do not occur in most documents). We wish
to be efficient and drop out exactly the words that appear in the documents, rather than entries
that are already zero.
Thankfully, we have students to implement sparse-dropout for us! There are many possible
ways to implement sparse dropout. Your task is to:
• Please trace the usage of SparseDropout and fill its body. It currently reads as:
# ** TASK 3
def SparseDropout(slice_x, keep_prob=0.3):
  """Sets random (1 - keep_prob) non-zero elements of slice_x to zero.

  Args:
    slice_x: 2D numpy array (batch_size, vocab_size)

  Returns:
    2D numpy array (batch_size, vocab_size)
  """
  return slice_x
Use a vectorized implementation with numpy's advanced indexing (a generic sketch of the indexing
pattern is shown below). Slow solutions (i.e. using Python for-loops) will receive at most 15/25 points.
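For the advanced-indexing pattern referenced above, here is a generic sketch of zeroing a random
fraction of the non-zero entries of a dense numpy array (the function and argument names are
illustrative; adapting the idea to SparseDropout's exact contract is left to you):

import numpy as np

def drop_random_nonzeros(a, keep_prob=0.3):
  """Returns a copy of `a` with a random (1 - keep_prob) fraction of its
  non-zero entries set to zero, without any Python-level for-loops."""
  a = a.copy()
  rows, cols = np.nonzero(a)                       # indices of non-zero entries
  num_nonzero = rows.shape[0]
  num_drop = int(np.round((1.0 - keep_prob) * num_nonzero))
  drop_idx = np.random.choice(num_nonzero, size=num_drop, replace=False)
  a[rows[drop_idx], cols[drop_idx]] = 0            # advanced-indexing assignment
  return a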
[10 points] Task 4: Tracking Auto-Created TensorFlow Variables
TensorFlow is arguably the best abstraction for describing a graph of mathematical operations
and programming Machine Learning models, but it is not perfect. One weakness of
TensorFlow is that it does not provide easy access to variables that are automatically created
by the layers (e.g. the fully-connected layer). Oftentimes, one would like to grab a handle
on a specific variable in a specific layer, e.g. to visualize the embeddings, as we will do in the
next task.

To do this task, you will find the function tf.trainable_variables() helpful. Hint: you
can print the contents of tf.trainable_variables() in BuildInferenceNetwork, before
and after the call to FirstLayer (a small sketch of this is shown after the code snapshot below).
Your task is:
10 points Modify the code of BuildInferenceNetwork. In it, populate EMBEDDING_VAR to be a
reference to the tf.Variable that holds the embedding dictionary Y. A snapshot of the
code is here for your reference:
def BuildInferenceNetwork(x):
  """From a tensor x, runs the neural network forward to compute outputs.

  This essentially instantiates the network and all its parameters.

  Args:
    x: Tensor of shape (batch_size, vocab_size) which contains a sparse matrix
      where each row is a training example and contains counts of words in
      the document that are known by the vocabulary.

  Returns:
    Tensor of shape (batch_size, 2) where the 2 columns represent class
    memberships: one column discriminates between (negative and positive) and
    the other discriminates between (deceptive and truthful).
  """
  global EMBEDDING_VAR
  EMBEDDING_VAR = None  # ** TASK 4: Move and set appropriately.

  ## Build layers starting from input.
  net = x
  # ... continues to construct `net` layer-by-layer ...

Set EMBEDDING_VAR to a tf.Variable reference object. Keep the first line: global EMBEDDING_VAR.
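As a debugging aid for the hint above, here is a small sketch of diffing tf.trainable_variables()
around a layer-building call (the helper name and its usage are illustrative, not part of the
starter code):

import tensorflow as tf

def print_new_trainable_variables(build_layer_fn, net):
  """Applies `build_layer_fn` to `net` and prints the tf.Variables it created.

  For example, passing the starter code's FirstLayer here would reveal the
  name and shape of the auto-created embedding dictionary Y."""
  before = set(tf.trainable_variables())
  net = build_layer_fn(net)
  for v in set(tf.trainable_variables()) - before:
    print(v.name, v.shape)
  return net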
[25 points] Task 5: Visualizing the embedding layer
We want to visualize the embeddings learned by our Deep Network. The embedding layer
learns Y, a 40-dimensional embedding for each word in the vocabulary. You will project
the 40 dimensions onto 2 dimensions using sklearn's t-SNE. Rather than visualizing all the
words, we will choose 4 kinds of words: words indicating the positive class (shown in blue),
words indicating the negative class (shown in orange), words describing furniture (red), and
words describing location (green). Notice that the words that are useful for this classification
task occupy different parts of the embedding space: you can easily separate the orange and the
blue with a separating hyperplane. In contrast, words not indicative of the classes (e.g. furniture,
location) are not as well clustered [3].
Successfully visualizing the embeddings using t-SNE should look like this:
[Figure: 2-D t-SNE projection of selected word embeddings. Labeled words include positive-class
words such as "relaxing", "luxury", "superb", "gorgeous", and "spacious"; negative-class words
such as "dirty", "rude", "mediocre", "worst", and "terrible"; location words such as "blocks",
"avenue", "doorman", and "concierge"; and furniture words such as "table", "couch", "pillow",
and "bathroom".]
This is a fairly open-ended task, but there should be decent documentation in the TASK 5
functions that you should implement: ComputeTSNE and VisualizeTSNE (a minimal sklearn t-SNE
sketch follows below). Note: you must separately upload the PDF produced by VisualizeTSNE
onto Vocareum with the name tsne_embeds.pdf.
[3] In word2vec, which we will learn soon, the training is unsupervised as the document classes are
not known. As a result, all semantically similar words should cluster around one another, not just
the ones indicative of classes, since the classes are not present during training.
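For reference, a minimal sketch of the 40-to-2 dimensional projection with scikit-learn (the
function name and arguments are illustrative; ComputeTSNE and VisualizeTSNE should follow their
documentation in the starter code):

import numpy as np
from sklearn.manifold import TSNE

def project_embeddings_to_2d(embeddings):
  """Projects a (num_words, 40) embedding matrix onto 2 dimensions with t-SNE."""
  return TSNE(n_components=2).fit_transform(np.asarray(embeddings))

# The resulting (num_words, 2) points can then be scatter-plotted with the four
# word-group colors and saved as a PDF (e.g. via matplotlib's savefig) to produce
# the tsne_embeds.pdf deliverable.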