# CS 6501-005 Homework 02: Sequential Models


## Description

1. **Conditional Random Fields.** As we explained in class, one challenge in learning a conditional random field is computing the denominator of the following equation:

$$p(y \mid x) = \frac{\exp\left(\theta^{\top} f(x, y)\right)}{\sum_{y' \in \mathcal{Y}^{T}} \exp\left(\theta^{\top} f(x, y')\right)} \tag{1}$$

where $x = (x_1, \dots, x_T)$, in the case of POS tagging, is a sequence of words of length $T$, $y = (y_1, \dots, y_T)$ is the corresponding sequence of POS tags, and $\mathcal{Y}^{T}$ is the set of all possible POS sequences of length $T$. Assume the size of the pre-defined POS tag set is $K$. A simple brute-force algorithm for computing $\sum_{y' \in \mathcal{Y}^{T}} \exp\left(\theta^{\top} f(x, y')\right)$ has complexity $O(K^{T})$, which is infeasible in practice. On the other hand, with the decomposition of $f(x, y)$ as

$$f(x, y) = \sum_{i=1}^{T} f_i(x_i, y_i, y_{i-1}) + f_{T+1}(y_{T+1}, y_T), \tag{2}$$

where $y_0$ is the special start tag and $y_{T+1}$ is the special stop tag, we are able to design an algorithm with time complexity $O(T \cdot K^2)$. Note that in equation (2) on the lecture 08 slides, we ignore $f_{T+1}(y_{T+1}, y_T)$ for simplicity.
• (3 points) Please follow an idea similar to Viterbi decoding to design a dynamic programming algorithm that computes $\sum_{y' \in \mathcal{Y}^{T}} \exp\left(\theta^{\top} f(x, y')\right)$ with time complexity $O(T \cdot K^2)$. [Hint: the framework of your algorithm should look like Algorithm 11 on page 150 in [Eisenstein, 2018], without the back-tracing part.] A sketch of one possible recursion is given after this list.
• (1 point) Please justify that the time complexity of this algorithm is $O(T \cdot K^2)$.
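For reference, here is a minimal log-space sketch of the kind of recursion the hint points at, written in Python/NumPy (which the notes at the end permit). The function name `log_partition` and the way the local scores $\theta^{\top} f_i(x_i, y_i, y_{i-1})$ are packed into three arrays are illustrative assumptions, not part of the assignment:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(first_scores, trans_scores, stop_scores):
    """Compute log sum_{y' in Y^T} exp(theta . f(x, y')) by dynamic programming.

    first_scores: (K,)       theta . f_1(x_1, y_1, start) for every tag y_1
    trans_scores: (T-1, K, K) entry [i-2, k, j] = theta . f_i(x_i, y_i=k, y_{i-1}=j)
    stop_scores:  (K,)       theta . f_{T+1}(stop, y_T) for every tag y_T
    """
    # alpha[k] = log of the summed exp-scores of all tag prefixes ending in tag k
    alpha = first_scores.copy()
    for scores in trans_scores:          # one O(K^2) update per position i = 2..T
        # sum (in log space) over the previous tag j, for every current tag k
        alpha = logsumexp(alpha[None, :] + scores, axis=1)
    # fold in the stop transition and sum over the final tag y_T
    return logsumexp(alpha + stop_scores)
```

Replacing `logsumexp` with a max (plus back pointers) recovers Viterbi decoding, which is why the two algorithms share the same $O(T \cdot K^2)$ structure.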
2. **POS Tagging.** In the attachment, you can find three files for this part:
• trn.pos
• dev.pos
• tst.pos
In both trn.pos and dev.pos, each line is one sentence, and each token consists of a word and its POS tag, formatted as

word/POS

As you may notice, the POS tag set is much smaller than the conventional POS tag set in the Penn Treebank. There are only 10 tags in total:

$$\mathcal{Y} = \{A, C, D, M, N, O, P, R, V, W\}$$
Remember, there are also two special tags for POS tagging: the start tag and the stop tag. We are going to estimate both the transition and emission probabilities first, and then use them to build a POS tagger for the data in the dev set.
(a) Preprocessing. Scan the training examples in trn.pos and map the words with frequency less
than K = 2 into a special unknown token unk.
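A minimal preprocessing sketch, assuming each line of trn.pos is whitespace-tokenized and each token splits into word and tag on its last slash (the helper names and the literal `unk` spelling are placeholders):

```python
from collections import Counter

def read_tagged(path):
    """Read a .pos file into a list of sentences, each a list of (word, tag) pairs."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            pairs = [tok.rsplit("/", 1) for tok in line.split()]
            sentences.append([(word, tag) for word, tag in pairs])
    return sentences

def replace_rare(sentences, threshold=2, unk="unk"):
    """Map every word that appears fewer than `threshold` times to the unk token."""
    counts = Counter(word for sent in sentences for word, _ in sent)
    return [[(word if counts[word] >= threshold else unk, tag) for word, tag in sent]
            for sent in sentences]
```

Applying `replace_rare` to the training sentences also fixes the vocabulary $\mathcal{V}$ used in part (c).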
(b) From the training examples in trn.pos, estimate the transition probability as

$$p(y \mid y') = \frac{\#(y, y') + \alpha}{\#(y') + \alpha \cdot (|\mathcal{Y}| + 1)} \tag{3}$$

where $y$ and $y'$ can be any two possible tags in $\mathcal{Y}$, $|\mathcal{Y}|$ is the size of $\mathcal{Y}$, and $\alpha = 1$. Note that there are also two special cases, $p(y \mid \text{start})$ and $p(\text{stop} \mid y')$, that you need to consider. Make sure the $p(y \mid y')$ are valid probabilities, i.e.,

$$\sum_{y \in \mathcal{Y} \cup \{\text{stop}\}} p(y \mid y') = 1 \tag{4}$$

where $y' \in \mathcal{Y} \cup \{\text{start}\}$.
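As a sketch of equation (3), again assuming sentences stored as lists of (word, tag) pairs and using placeholder strings for the special start and stop tags:

```python
import math
from collections import Counter

def transition_log_probs(sentences, tagset, alpha=1.0, start="<s>", stop="</s>"):
    """Add-alpha estimate of log p(y | y'), following equation (3)."""
    bigram, unigram = Counter(), Counter()     # #(y, y') and #(y')
    for sent in sentences:
        tags = [start] + [tag for _, tag in sent] + [stop]
        for prev, cur in zip(tags, tags[1:]):
            bigram[(cur, prev)] += 1
            unigram[prev] += 1
    denom_smooth = alpha * (len(tagset) + 1)   # |Y| + 1 outcomes: every tag plus stop
    log_p = {}
    for prev in list(tagset) + [start]:
        for cur in list(tagset) + [stop]:
            p = (bigram[(cur, prev)] + alpha) / (unigram[prev] + denom_smooth)
            log_p[(cur, prev)] = math.log(p)
    return log_p
```

With this construction, the probabilities leaving any $y'$ (including the start tag) sum to one over $\mathcal{Y} \cup \{\text{stop}\}$, which is exactly the check in equation (4).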
(c) From the training examples in trn.pos, estimate the emission probability as

$$p(x \mid y) = \frac{\#(x, y) + \beta}{\#(y) + \beta \cdot |\mathcal{V}|} \tag{5}$$

where $y \in \mathcal{Y}$, $x \in \mathcal{V}$ is any word in the vocabulary $\mathcal{V}$, $|\mathcal{V}|$ is the size of $\mathcal{V}$, and $\beta = 1$. Again, make sure the $p(x \mid y)$ are valid probabilities, i.e.,

$$\sum_{x \in \mathcal{V}} p(x \mid y) = 1 \tag{6}$$
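A matching sketch of equation (5); the vocabulary is taken from the unk-mapped training data, so unknown dev/test words should also be replaced by the unk token before lookup:

```python
import math
from collections import Counter

def emission_log_probs(sentences, beta=1.0):
    """Add-beta estimate of log p(x | y), following equation (5)."""
    pair, tag_count = Counter(), Counter()     # #(x, y) and #(y)
    vocab = set()
    for sent in sentences:
        for word, tag in sent:
            pair[(word, tag)] += 1
            tag_count[tag] += 1
            vocab.add(word)
    log_p = {}
    for tag in tag_count:
        for word in vocab:
            p = (pair[(word, tag)] + beta) / (tag_count[tag] + beta * len(vocab))
            log_p[(word, tag)] = math.log(p)
    return log_p
```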
(d) (2 points) Convert the estimated probabilities from the previous steps into log space and implement the Viterbi decoding algorithm as in Algorithm 11 in [Eisenstein, 2018]. Report the accuracy of your decoder on the dev set dev.pos.
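One possible log-space Viterbi sketch, built on the hypothetical log-probability dictionaries from the sketches above (Algorithm 11 in [Eisenstein, 2018] is the authoritative reference; this is only one way to lay it out):

```python
def viterbi(words, tagset, trans_lp, emit_lp, start="<s>", stop="</s>"):
    """Return the most probable tag sequence for `words` under the HMM, in log space."""
    tags = list(tagset)
    # score[k] = best log probability of any tag prefix ending in tags[k]
    score = [trans_lp[(t, start)] + emit_lp[(words[0], t)] for t in tags]
    back = []                                    # back pointers, one row per position
    for word in words[1:]:
        new_score, pointers = [], []
        for t in tags:
            cand = [score[j] + trans_lp[(t, prev)] for j, prev in enumerate(tags)]
            j_best = max(range(len(tags)), key=cand.__getitem__)
            pointers.append(j_best)
            new_score.append(cand[j_best] + emit_lp[(word, t)])
        score, back = new_score, back + [pointers]
    # fold in the stop transition, then trace the back pointers
    j = max(range(len(tags)), key=lambda k: score[k] + trans_lp[(stop, tags[k])])
    path = [tags[j]]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(tags[j])
    return list(reversed(path))
```

Token-level accuracy is then the fraction of positions where the predicted tag matches the gold tag in dev.pos.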
(e) (2 points) Tune α and β to find the best accuracy on the dev set. Report the best accuracy
number and the values of α and β.
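Tuning can be a simple grid search; `accuracy_on_dev` below is a hypothetical helper that re-estimates the tables with the given smoothing values, runs Viterbi on dev.pos, and returns token-level accuracy:

```python
import itertools

# Hypothetical grid search over the smoothing hyperparameters.
grid = [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]
best_alpha, best_beta = max(itertools.product(grid, grid),
                            key=lambda ab: accuracy_on_dev(alpha=ab[0], beta=ab[1]))
```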
(f) (2 points) Run the model selected from the previous step on the test set tst.pos, and report
the accuracy.
Note that:
• Please implement your code from scratch. You are not allowed to use any existing machine
learning package to solve this problem — it’s OK to use NumPy and SciPy if you want.
• To avoid receiving zero points on this problem, please submit your code (including the best values of α and β) with the file name [ComputingID]-pos.py or [ComputingID]-pos.ipynb (if you use an IPython notebook).
## References

[Eisenstein, 2018] Eisenstein, J. (2018). *Natural Language Processing*. MIT Press.