Description
1. Conditional Random Fields

As we explained in class, one challenge in learning a conditional random field is computing the denominator of the following equation

    p(y | x) = \frac{\exp(\theta^\top f(x, y))}{\sum_{y' \in Y^T} \exp(\theta^\top f(x, y'))}    (1)
where x = (x_1, ..., x_T), in the case of POS tagging, is a sequence of words with length T, y = (y_1, ..., y_T) is the corresponding sequence of POS tags, and Y^T is the set of all possible POS sequences with length T. Assume the size of the pre-defined POS tag set is K; a simple brute-force algorithm for computing \sum_{y' \in Y^T} \exp(\theta^\top f(x, y')) has complexity O(K^T), which is almost impossible in practice. On the other hand, with the decomposition of f(x, y) as
    f(x, y) = \sum_{i=1}^{T} f_i(x_i, y_i, y_{i-1}) + f_{T+1}(y_{T+1}, y_T),    (2)

with y_0 = ⟨start⟩ and y_{T+1} = ⟨stop⟩ (the special start and stop tags), we are able to design an algorithm with time complexity O(T · K^2). Note that equation (2) on the lecture 08 slides ignores f_{T+1}(y_{T+1}, y_T) for simplicity.
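To make the brute-force cost concrete, here is a minimal Python sketch that literally enumerates all K^T tag sequences; score_seq is a hypothetical placeholder for a function returning \theta^\top f(x, y) for a complete sequence, not something provided with the assignment.

    import itertools
    import math

    def brute_force_denominator(x, tag_set, score_seq):
        # score_seq(x, y) is assumed to return the scalar theta^T f(x, y)
        # for a complete tag sequence y; it is a placeholder, not given here.
        T = len(x)
        total = 0.0
        # itertools.product enumerates all K^T candidate tag sequences,
        # which is exactly the exponential blow-up described above.
        for y in itertools.product(tag_set, repeat=T):
            total += math.exp(score_seq(x, y))
        return total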
• (3 points) Please follow an idea similar to Viterbi decoding to design a dynamic programming algorithm that can compute

    \sum_{y' \in Y^T} \exp(\theta^\top f(x, y'))

with time complexity O(T · K^2). [Hint: The framework of your algorithm should look like Algorithm 11 on page 150 of [Eisenstein, 2018], without the back-tracing part.]
• (1 point) Please justify that the time complexity of this algorithm is O(T · K^2).
2. POS Tagging

In the attachment, you can find three files for this part.
• trn.pos
• dev.pos
• tst.pos
In both trn.pos and dev.pos, each line is one sentence, and each token consists of a word and its
POS tag, as
word/POS
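For example, one simple way to read such a file is to split each line on whitespace and each token on its last "/"; this is only an illustrative sketch, and read_pos_file is a hypothetical helper name.

    def read_pos_file(path):
        # Each line is one sentence; each token looks like word/POS.
        sentences = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if not tokens:
                    continue
                # Split on the last "/" so words that contain "/" stay intact.
                pairs = [tok.rsplit("/", 1) for tok in tokens]
                sentences.append([(word, tag) for word, tag in pairs])
        return sentences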
As you may notice, the POS tag set is much smaller than the conventional POS tag set in the Penn
Treebank. There are only 10 tags in total:

    Y = {A, C, D, M, N, O, P, R, V, W}
Remember, there are also two special tags for POS tagging: ⟨start⟩ and ⟨stop⟩. We are going to estimate both transition and emission probabilities first, then use them to build a POS tagger for the data in the dev set.

Please follow the steps below to build your POS tagger; minimal illustrative code sketches for several of the steps are collected after the list.
(a) Preprocessing. Scan the training examples in trn.pos and map the words with frequency less
than K = 2 into a special unknown token unk.
(b) From the training examples in trn.pos, estimate the transition probability as

    p(y | y') = \frac{\#(y, y') + \alpha}{\#(y') + \alpha \cdot (|Y| + 1)}    (3)

where y and y' can be any two possible tags in Y, |Y| is the size of Y, and α = 1. Note that there are also two special cases, p(y | ⟨start⟩) and p(⟨stop⟩ | y'), that you need to consider. Make sure the p(y | y') are valid probabilities, such that

    \sum_{y \in Y ∪ {⟨stop⟩}} p(y | y') = 1    (4)

where y' ∈ Y ∪ {⟨start⟩}.
(c) From the training examples in trn.pos, estimate the emission probability as

    p(x | y) = \frac{\#(x, y) + \beta}{\#(y) + \beta \cdot |V|}    (5)

where y ∈ Y, x ∈ V is any word in the vocabulary V, |V| is the size of V, and β = 1. Again, make sure the p(x | y) are valid probabilities, such that

    \sum_{x \in V} p(x | y) = 1    (6)
(d) (2 points) Convert the estimated probabilities from the previous step into log space and implement the Viterbi decoding algorithm as in Algorithm 11 in [Eisenstein, 2018]. Report the accuracy of your decoder on the dev set dev.pos.
(e) (2 points) Tune α and β to find the best accuracy on the dev set. Report the best accuracy
number and the values of α and β.
(f) (2 points) Run the model selected from the previous step on the test set tst.pos, and report
the accuracy.
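For step (a), here is a minimal sketch of the preprocessing, assuming sentences are lists of (word, tag) pairs as produced by a reader like the one sketched earlier; names such as replace_rare_words are illustrative, not required.

    from collections import Counter

    def replace_rare_words(sentences, min_count=2, unk="unk"):
        # Count word frequencies over the training data.
        counts = Counter(word for sent in sentences for (word, _) in sent)
        # Map every word seen fewer than min_count times to the unk token.
        return [[(word if counts[word] >= min_count else unk, tag)
                 for (word, tag) in sent]
                for sent in sentences]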
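For steps (b) and (c), a hedged sketch of the smoothed counts behind equations (3) and (5); the "<s>" and "</s>" strings are arbitrary placeholders for the ⟨start⟩ and ⟨stop⟩ tags, and you can use any representation you like.

    from collections import Counter

    def estimate_transitions(sentences, tags, alpha=1.0, start="<s>", stop="</s>"):
        # sentences: lists of (word, tag) pairs; tags: the 10-tag set Y.
        pair_counts = Counter()  # #(y, y') for transitions y' -> y
        prev_counts = Counter()  # #(y') for the conditioning tag
        for sent in sentences:
            seq = [start] + [tag for (_, tag) in sent] + [stop]
            for prev, cur in zip(seq, seq[1:]):
                pair_counts[(cur, prev)] += 1
                prev_counts[prev] += 1
        denom = len(tags) + 1  # |Y| + 1, leaving room for the stop tag
        def p_trans(y, y_prev):
            # Smoothed transition probability, equation (3).
            return (pair_counts[(y, y_prev)] + alpha) / (prev_counts[y_prev] + alpha * denom)
        return p_trans

    def estimate_emissions(sentences, beta=1.0):
        word_tag_counts = Counter()  # #(x, y)
        tag_counts = Counter()       # #(y)
        vocab = set()
        for sent in sentences:
            for (word, tag) in sent:
                word_tag_counts[(word, tag)] += 1
                tag_counts[tag] += 1
                vocab.add(word)
        V = len(vocab)  # |V|
        def p_emit(x, y):
            # Smoothed emission probability, equation (5).
            return (word_tag_counts[(x, y)] + beta) / (tag_counts[y] + beta * V)
        return p_emit, vocab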
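For step (d), a log-space Viterbi sketch over the probability functions above; it follows the general shape of Algorithm 11 in [Eisenstein, 2018] but is only an orientation aid, not a substitute for the book's pseudocode.

    import math

    def viterbi(words, tags, p_trans, p_emit, start="<s>", stop="</s>"):
        # words: one preprocessed sentence (rare words already mapped to unk).
        T = len(words)
        # v[i][y]: best log score of any tag sequence ending with tag y at position i.
        v = [{y: math.log(p_trans(y, start)) + math.log(p_emit(words[0], y))
              for y in tags}]
        back = [{}]
        for i in range(1, T):
            v.append({})
            back.append({})
            for y in tags:
                best_prev = max(tags, key=lambda yp: v[i - 1][yp] + math.log(p_trans(y, yp)))
                v[i][y] = (v[i - 1][best_prev] + math.log(p_trans(y, best_prev))
                           + math.log(p_emit(words[i], y)))
                back[i][y] = best_prev
        # Score the transition into the stop tag, then trace back.
        last = max(tags, key=lambda y: v[T - 1][y] + math.log(p_trans(stop, y)))
        path = [last]
        for i in range(T - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))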
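For step (e), one simple option is a small grid search that re-estimates the probabilities and re-runs the decoder for each (α, β) pair; the candidate grids and the tag_accuracy helper below are placeholders.

    def tune(train_sents, dev_sents, tags,
             alphas=(0.1, 0.5, 1.0, 2.0), betas=(0.1, 0.5, 1.0, 2.0)):
        best = (None, None, -1.0)
        for a in alphas:
            for b in betas:
                p_trans = estimate_transitions(train_sents, tags, alpha=a)
                p_emit, _ = estimate_emissions(train_sents, beta=b)
                # tag_accuracy is a placeholder: run viterbi over dev_sents and
                # compare the predicted tags against the gold tags.
                acc = tag_accuracy(dev_sents, tags, p_trans, p_emit)
                if acc > best[2]:
                    best = (a, b, acc)
        return best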
Note that
• Please implement your code from scratch. You are not allowed to use any existing machine
learning package to solve this problem — it’s OK to use NumPy and SciPy if you want.
• To avoid getting zero points on this problem, please submit your code (including the best values of α and β) with the file name [ComputingID]-pos.py or [ComputingID]-pos.ipynb (if you use an IPython notebook).
References
[Eisenstein, 2018] Eisenstein, J. (2018). Natural Language Processing. MIT Press.