CMPSC 442: Homework 6


1. Hidden Markov Models [95 points]

In this section, you will develop a hidden Markov model for part-of-speech (POS) tagging, using the
Brown corpus as training data. The tag set used in this assignment will be the universal POS tag set,
which is composed of the twelve POS tags NOUN (noun), VERB (verb), ADJ (adjective), ADV (adverb),
PRON (pronoun), DET (determiner or article), ADP (preposition or postposition), NUM (numeral),
CONJ (conjunction), PRT (particle), ‘.’ (punctuation mark), and X (other).
As in previous assignments, your use of external code should be limited to built-in Python modules; external packages such as NumPy and NLTK are not permitted.

1. [10 points] Write a function load_corpus(path) that loads the corpus at the given path and
returns it as a list of POS-tagged sentences. Each line in the file should be treated as a
separate sentence, where sentences consist of sequences of whitespace-separated strings of
the form “token=POS”. Your function should return a list of lists, with individual entries
being 2-tuples of the form (token, POS).
>>> c = load_corpus("brown_corpus.txt")
>>> c[1402]
[('It', 'PRON'), ('made', 'VERB'),
('him', 'PRON'), ('human', 'NOUN'),
('.', '.')]
>>> c = load_corpus("brown_corpus.txt")
>>> c[1799]
[('The', 'DET'), ('prospects', 'NOUN'),
('look', 'VERB'), ('great', 'ADJ'),
('.', '.')]
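One possible sketch of this function (the use of rsplit to split on the last "=" is a design choice of mine, not a requirement of the assignment):

```python
def load_corpus(path):
    """Load a corpus of one sentence per line, with tokens of the form token=POS.

    Returns a list of sentences, each a list of (token, POS) 2-tuples.
    """
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            pairs = line.split()
            if pairs:
                # rsplit on the last "=" keeps tokens that themselves contain "=".
                sentences.append([tuple(p.rsplit("=", 1)) for p in pairs])
    return sentences
```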

2. [20 points] In the Tagger class, write an initialization method
__init__(self, sentences) which takes a list of sentences in the form produced by
load_corpus(path) as input and initializes the internal variables needed for the POS tagger.
In particular, if { t1, t2, . . . , tn } denotes the set of tags and { w1, w2, . . . , wm } denotes the set
of tokens found in the input sentences, you should at minimum compute:

- The initial tag probabilities π(ti) for 1 ≤ i ≤ n, where π(ti) is the probability that a sentence begins with tag ti.
- The transition probabilities a(ti → tj) for 1 ≤ i, j ≤ n, where a(ti → tj) is the probability that tag tj occurs after tag ti.
- The emission probabilities b(ti → wj) for 1 ≤ i ≤ n and 1 ≤ j ≤ m, where b(ti → wj) is the probability that token wj is generated given tag ti.
It is imperative that you use Laplace smoothing where appropriate to ensure that your system
can handle novel inputs, but the exact manner in which this is done is left up to you as a
design decision. Your initialization method should take no more than a few seconds to
complete when given the full Brown corpus as input.
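A minimal sketch of such an initializer, assuming add-one smoothing on all three distributions and a reserved <UNK> token for unseen words (both are my design choices, since the assignment leaves smoothing details open):

```python
from collections import defaultdict

class Tagger:
    def __init__(self, sentences):
        # Collect the tag set and vocabulary; reserve <UNK> for unseen tokens.
        tags, vocab = set(), {"<UNK>"}
        for s in sentences:
            for w, t in s:
                tags.add(t)
                vocab.add(w)

        # Count sentence-initial tags, tag-to-tag transitions, and emissions.
        init_c = defaultdict(int)
        trans_c = defaultdict(lambda: defaultdict(int))
        emit_c = defaultdict(lambda: defaultdict(int))
        for s in sentences:
            init_c[s[0][1]] += 1
            for (_, t1), (_, t2) in zip(s, s[1:]):
                trans_c[t1][t2] += 1
            for w, t in s:
                emit_c[t][w] += 1

        # Add-one (Laplace) smoothing so novel sentence starts, transitions,
        # and tokens all receive nonzero probability.
        n, m = len(tags), len(vocab)
        start_total = sum(init_c.values())
        self.initial = {t: (init_c[t] + 1) / (start_total + n) for t in tags}
        self.transition, self.emission = {}, {}
        for t in tags:
            t_total = sum(trans_c[t].values())
            self.transition[t] = {t2: (trans_c[t][t2] + 1) / (t_total + n)
                                  for t2 in tags}
            e_total = sum(emit_c[t].values())
            self.emission[t] = {w: (emit_c[t][w] + 1) / (e_total + m)
                                for w in vocab}
```

Precomputing each denominator once per tag keeps initialization linear in the corpus size, which matters for the running-time requirement above.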

3. [25 points] In the Tagger class, write a method most_probable_tags(self, tokens)
which returns the list of the most probable tags corresponding to each input token. In
particular, the most probable tag for a token wj is defined to be the tag ti* with i* = argmaxi b(ti → wj).
>>> c = load_corpus("brown_corpus.txt")
>>> t = Tagger(c)
>>> t.most_probable_tags(
...     ["The", "man", "walks", "."])
['DET', 'NOUN', 'VERB', '.']
>>> c = load_corpus("brown_corpus.txt")
>>> t = Tagger(c)
>>> t.most_probable_tags(
...     ["The", "blue", "bird", "sings"])
['DET', 'ADJ', 'NOUN', 'VERB']
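The argmax above can be sketched as a standalone function; it assumes the emission table is a dict of per-tag token probabilities with a smoothed "<UNK>" entry (my convention from the previous part, not something the assignment mandates):

```python
def most_probable_tags(tokens, emission, unk="<UNK>"):
    # For each token, pick the tag maximizing the smoothed emission
    # probability; unseen tokens fall back to the tag's <UNK> mass.
    return [
        max(emission, key=lambda t: emission[t].get(w, emission[t][unk]))
        for w in tokens
    ]
```

Note that this ignores context entirely: each token is tagged independently, which is exactly why its output can differ from Viterbi decoding in the next part.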

4. [40 points] In the Tagger class, write a method viterbi_tags(self, tokens) which
returns the most probable tag sequence as found by Viterbi decoding. Recall from lecture that
Viterbi decoding is a modification of the Forward algorithm, adapted to find the path of
highest probability through the trellis graph containing all possible tag sequences.
Computation will likely proceed in two stages: you will first compute the probability of the
most likely tag sequence, and will then reconstruct the sequence which achieves that
probability from end to beginning by tracing backpointers.
>>> c = load_corpus("brown_corpus.txt")
>>> t = Tagger(c)
>>> s = "I am waiting to reply".split()
>>> t.most_probable_tags(s)
['PRON', 'VERB', 'VERB', 'PRT', 'NOUN']
>>> t.viterbi_tags(s)
['PRON', 'VERB', 'VERB', 'PRT', 'VERB']
>>> c = load_corpus("brown_corpus.txt")
>>> t = Tagger(c)
>>> s = "I saw the play".split()
>>> t.most_probable_tags(s)
['PRON', 'VERB', 'DET', 'VERB']
>>> t.viterbi_tags(s)
['PRON', 'VERB', 'DET', 'NOUN']
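The two-stage computation described above can be sketched as a standalone function over the smoothed π, a, and b tables (the dict-of-dicts layout and "<UNK>" fallback are my assumptions; working in log space is a common trick to avoid underflow on long sentences, not a requirement):

```python
import math

def viterbi_tags(tokens, tags, initial, transition, emission, unk="<UNK>"):
    # Log emission probability, with <UNK> fallback for unseen tokens.
    def emit(t, w):
        return math.log(emission[t].get(w, emission[t][unk]))

    # Stage 1: fill the trellis column by column, recording a backpointer
    # to the best predecessor tag for every cell.
    trellis = [{t: math.log(initial[t]) + emit(t, tokens[0]) for t in tags}]
    backptrs = []
    for w in tokens[1:]:
        prev, col, bp = trellis[-1], {}, {}
        for t in tags:
            best = max(tags, key=lambda s: prev[s] + math.log(transition[s][t]))
            col[t] = prev[best] + math.log(transition[best][t]) + emit(t, w)
            bp[t] = best
        trellis.append(col)
        backptrs.append(bp)

    # Stage 2: start from the best final tag and trace backpointers
    # from end to beginning, then reverse.
    path = [max(tags, key=lambda t: trellis[-1][t])]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    return path[::-1]
```

Replacing each `max` over predecessors with a sum would recover the Forward algorithm, which is the modification the prompt alludes to, in reverse.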

2. Feedback [5 points]

1. [1 point] Approximately how long did you spend on this assignment?

2. [2 points] Which aspects of this assignment did you find most challenging? Were there any
significant stumbling blocks?

3. [2 points] Which aspects of this assignment did you like? Is there anything you would have
changed?