Description
Problem 1: Viterbi Algorithm [50 points]
We learned Hidden Markov Models (HMMs) as a generative model, and the Viterbi
Algorithm is used to compute the most likely configuration of the hidden variables.
HMMs are useful for various tasks such as Speech Recognition, Machine Translation, and
Signal Encoding, and consequently the Viterbi algorithm is a highly important technique
in Machine Learning. In this assignment, we will investigate a code translation problem
that can also be solved by the Viterbi algorithm.
At the beginning of the semester, Ian created a simple linguistic code that uses several Greek letters to encrypt his password safely. A few months later, he has forgotten the original English password and remembers only the Greek encryption. He believes that his encryption exploited strong similarities between the English and Greek alphabets, but that there was no deterministic mapping between English and Greek characters. We therefore hypothesize that sequences of Greek characters correspond to English words, and we plan to build an HMM to help him decipher his password.
Formally speaking, we represent a Greek word as x = (x_1, x_2, …, x_T), where x_t ∈ {α, τ, η, γ, ω} is the t-th character in the word (represented in Greek script). Given each Greek word, our goal is to predict the most probable English word y = (y_1, y_2, …, y_T), where y_t ∈ {a, i, p, s} is the English translation of the t-th character (represented in English script). The following tables give the transition and emission probabilities of our HMM.
  y_{t−1} \ y_t      a      i      p      s
  a                0.05   0.1    0.15   0.7
  i                0.1    0.05   0.25   0.6
  p                0.45   0.15   0.05   0.35
  s                0.35   0.2    0.15   0.3
  START            0.1    0.4    0.2    0.3

Table 1: Transition Probabilities P(y_t | y_{t−1}), where y_{t−1} indexes the rows and y_t the columns
  y_t \ x_t      α      τ      η      γ      ω
  a            0.4    0.2    0.1    0.2    0.1
  i            0.3    0.1    0.4    0.1    0.1
  p            0.1    0.1    0.1    0.2    0.5
  s            0.1    0.4    0.1    0.3    0.1

Table 2: Emission Probabilities P(x_t | y_t), where y_t indexes the rows and x_t the columns
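(Purely as an aside, and not part of any question below: one convenient way to store these two tables is as nested dictionaries, as in the minimal Python sketch here. The names TRANS and EMIT are our own choice, not given in the assignment; the row-sum checks simply confirm that each row is a valid probability distribution.)

```python
# Transition probabilities P(y_t | y_{t-1}); rows of Table 1 (including the START row).
TRANS = {
    'a':     {'a': 0.05, 'i': 0.10, 'p': 0.15, 's': 0.70},
    'i':     {'a': 0.10, 'i': 0.05, 'p': 0.25, 's': 0.60},
    'p':     {'a': 0.45, 'i': 0.15, 'p': 0.05, 's': 0.35},
    's':     {'a': 0.35, 'i': 0.20, 'p': 0.15, 's': 0.30},
    'START': {'a': 0.10, 'i': 0.40, 'p': 0.20, 's': 0.30},
}

# Emission probabilities P(x_t | y_t); rows of Table 2.
EMIT = {
    'a': {'α': 0.4, 'τ': 0.2, 'η': 0.1, 'γ': 0.2, 'ω': 0.1},
    'i': {'α': 0.3, 'τ': 0.1, 'η': 0.4, 'γ': 0.1, 'ω': 0.1},
    'p': {'α': 0.1, 'τ': 0.1, 'η': 0.1, 'γ': 0.2, 'ω': 0.5},
    's': {'α': 0.1, 'τ': 0.4, 'η': 0.1, 'γ': 0.3, 'ω': 0.1},
}

# Sanity check: every row of each table should sum to 1.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in TRANS.values())
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in EMIT.values())
```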
In HMMs, the transition probability P(y_t | y_{t−1}) gives the probability of the current English character given the previous English character, whereas the emission probability P(x_t | y_t) gives the probability of a Greek character given an English character. Based on these two model probabilities, we can compute the most likely English translation of each Greek word x = (x_1, x_2, …, x_T) via the following HMM formula:
argmax_{y_1,…,y_T} P(y_1, y_2, …, y_T | x_1, x_2, …, x_T) = argmax_{y_1,…,y_T} P(y_1) P(x_1 | y_1) ∏_{t=2}^{T} P(x_t | y_t) P(y_t | y_{t−1})
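(Again as an aside, not a required part of the assignment: the sketch below evaluates the right-hand side of the formula above for a single candidate English sequence y, using the TRANS and EMIT dictionaries from the earlier sketch and treating P(y_1) as the START row of Table 1. The function name score_translation is our own; the Viterbi algorithm itself is deliberately not shown here.)

```python
def score_translation(y, x):
    """Evaluate P(y_1) P(x_1 | y_1) · ∏_{t=2}^{T} P(x_t | y_t) P(y_t | y_{t−1})
    for one candidate English sequence y paired with a Greek word x of equal length."""
    prob = TRANS['START'][y[0]] * EMIT[y[0]][x[0]]        # P(y_1) P(x_1 | y_1)
    for t in range(1, len(x)):                            # remaining factors, t = 2 … T
        prob *= TRANS[y[t - 1]][y[t]] * EMIT[y[t]][x[t]]
    return prob

# Example with a Greek word that is NOT in the assignment: scoring the candidate "pa"
# for "αω" gives P(p) P(α|p) · P(a|p) P(ω|a) = 0.2 · 0.1 · 0.45 · 0.1 = 0.0009.
print(score_translation("pa", "αω"))   # 0.0009 (up to floating-point rounding)
```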
Using the above equation and the Viterbi algorithm, answer the following questions:
(a) [5 points] Let δ_{s,t} be the probability of the most probable English sequence corresponding to the first t Greek observations (x_1, x_2, …, x_t) that ends with the English character s. Write the recurrence relation for δ_{s,t} in terms of the model probabilities¹ and specify the initial conditions² for the 2D dynamic programming (s ∈ {a, i, p, s}, 1 ≤ t ≤ T).
¹ We have three probabilities: the transition, emission, and start probabilities given in Tables 1 & 2.
² To perform dynamic programming, you have to initialize the base cases at t = 1 for all s ∈ {a, i, p, s}.
(b) [25 points] Ian's encrypted password is "ωαγτ ητ γαω". What are the most likely English translations of each of the Greek encrypted words: 1) ητ, 2) ωηα, 3) ωαγτ? For each Greek word, all you have to do is fill in the 2D dynamic programming table³ whose entries are δ_{s,t}, specifying the backtracking path that corresponds to the most probable translation. What, then, is Ian's English password?
³ Conventionally, s varies along the rows and t varies along the columns of the table.
(c) [10 points] Suppose English has m characters in its alphabet and Greek has n. What is the running time of translating a Greek word of length T into an English word via the Viterbi algorithm? Compare it, in terms of big-O complexity, to the running time of a brute-force algorithm that naively considers all possible combinations. (You may assume that reading a model probability takes constant time.)
(d) [5 points] Given the transition and emission probabilities of a first-order HMM model, describe how to compute the probability of a Greek word x = (x_1, x_2, …, x_T), i.e., P(X = (x_1, x_2, …, x_T)). You have to clearly specify the equations you formulate for calculating this probability, as well as give a precise 2-4 sentence description of how your algorithm works. (Note: solutions with exponential time complexity will receive full credit, but a clear description of a sub-exponential-time algorithm will receive bonus points.)
(e) [5 points] We can similarly utilize HMMs to translate sentences of one language into
another. In this sentence-level translation, each character in our problem corresponds
to a word, and a sequence of characters corresponds to a sentence. How well would
this HMM-based approach perform compared to the state of the art (e.g., Google
Translate) for languages common today, say English and Spanish? Briefly explain
why you think so in 2-3 sentences.
Problem 2: Statistical Learning Theory [50 points]
For the questions below, clearly explain all steps in your derivations. You may use any
result proved in class.
(a) Find the tightest lower bound that you can on the VC dimension of each of the following hypothesis classes, and present the sets of points that can be shattered in each case (a small illustrative code sketch of the geometric classes appears after the list):
• Intervals of the reals For an instance space in R, let H be the set of all
intervals on the real line.
• Pairs of intervals of the reals For an instance space in R, let H be the set
of all pairs of intervals on the real line.
• Rectangular hypotheses centered at the origin For an instance space in R², let h ∈ H be a hypothesis: h(x) = 1{−a < x_1 < a, −b < x_2 < b} for a, b ∈ R.
• Rectangular hypotheses centered arbitrarily For an instance space in R², let h ∈ H be a hypothesis: h(x) = 1{a < x_1 < b, c < x_2 < d} for a, b, c, d ∈ R.
• Circular hypotheses centered arbitrarily For an instance space in R², let h ∈ H be a hypothesis: h(x) = 1{||x − c||_2 < r} for r ∈ R, c ∈ R².
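(The following is only the illustration referenced in (a) above: the three geometric hypothesis classes written as Python predicates, with parameter names mirroring the definitions. The function names are our own, and nothing here answers the VC-dimension questions.)

```python
import math

# Axis-aligned rectangle centered at the origin: h(x) = 1{-a < x1 < a, -b < x2 < b}
def origin_rectangle(a, b):
    return lambda x1, x2: -a < x1 < a and -b < x2 < b

# Axis-aligned rectangle placed arbitrarily: h(x) = 1{a < x1 < b, c < x2 < d}
def rectangle(a, b, c, d):
    return lambda x1, x2: a < x1 < b and c < x2 < d

# Circle with arbitrary center (c1, c2) and radius r: h(x) = 1{||x - c||_2 < r}
def circle(c1, c2, r):
    return lambda x1, x2: math.hypot(x1 - c1, x2 - c2) < r

# Example: the unit circle at the origin labels (0.5, 0.5) positive and (1, 1) negative.
h = circle(0.0, 0.0, 1.0)
assert h(0.5, 0.5) and not h(1.0, 1.0)
```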
(b) Let {0, 1}^n be the instance space of a binary classification task. This means that an instance x consists of n binary-valued features. Let H_n be the class of all boolean functions over the input domain. What are |H_n| and the VC dimension of H_n? Show all work leading to your answer.
(c) Now, suppose we are interested in a binary classification task, where our examples x are all 100-dimensional binary vectors of the form x = (x_1, x_2, …, x_{100}), where each x_i ∈ {0, 1}. As above, we want to learn a binary hypothesis h over the instances. Consider a "restricted" linear hypothesis h that can be described using a 100-dimensional weight vector w = (w_1, w_2, …, w_{100}) and a bias b, where all components of the weight vector are binary, i.e., ∀i : w_i ∈ {−1, 1}, and the bias is an integer between −15 and 15: b ∈ {−15, −14, …, 0, …, 14, 15}. The classification rule for classifying an example x is sign(w^T x + b).
The hypothesis space H consists of all such (w, b). Given a training set S with |S| = n examples and a hypothesis with 0 training error, give a bound on the prediction error of this hypothesis that holds with probability 1 − δ, in terms of δ and n.
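(As an illustration only, not an answer to (c): a sketch of the stated classification rule in Python. The function name predict is our own, and the convention sign(0) = +1 is an arbitrary choice since the problem does not specify how ties are broken.)

```python
def predict(w, b, x):
    """Classify x by sign(w^T x + b), where each w_i ∈ {−1, +1}, b is an
    integer in [−15, 15], and each feature x_i ∈ {0, 1}."""
    assert all(wi in (-1, 1) for wi in w) and -15 <= b <= 15
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1   # convention: sign(0) treated as +1

# Toy example with 4 features instead of 100, to keep the call readable:
# w = (1, -1, 1, -1), b = 0, x = (1, 1, 0, 0)  =>  w·x + b = 0, so predict returns +1.
print(predict((1, -1, 1, -1), 0, (1, 1, 0, 0)))   # 1
```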
(d) For H defined as above, let ĥ_S = argmin_{h∈H} Err_S(h). How large must the training set size |S| be for P(|Err_S(ĥ_S) − Err_P(ĥ_S)| ≥ 0.1) ≤ 0.1?