## Description

1. (45 pts) Recall that the string alignment problem takes as input two strings x and y, composed of symbols x_i, y_j ∈ Σ, for a fixed symbol set Σ, and returns a minimal-cost set of edit operations for transforming the string x into string y.

Let x contain nx symbols, let y contain ny symbols, and let the set of edit operations be those defined in the lecture notes (substitution, insertion, deletion, and transposition).

Let the cost of indel be 1, the cost of swap be 13 (plus the cost of the two sub ops), and the cost of sub be 12, except when x_i = y_j, which is a “no-op” and has cost 0.

In this problem, we will implement and apply three functions.

(i) alignStrings(x,y) takes as input two ASCII strings x and y, and runs a dynamic programming algorithm to return the cost matrix S, which contains the optimal costs for all the subproblems for aligning these two strings.

    alignStrings(x,y) :                        // x,y are ASCII strings
        S = table of size (nx+1) by (ny+1)     // for memoizing the subproblem costs
        initialize S                           // fill in the base cases
        for i = 1 to nx {
            for j = 1 to ny {
                S[i,j] = cost(i,j)             // optimal cost for x[0..i] and y[0..j]
        }}
        return S
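As a rough illustration of how the pseudocode above might translate into code, here is a minimal Python sketch. The name `align_strings`, the cost parameters, and especially the treatment of transposition (two adjacent symbols of x are swapped, at the swap cost plus a sub or no-op for each of the two positions) are assumptions, not the lecture notes' definitions. The table has (nx+1) by (ny+1) entries so that row 0 and column 0 can hold the base cases.

```python
INDEL, SUB, SWAP = 1, 12, 13  # costs from the problem statement


def align_strings(x, y, indel=INDEL, sub=SUB, swap=SWAP):
    """Return the (len(x)+1) x (len(y)+1) table S of optimal alignment costs."""
    nx, ny = len(x), len(y)
    S = [[0] * (ny + 1) for _ in range(nx + 1)]
    # Base cases: aligning a prefix with the empty string costs one indel
    # per symbol.
    for i in range(1, nx + 1):
        S[i][0] = i * indel
    for j in range(1, ny + 1):
        S[0][j] = j * indel
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            match = x[i - 1] == y[j - 1]
            best = min(
                S[i - 1][j - 1] + (0 if match else sub),  # no-op or substitution
                S[i - 1][j] + indel,                      # delete x[i-1]
                S[i][j - 1] + indel,                      # insert y[j-1]
            )
            # ASSUMPTION: a transposition covers two adjacent symbols and costs
            # `swap` plus a sub (or no-op) for each of the two swapped positions.
            if i > 1 and j > 1:
                pairs = ((x[i - 1], y[j - 2]), (x[i - 2], y[j - 1]))
                extra = sum(0 if a == b else sub for a, b in pairs)
                best = min(best, S[i - 2][j - 2] + swap + extra)
            S[i][j] = best
    return S
```

Under these assumptions, `align_strings("APE", "STEP", 1, 1, 1)[-1][-1]` evaluates to 3 with unit costs, and the same call with the default costs evaluates to 5.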

(ii) extractAlignment(S,x,y) takes as input an optimal cost matrix S, strings x, y, and returns a vector a that represents an optimal sequence of edit operations to convert x into y. This optimal sequence is recovered by finding a path on the implicit DAG of decisions made by alignStrings to obtain the value S[nx, ny], starting from S[0, 0].

    extractAlignment(S,x,y) :                      // S is an optimal cost matrix from alignStrings
        initialize a                               // empty vector of edit operations
        [i,j] = [nx,ny]                            // initialize the search for a path to S[0,0]
        while i > 0 or j > 0 {
            a[i] = determineOptimalOp(S,i,j,x,y)   // what was an optimal choice?
            [i,j] = updateIndices(S,i,j,a)         // move to next position
        }
        return a

When storing the sequence of edit operations in a, use a special symbol to denote no-ops.
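The traceback described above could be sketched in Python roughly as follows. The op labels, the `"-"` no-op symbol, and the transposition rule (swap cost plus a sub or no-op for each swapped position) are assumptions made for illustration. Ties are broken uniformly among the moves that are optimal at the current cell, which is one possible reading of "uniformly at random".

```python
import random

INDEL, SUB, SWAP = 1, 12, 13  # costs from the problem statement
NO_OP = "-"                   # ASSUMPTION: special symbol chosen for a no-op


def extract_alignment(S, x, y, indel=INDEL, sub=SUB, swap=SWAP, rng=random):
    """Walk back from S[nx][ny] to S[0][0]; return the ops in left-to-right order."""
    def pair_cost(a, b):
        return 0 if a == b else sub

    ops = []
    i, j = len(x), len(y)
    while i > 0 or j > 0:
        # Collect every move (op, di, dj) whose cost explains S[i][j].
        choices = []
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + pair_cost(x[i - 1], y[j - 1]):
            choices.append((NO_OP if x[i - 1] == y[j - 1] else "sub", 1, 1))
        if i > 0 and S[i][j] == S[i - 1][j] + indel:
            choices.append(("delete", 1, 0))
        if j > 0 and S[i][j] == S[i][j - 1] + indel:
            choices.append(("insert", 0, 1))
        if i > 1 and j > 1:
            # ASSUMPTION: same transposition rule as used to fill S.
            c = swap + pair_cost(x[i - 1], y[j - 2]) + pair_cost(x[i - 2], y[j - 1])
            if S[i][j] == S[i - 2][j - 2] + c:
                choices.append(("swap", 2, 2))
        op, di, dj = rng.choice(choices)  # uniform choice among the optimal moves
        ops.append(op)
        i, j = i - di, j - dj
    ops.reverse()
    return ops
```

Because the walk runs from the end of the strings back to the start, the ops are collected in reverse and flipped once at the end, so the returned list reads in the order the edits apply to x.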


CSCI 3104

Problem Set 7

(iii) commonSubstrings(x,L,a) takes as input the ASCII string x, an integer 1 ≤ L ≤ nx, and an optimal sequence a of edits to x, which would transform x into y. This function returns each of the substrings of length at least L in x that aligns exactly, via a run of no-ops, to a substring in y.
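One way to picture this function is as a single left-to-right scan over a that tracks the current position in x. The sketch below assumes a particular encoding of the edit sequence (the op labels and the `"-"` no-op symbol are hypothetical, as is the table of how many symbols of x each op consumes); any consistent encoding would work the same way.

```python
NO_OP = "-"  # ASSUMPTION: special symbol chosen for a no-op

# ASSUMPTION: number of symbols of x that each edit operation consumes.
X_ADVANCE = {NO_OP: 1, "sub": 1, "delete": 1, "insert": 0, "swap": 2}


def common_substrings(x, L, a):
    """Return the substrings of x, of length >= L, covered by a run of no-ops in a."""
    found = []
    pos = 0           # index in x of the next unconsumed symbol
    run_start = None  # index in x where the current run of no-ops began
    for op in a:
        if op == NO_OP:
            if run_start is None:
                run_start = pos
        else:
            # Any other op (including an insert) ends the current run.
            if run_start is not None and pos - run_start >= L:
                found.append(x[run_start:pos])
            run_start = None
        pos += X_ADVANCE[op]
    # A run may extend to the end of the edit sequence.
    if run_start is not None and pos - run_start >= L:
        found.append(x[run_start:pos])
    return found
```

For example, `common_substrings("abcXdef", 3, ["-", "-", "-", "sub", "-", "-", "-"])` returns `["abc", "def"]`.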

(a) From scratch, implement the functions alignStrings, extractAlignment, and commonSubstrings. You may not use any library functions that make their implementation trivial. Within your implementation of extractAlignment, ties must be broken uniformly at random.

Submit (i) a paragraph for each function that explains how you implemented it (describe how it works and how it uses its data structures), and (ii) your code implementation, with code comments.

Hint: test your code by reproducing the APE / STEP and the EXPONENTIAL / POLYNOMIAL examples in the lecture notes (to do this exactly, you’ll need to use unit costs instead of the ones given above).

(b) Using asymptotic analysis, determine the running time of the call

    commonSubstrings(x, L, extractAlignment( alignStrings(x,y), x,y ) )

Justify your answer.

(c) (15 pts extra credit) Describe an algorithm for counting the number of optimal alignments, given an optimal cost matrix S. Prove that your algorithm is correct, and give its asymptotic running time.

Hint: Convert this problem into a form that allows us to apply an algorithm we’ve already seen.

(d) String alignment algorithms can be used to detect changes between different versions of the same document (as in version control systems) or to detect verbatim copying between different documents (as in plagiarism detection systems).

The two data string files for PS7 (see class Moodle) contain actual documents recently released by two independent organizations. Use your functions from (1a) to align the text of these two documents. Present the results of your analysis, including a reporting of all the substrings in x of length L = 9 or more that could have been taken from y, and briefly comment on whether these documents could be reasonably considered original works, under CU’s academic honesty policy.

2. (20 pts) Ron and Hermione are having a competition to see who can compute the nth Pell number P_n more quickly, without resorting to magic. Recall that the nth Pell number is defined as P_n = 2P_{n−1} + P_{n−2} for n > 1, with base cases P_0 = 0 and P_1 = 1. Ron opens with the classic recursive algorithm:


    Pell(n) :
        if n == 0 { return 0 }
        else if n == 1 { return 1 }
        else { return 2*Pell(n-1) + Pell(n-2) }

which he claims takes R(n) = R(n − 1) + R(n − 2) + c = O(φ^n) time.
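For reference, the base of the exponential in Ron's claim can be traced to the characteristic equation of the recurrence. The short derivation below is a sketch only; it assumes φ denotes the golden ratio, which the problem statement does not define, and it introduces the shifted function T(n) = R(n) + c purely for convenience.

```latex
\begin{align*}
R(n) &= R(n-1) + R(n-2) + c \\
T(n) &:= R(n) + c \;\Longrightarrow\; T(n) = T(n-1) + T(n-2) \\
r^2 &= r + 1 \;\Longrightarrow\; r = \frac{1 \pm \sqrt{5}}{2} \\
\varphi &= \frac{1 + \sqrt{5}}{2} \approx 1.618,
\qquad R(n) = T(n) - c = O(\varphi^n)
\end{align*}
```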

(a) Hermione counters with a dynamic programming approach that “memoizes” (a.k.a. memorizes) the intermediate Pell numbers by storing them in an array P[n]. She claims this allows an algorithm to compute larger Pell numbers more quickly, and writes down the following algorithm.¹

    MemPell(n) {
        if n == 0 { return 0 } else if n == 1 { return 1 }
        else {
            if (P[n] == undefined) { P[n] = 2*MemPell(n-1) + MemPell(n-2) }
            return P[n]
        }
    }

i. Describe the behavior of MemPell(n) in terms of a traversal of a computation tree. Describe how the array P is filled.

ii. Determine the asymptotic running time of MemPell. Prove your claim is correct by induction on the contents of the array.
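For experimenting with the questions above, MemPell can be transcribed into Python along these lines. This is a sketch under assumptions: `None` stands in for "undefined", the outer function plays the role of the allocating wrapper, and the table is given n + 1 entries so that index n is valid.

```python
def mem_pell(n):
    """Allocate the memo table, then run the memoized recursion on it."""
    P = [None] * (n + 1)  # ASSUMPTION: None plays the role of "undefined"

    def rec(k):
        if k == 0:
            return 0
        if k == 1:
            return 1
        if P[k] is None:
            # First visit to k: compute and store the value.
            P[k] = 2 * rec(k - 1) + rec(k - 2)
        return P[k]

    return rec(n)
```

Note that the recursion still descends to depth n on the first call, so in CPython an n near the default recursion limit (1000) would raise `RecursionError` unless the limit is raised.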

(b) Ron then claims that he can beat Hermione’s dynamic programming algorithm in both time and space with another dynamic programming algorithm, which eliminates the recursion completely and instead builds up directly to the final solution by filling the P array in order. Ron’s new algorithm² is

    DynPell(n) :
        P[0] = 0, P[1] = 1
        for i = 2 to n { P[i] = 2*P[i-1] + P[i-2] }
        return P[n]

Determine the time and space usage of DynPell(n). Justify your answers and compare them to the answers in part (2a).
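A direct Python rendering of DynPell might look like the following. It is only a sketch; the allocation line is an assumption that stands in for the wrapper function, and it reserves at least two entries so that the base cases fit even when n is 0.

```python
def dyn_pell(n):
    """Fill the table P[0..n] in increasing order and return P[n]."""
    # ASSUMPTION: allocation is done inline instead of in a separate wrapper.
    P = [0] * (max(n, 1) + 1)
    P[0], P[1] = 0, 1
    for i in range(2, n + 1):
        P[i] = 2 * P[i - 1] + P[i - 2]
    return P[n]
```

Unlike a recursive version, this loop uses no call stack beyond the single frame, which makes it convenient for trying larger values of n.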

¹ Ron briefly whines about Hermione’s P[n]=undefined trick (“an unallocated array!”), but she points out that MemPell(n) can simply be wrapped within a second function that first allocates an array of size n, initializes each entry to undefined, and then calls MemPell(n) as given.

² Ron is now using Hermione’s undefined array trick; assume he also uses her solution of wrapping this function within another that correctly allocates the array.


(c) With a gleam in her eye, Hermione tells Ron that she can do everything he can do better: she can compute the nth Pell number even faster because intermediate results do not need to be stored. Over Ron’s pathetic cries, Hermione says

    FasterPell(n) :
        a = 0, b = 1
        for i = 2 to n
            c = 2*a + b
            a = b
            b = c
        end
        return a

Ron giggles and says that Hermione has a bug in her algorithm. Determine the error, give its correction, and then determine the time and space usage of FasterPell(n). Justify your claims.
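A practical way to start hunting for the bug is to compare the pseudocode, transcribed exactly as written, against values computed straight from the definition. The harness below is a sketch; the function names are hypothetical, and it deliberately leaves the transcription uncorrected.

```python
def pell_reference(n):
    """P_n computed directly from the definition, used here as an oracle."""
    P = [0, 1]
    for i in range(2, n + 1):
        P.append(2 * P[i - 1] + P[i - 2])
    return P[n]


def faster_pell_as_written(n):
    """Line-by-line transcription of FasterPell(n), bug included."""
    a, b = 0, 1
    for i in range(2, n + 1):
        c = 2 * a + b
        a = b
        b = c
    return a


def mismatches(candidate, upto=10):
    """Return the inputs n <= upto where candidate(n) differs from the definition."""
    return [n for n in range(upto + 1) if candidate(n) != pell_reference(n)]
```

Calling `mismatches(faster_pell_as_written)` lists the small inputs on which the transcription disagrees with the definition; a corrected version should make that list empty.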

(d) In a table, list each of the four algorithms as columns and for each give its asymptotic time and space requirements, along with the implied or explicit data structures that each requires. Briefly discuss how these different approaches compare,

and where the improvements come from. (Hint: what data structure do all recursive algorithms implicitly use?)

(e) (5 pts extra credit) Implement FasterPell and then compute P_n where n is the four-digit number representing your MMDD birthday, and report the first five digits of P_n. Now, assuming that it takes one nanosecond per operation, estimate the number of years required to compute P_n using Ron’s classic recursive algorithm and compare that to the clock time required to compute P_n using FasterPell.
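For the estimate, one possible approach is to count the calls the classic recursion makes and convert that count to years. The sketch below rests on a simplifying assumption that is not in the problem statement: each recursive call is treated as a single one-nanosecond operation. The function names are likewise hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
NS_PER_SECOND = 10 ** 9


def recursive_calls(n):
    """Calls made by the classic recursion: C(n) = C(n-1) + C(n-2) + 1, C(0) = C(1) = 1."""
    prev, cur = 1, 1
    for _ in range(2, n + 1):
        prev, cur = cur, prev + cur + 1
    return cur


def years_at_one_ns_per_call(n):
    """Whole years needed if every call costs one nanosecond (ASSUMED cost model)."""
    return recursive_calls(n) // (NS_PER_SECOND * SECONDS_PER_YEAR)
```

Python integers have arbitrary precision, so the call count stays exact even for four-digit n, where a floating-point estimate would overflow.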
