$30.00
The following problems relate to RNA secondary structure prediction (aka “RNA folding”), as described in lecture
and in section 6.5 of the text. In short, you will implement Nussinov’s algorithm, together with its associate “traceback”
routine.
Input: You should have a subroutine named “Nussinov” that has a single parameter which is a string of letters
x1x2 . . . xn from the 4 letter alphabet {A,G,C,U} (all uppercase, for simplicity), as in line [1] below. Your main
program should read a sequence of lines from “standard input” each containing one such string, and call “Nussinov”
on each. Different lines may be of different lengths (different “n”). For timing purposes, you should also generate
random sequences of length, say, n = 2k
for 4 ≤ k ≤ 12 or more, with A,C,G,U equally likely. Generate at least
one sequence of each length, preferably several. Call “Nussinov” on each. Generating and processing these random
sequences should be optional, and by default this option should be “off” so that our hard-working TAs don’t have to
endure the wait time, but it should be obvious how to enable it. E.g., your main program might look like:
while(!end_of_file(STDIN)){
seq = read_a_line();
Nussinov(seq);
}
if(FALSE){
for(k=4; k<=12; k++){
Nussinov(random_seq(2ˆk))
Nussinov(random_seq(2ˆk))
Nussinov(random_seq(2ˆk))
}
}
Output: For each “Nussinov” call, print to standard out one line containing the input, like [1] below, a second
line containing (one of) its optimal structure(s), formatted as in [2] below, plus a third line giving (i) the length of the
input, (ii) the total number of pairs in that structure, and (iii) the time, in seconds (or fractions thereof) as in [3] (note
that the structure/count shown are not optimal for this sequence). (iv) Additionally, for a length n input, if n < 25,
print the n × n OPT matrix calculated by Nussinov’s algorithm; print one line per row with n white-space-separated
integer values per line, preferably keeping columns vertically aligned. (v) Follow all of this by one blank line. E.g.:
GCCCACCUUCGAAAAGACUGGAUGACCAUGGGCCAUGAUU [1]
((((....((.....)).(((....))).))))....... [2]
Length = 40, Pairs = 9, Time = 0.0001 sec [3]
[4]
I say “one of” since there may be different structures with equal numbers of pairs, often slight
variants of each other. Giving any one of them is OK. The structure line will be a string of
parens and dots, vertically aligned with the input string. A dot in the structure line means that the
corresponding position in the RNA is unpaired; a left paren means it is paired with a position to its
right, marked by a right paren. Furthermore, parens must be properly balanced/nested, so specific
paired positions are marked by “matching” left/right parens. sequence).
G
C
C
CAC
C
U
U G C A
A
A A
G
AC U
G
G
A U G
A C
C
A
U
G
G
G
C C
A
U
G
U A
U
Method: Use the Nussinov algorithm, described in section 6.5 of the text, and my slides. (Again note that the
conventions used in the book and the slides differ; clearly state which you are using.)
What You Need To Do:
1
1. [30 points] Implement the Nussinov algorithm for calculating OPT[i, j].
2. [20 points] Devise and implement an algorithm to construct and print the structure (i.e., the string of parens and
dots). This is a “traceback,” similar to ones we’ve seen with other dynamic programming algorithms. I strongly
recommend that you look for a recursive algorithm to do this, but it is not required. You may (or may not) find
it convenient to create auxiliary data structures while you’re building OPT to facilitate the traceback.
As stated above, print the input, with the structure aligned vertically below it, and also print the number of pairs.
Additionally, print the OPT matrix if n ≤ 25. All output should go to standard out.
3. [20 points] Write a description of your traceback algorithm, explaining how it works/why it is correct.
4. [10 points] Analyze (separately and collectively) the (big-O) run time of the algorithms in problems 1 and 2.
5. [10 points] Measure the actual run time of your algorithm (total time for both parts) on random RNA sequences
of length 20–2000, say, plot them on a graph (e.g., Excel might be convenient, but is not required), and discuss
how this compares to the theoretical performance predicted in problem 4. For some tips on how to do the timing,
see the FAQ page.
Test Cases: TBD
Language: You may use C, C++, C#, Haskell, Java, Lisp, ML, Perl, Python, R, or Ruby; talk to me before
beginning if you prefer something else.
What/How To Turn In: TBD
2
WhatsApp us