Description
In this assignment, you will first use dynamic programming to solve optimal sequence alignment problems, both manually and by writing a computer program. You will then study ways to skip some entries in the dynamic programming table that would not affect the final results. Finally, you will go through some steps of FASTA on some example sequences. The bonus part is for those who enjoy challenges. The maximum score you can get from this assignment is capped at 100, but doing this bonus part would give you a better chance of getting a high score. Questions: In Questions 1-2, we will use a scoring scheme of having +2 scores for a match, -1 score for a mismatch, and -2 scores for an indel without applying affine gap penalty. 1. Fill in the following dynamic programming table to perform optimal global alignment between the sequences r=GTACC and s=CCTCC based on the scoring scheme stated above. Draw all arrows that lead to the score of each cell (i.e., only the “red arrows”). You can use the PowerPoint table template in the file CSCI3220_2018Fall_Assignment1_template.pptx if you want. Report the optimal alignment score and all alignments with this optimal score. When showing an alignment, indicate the names of the sequences clearly. (20%) s r G T A C C C T C C C Ø Ø Copyright © 2018 Kevin Yip Page 2 of 5 2. Repeat Question 1 but perform a local alignment instead of a global alignment. When showing an alignment, indicate the sequence names and the starting and ending positions of the aligned regions clearly. (20%) 3. Write a computer program in C, C++, Java or Python for performing optimal pairwise global alignments of DNA sequences using dynamic programming, without affine gap penalty. Your program should take the following inputs from stdin (not from a file) in the specified order, each occupying a different line: i. Score for a match (a positive integer) ii. Score for a mismatch (a negative integer) iii. Score for an indel (a negative integer) iv. First DNA sequence (a text string) v. Second DNA sequence (a text string) You do not need to check for errors in the inputs. Your program should output to the screen (i.e., stdout) the highest alignment score and all optimal alignments in the following format: [ ] The part in square brackets repeats for each optimal alignment. During the traceback, whenever a table entry has multiple incoming arrows, the horizontal one should be the first to traceback, s r G T A C C C T C C C Ø Ø Copyright © 2018 Kevin Yip Page 3 of 5 followed by the diagonal one, and finally the vertical one. The resulting optimal alignments in the output should also follow this order. The non-comment portion of your program is expected to contain no more than 150 lines of code. Here is a screen shot when a sample Java program called DP was run on an example from the lecture notes: >java DP 1 -1 -1 ATGCGT ACGGCGT 3 A_TGCGT ACGGCGT AT_GCGT ACGGCGT ATG_CGT ACGGCGT Your program will be graded based on i) whether it can be compiled/run successfully, ii) whether it follows the input/output formats as specified above, iii) its accuracy on a number of test cases and iv) whether the program is well-documented with appropriate comments added to explain the meaning of the code. In view of the large size of our class, for easy grading, you may get zero score if your program cannot be compiled or it does not conform to the input/output formats specified. Do not forget to use your program to check your answer to Question 1, and the examples in the lecture notes and tutorial notes. (30%) 4. In Question 1, we found the optimal global alignment(s) by filling in the whole dynamic programming table. However, given the specific scoring scheme we considered (+2 for a match, -1 for a mismatch and -2 for an indel), it is actually not necessary to fill in the whole table. (a) List all the entries in the dynamic programming table that can never be involved in an optimal global alignment between any two length-5 sequences given the above scoring scheme. Explain why these entries can never be involved in an optimal global alignment. (Hint: Consider what score each table entry needs to achieve in order to be involved in an optimal alignment, and whether the entry can actually achieve this score.) (14%) (b) Based on your answer in Part a, explain how the dynamic programming algorithm can be modified to skip the table entries that can never be involved in an optimal global alignment, assuming that the table is still a 6´6 two-dimensional array. Give as much specific detail as possible. (6%) (c) [Optional] Now we generalize the results in Part a to two sequences of any lengths |r|=n and |s|=m³n, and any reasonable scoring scheme without affine gap penalty. Specifically, Copyright © 2018 Kevin Yip Page 2 of 5 2. Repeat Question 1 but perform a local alignment instead of a global alignment. When showing an alignment, indicate the sequence names and the starting and ending positions of the aligned regions clearly. (20%) 3. Write a computer program in C, C++, Java or Python for performing optimal pairwise global alignments of DNA sequences using dynamic programming, without affine gap penalty. Your program should take the following inputs from stdin (not from a file) in the specified order, each occupying a different line: i. Score for a match (a positive integer) ii. Score for a mismatch (a negative integer) iii. Score for an indel (a negative integer) iv. First DNA sequence (a text string) v. Second DNA sequence (a text string) You do not need to check for errors in the inputs. Your program should output to the screen (i.e., stdout) the highest alignment score and all optimal alignments in the following format: [ ] The part in square brackets repeats for each optimal alignment. During the traceback, whenever a table entry has multiple incoming arrows, the horizontal one should be the first to traceback, s r Copyright © 2018 Kevin Yip Page 4 of 5 consider a scoring scheme with a match score of smatch>0, a mismatch score of smismatch<0, and an indel score of sindel_asmt1., where is either doc, docx or pdf. We prefer pdf files because it has better portability, which guarantees for instance the arrows are displayed at the correct locations no matter on which machine the file is opened. If you need to use a doc or docx file, include images for your dynamic programming tables rather than drawing objects (if you use the PowerPoint template, choose “Paste Special” ® “Picture (enhanced metafile)” or “PDF”). Both your written and source code files should contain the following header. Contact Kevin before submitting the assignment if you have anything unclear about the guidelines on academic honesty. CSCI3220 2018-19 First Term Assignment 1 I declare that the assignment here submitted is original except for source material explicitly acknowledged, and that the same or closely related material has not been previously submitted for another course. I also acknowledge that I am aware of University policy and regulations on honesty in academic work, and of the disciplinary guidelines and procedures applicable to breaches of such policy and regulations, as contained in the following websites. University Guideline on Academic Honesty: http://www.cuhk.edu.hk/policy/academichonesty/ Student Name: Student ID : Copyright © 2018 Kevin Yip Page 5 of 5 Finally, submit both your programming source code and your file for the non-programming questions in a single zip package named _asmt1.zip to Blackboard. Marking Scheme and Notes: 1. Remember to submit your assignment by 23:59pm of the due date. We may not accept late submissions. 2. For both the written part and the program, if you submit multiple times, ONLY the content and time-stamp of the latest one before the submission deadline will be considered. University Guideline for Plagiarism Please pay attention to the university policy and regulations on honesty in academic work, and the disciplinary guidelines and procedures applicable to breaches of such policy and regulations. Details can be found at http://www.cuhk.edu.hk/policy/academichonesty/. With each assignment, students will be required to submit a statement that they are aware of these policies, regulations, guidelines and procedures.