## Description

In this assignment, you will use different ways to compute sequence similarity based on g-gapped kmers, to experience their relative. You will then perform some probability derivations using the three basic rules, and solve some generic problems. Finally, you will implement the forward algorithm for computing data likelihoods according to a first-order hidden Markov model, in a way that your program can handle any type of sequences. Questions: 1. This question is about g-gapped k-mers. Suppose we want to compute a similarity score between DNA sequences s1=GCGTCCGAC and s2=CGACGCGAC based on 1-gapped 3-mers (i.e., g=1, k=3). (a) How many different 1-gapped 3-mer patterns are there for DNA sequences? Show the steps in your calculation. (3%) (b) List all the 1-gapped 3-mers actually supported by s1 and s2 and their occurrence counts in the two sequences by filling in the following tables (add more rows if needed). The tables should be sorted in ascending order of the 1-gapped 3-mers, where the wildcard character * is ordered before the four nucleotides. (12%) Table for s1 1-gapped 3-mer No. of occurrence Table for s2 1-gapped 3-mer No. of occurrence (c) Based on your tables in Part b, compute the similarity between the two sequences, defined as the inner product of the two 1-gap 3-mer occurrence vectors. Show clearly the common 1-gap 3-mers of the two sequences in your calculations. (3%) (d) Now complete the following table listing the number of mismatches among the 4-mers in the two sequences. The row and column headers should be the 4-mers present in s1 and s2, respectively, both sorted ascendingly. If a 4-mer appears multiple times in a sequence, repeat it that number of times in the row/column headers. (8%) s2 s1 Copyright © 2018 Kevin Yip Page 2 of 4 (e) Based on your result in Part d, compute the similarity score between s1 and s2 again. Show the steps in your calculation. Do not forget to compare with your result in Part c. (4%) (f) In practice, it is not easy to decide which values of g and k to use. One possible approach is to try multiple combinations of these values, and use some independent knowledge to evaluate which combination leads to the best similarity scores. Propose two ideas for reducing the total computational time as compared to repeating the above calculations for each combination of g and k values. (6%) 2. This question is about statistical modeling. Suppose Y is a binary variable, and we want to construct a model of Y based on binary features X1 and X2, which may not be conditionally independent. (a) Derive a formula for computing Pr(Y=0|X1=1,X2=1) in terms of Pr(X1=1), Pr(X1=1,X2=0) and Pr(X1=1,X2=1,Y=1). Show your derivations step by step. (6%) (b) [Optional] In general, with three variables X1, X2 and Y, there are various possible probabilities of the form Pr(VL=vl|VR=vr), where VL and VR are two disjoint ordered sets of variables (and VR can be empty), and vl and vr are their corresponding values. For example, in the case of Pr(X1=1), VL=(X1), vl=(1), and VR= Æ. In the case of Pr(Y=0|X1=1,X2=1), VL=(Y), vl=(0), VR=(X1,X2), and vr=(1,1). Propose an algorithm that can take one of these probabilities as target (such as Pr(Y=0|X1=1,X2=1)) and some of these probabilities as input (such as Pr(X1=1), Pr(X1=1,X2=0) and Pr(X1=1,X2=1,Y=1)), and determine whether the input is sufficient for inferring the value of the target. (bonus 10%) (c) Now assume X1 and X2 are independent when conditioned on Y. Give a formal definition of this assumption using mathematical symbols. (3%) (d) Depending on the values of X1, X2 and Y, there are 8 probabilities of the form Pr(X1,X2|Y). Given an example set of values for these 8 probabilities such that X1 and X2 are not independent when conditioned on Y. Prove that the two variables are not conditionally independent. (5%) (e) If two variables X1 and X2 are independent when conditioned on Y, is it guaranteed that they are also unconditionally independent? Explain why or why not. (5%) (f) If two variables X1 and X2 are unconditionally independent, is it guaranteed that they are also independent when conditioned on Y? Explain why or why not. (5%) 3. Write a computer program called Forward in C, C++, Java or Python that implements the forward algorithm to compute the data likelihood of a given sequence based on a first-order hidden Markov model. This program should be able to handle any number of states and any number of emission symbols. Your program should take the following inputs in the specified order, each on a different line: • The number of states (an integer) • The initial probabilities of the states (floating-point numbers, one per line), in the order of p1, p2, … Copyright © 2018 Kevin Yip Page 3 of 4 • The transition probabilities among the states (floating-point numbers, one per line), in the order of p11, p12, …, p21, p22, … • The number of emission symbols (an integer) • The emission symbols (characters, one per line), in the order of b1, b2, … • The emission probabilities (floating-point numbers, one per line), in the order of e1(b1), e1(b2), …, e2(b1), e2(b2), … • The sequence the likelihood of which is to be computed (a string) You can assume the input data are properly formatted and do not need to check for errors. The output should be a single floating-point number of the likelihood value. Due to the inexact nature of floating-point numbers, when we check your answers a small amount of leeway may be considered. The non-comment portion of your program is expected to contain no more than 200 lines of code. Here is an expected screen shot when a Java program is run on an example from the lecture notes: >java Forward 2 0.5 0.5 0.9 0.1 0.8 0.2 4 A C G T 0.5 0.5 0 0 0.25 0.75 0 0 CAC 0.15156250000000002 Your program will be graded based on i) whether it can be compiled/run successfully, ii) whether it follows the input/output formats as specified above, iii) its accuracy on a number of test cases and iv) whether the program is well-documented with appropriate comments added to explain the meaning of the code. (40%) Submission: For the programming question, you should first submit your program to our online judge system. Please see tutorial notes set 1 for explanations. Copyright © 2018 Kevin Yip Page 4 of 4 For the non-programming question, give all your answers in a single file named _asmt3., where is your student ID and is either doc, docx or pdf. We prefer pdf files because it has better portability. Then put your files for both the programming and non-programming questions in a zip file named _asmt3.zip and submit it to Blackboard. Both your written and source code files should contain the following header. Contact Kevin before submitting the assignment if you have anything unclear about the guidelines on academic honesty. CSCI3220 2018-19 First Term Assignment 3 I declare that the assignment here submitted is original except for source material explicitly acknowledged, and that the same or closely related material has not been previously submitted for another course. I also acknowledge that I am aware of University policy and regulations on honesty in academic work, and of the disciplinary guidelines and procedures applicable to breaches of such policy and regulations, as contained in the following websites. University Guideline on Academic Honesty: http://www.cuhk.edu.hk/policy/academichonesty/ Student Name: Student ID : Marking Scheme and Notes: 1. Remember to submit your assignment by 23:59pm of the due date. We may not accept late submissions. 2. For the written part, if you submit multiple times, ONLY the content and time-stamp of the latest one before the submission deadline will be considered. For the program, the version submitted to the online judge system with the highest score will be graded. University Guideline for Plagiarism Please pay attention to the university policy and regulations on honesty in academic work, and the disciplinary guidelines and procedures applicable to breaches of such policy and regulations. Details can be found at http://www.cuhk.edu.hk/policy/academichonesty/. With each assignment, students will be required to submit a statement that they are aware of these policies, regulations, guidelines and procedures.