Programming Assignment 4 - Sequence Alignment

Due: April 18th at 5pm.

In this assignment you will implement the sequence alignment algorithm that was presented in class (Section 6.6 in the textbook) and use it to determine whether two given DNA sequences, one from the human genome and one from the mouse genome, are likely to be evolutionarily related.

Your code should be structured as a single Python file called HW4.py. We first describe the interface to the sequence alignment method.

>>> import HW4
>>> sequence1 = 'ATTAG'
>>> sequence2 = 'ATAG'
>>> alignment,score = HW4.align(sequence1, sequence2, mismatch_penalty, gap_penalty)
# your align method receives two sequences (strings), which you can
# assume are upper-case strings over the alphabet {A,C,G,T}.
# mismatch_penalty is the penalty assigned to a mismatch between
# two letters
# gap_penalty is the penalty assigned to a gap
# the return value is a tuple whose first element is the optimal
# alignment, and its second element is the score of the alignment.
# the alignment is formatted as a pair of strings:
>>> print(alignment)
('ATTAG', 'AT-AG')
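
The interface above can be illustrated with a minimal sketch of a dynamic-programming alignment. It rests on assumptions the handout does not state: the score is taken to be the total penalty (lower is better), a match costs 0, and every gap position costs gap_penalty. The penalty values in the usage line are hypothetical. When several alignments are equally good, the traceback below may return a different one than the example above (for instance 'A-TAG' rather than 'AT-AG').

```python
def align(sequence1, sequence2, mismatch_penalty, gap_penalty):
    """Return ((aligned1, aligned2), score) for a minimum-penalty alignment.

    Assumption: score = total penalty, matches cost 0, lower is better.
    """
    m, n = len(sequence1), len(sequence2)
    # cost[i][j] = minimum penalty for aligning sequence1[:i] with sequence2[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * gap_penalty
    for j in range(1, n + 1):
        cost[0][j] = j * gap_penalty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if sequence1[i - 1] == sequence2[j - 1] else mismatch_penalty
            cost[i][j] = min(cost[i - 1][j - 1] + sub,      # pair the two letters
                             cost[i - 1][j] + gap_penalty,  # gap in sequence2
                             cost[i][j - 1] + gap_penalty)  # gap in sequence1

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if sequence1[i - 1] == sequence2[j - 1] else mismatch_penalty
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                top.append(sequence1[i - 1])
                bottom.append(sequence2[j - 1])
                i -= 1
                j -= 1
                continue
        if j == 0 or (i > 0 and cost[i][j] == cost[i - 1][j] + gap_penalty):
            top.append(sequence1[i - 1])
            bottom.append('-')
            i -= 1
        else:
            top.append('-')
            bottom.append(sequence2[j - 1])
            j -= 1
    alignment = (''.join(reversed(top)), ''.join(reversed(bottom)))
    return alignment, cost[m][n]


# Hypothetical penalty values, chosen only for illustration.
alignment, score = align('ATTAG', 'ATAG', 1, 1)
```

The full table uses memory proportional to the product of the two sequence lengths, which is the simplest form that still allows the alignment itself to be recovered.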

To get a feel for how a bioinformatics analyst might use such a program, consider the following file, which contains two DNA sequences in a format known as FASTA: one from the human genome and one from the mouse genome. As a first step, align the two sequences using your align function. The question to ask at this point is whether the score it returned represents a real biological signal or could have arisen by chance. To distinguish between these two hypotheses, run the following experiment: take the human sequence and align it against 20 randomly generated sequences that have the same frequency of {A,C,G,T} as the mouse sequence. Then compare the average score obtained by aligning the human sequence against the random sequences to the score obtained by aligning it against the mouse sequence. Capture this procedure in a function called random_align, which has the following signature:

scores = HW4.random_align(sequence1, sequence2, mismatch_penalty, gap_penalty, num_trials)
# sequence1 takes the role of the human sequence from the example
# sequence2 takes the role of the mouse sequence
# num_trials is the number of random sequences to generate and align

Clarification: random_align must return the scores obtained in the alignments against the random sequences, i.e. a list of length num_trials. In that function you align sequence1 against random sequences that have the same length and the same A,C,G,T composition as sequence2.
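
One way to read the clarification is sketched below: shuffling the letters of sequence2 yields a random sequence with exactly the same length and letter counts. This is an interpretation, not the only valid one (sampling letters according to the frequencies in sequence2 is another). The helper _alignment_score is a hypothetical stand-in that computes only the penalty, so that the sketch runs on its own; in HW4.py the call would presumably go to align and use the second element of its return value. As before, the score is assumed to be a total penalty with matches costing 0.

```python
import random


def _alignment_score(s1, s2, mismatch_penalty, gap_penalty):
    # Hypothetical stand-in for align(...)[1]: keeps only two rows of the
    # dynamic-programming table and returns the minimum total penalty.
    prev = [j * gap_penalty for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [i * gap_penalty]
        for j in range(1, len(s2) + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else mismatch_penalty
            curr.append(min(prev[j - 1] + sub,
                            prev[j] + gap_penalty,
                            curr[j - 1] + gap_penalty))
        prev = curr
    return prev[-1]


def random_align(sequence1, sequence2, mismatch_penalty, gap_penalty, num_trials):
    """Align sequence1 against num_trials shuffled copies of sequence2."""
    letters = list(sequence2)
    scores = []
    for _ in range(num_trials):
        # A shuffle preserves both the length and the A,C,G,T counts.
        random.shuffle(letters)
        scores.append(_alignment_score(sequence1, ''.join(letters),
                                       mismatch_penalty, gap_penalty))
    return scores
```

The average of the returned list, sum(scores) / float(len(scores)), can then be compared with the score of the real human-mouse alignment.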

In the docstring of your module, report on the results of your experiment with the given sequences. Do you think the two sequences are related?
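
Running the experiment requires reading the two sequences from the FASTA file first. In FASTA, each record starts with a header line beginning with '>', followed by one or more lines of sequence. The sketch below is a minimal reader under that assumption; the function name read_fasta and the file name in the usage comment are hypothetical and not part of the required interface.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) pairs from an iterable of lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence data may span several lines; upper-case it so it
            # matches the alphabet that align expects.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


# Usage with a hypothetical file name:
# with open('sequences.fasta') as handle:
#     (_, human), (_, mouse) = read_fasta(handle)
```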