Main.Assignment4 History
Changed line 41 from:
'clarification:' @@random_align@@ needs to return the scores you obtained in the alignments against random sequences, i.e. a list of length @@num_trials@@. In that function you will be aligning sequence1 against random sequences that are the same length and @@A,C,G,T@@ composition as sequence2.
to:
'''clarification:''' @@random_align@@ needs to return the scores you obtained in the alignments against random sequences, i.e. a list of length @@num_trials@@. In that function you will be aligning sequence1 against random sequences that are the same length and @@A,C,G,T@@ composition as sequence2.
Added lines 39-41:
'clarification:' @@random_align@@ needs to return the scores you obtained in the alignments against random sequences, i.e. a list of length @@num_trials@@. In that function you will be aligning sequence1 against random sequences that are the same length and @@A,C,G,T@@ composition as sequence2.
Changed line 3 from:
!!!Due: April 15th at 5pm.
to:
!!!Due: April 18th at 5pm.
Changed line 34 from:
scores = HW4.random_align(sequence1, sequence2, num_trials)
to:
scores = HW4.random_align(sequence1, sequence2, mismatch_penalty, gap_penalty, num_trials)
Changed lines 29-30 from:
As a first step, align the two sequences using your align method.
The question we should ask ourselves at this point is whether the score it returned represents a real biological signal or could have arisen by chance. To distinguish between the two hypotheses, run the following experiment: take the human sequence and align it against 20 sequences that were generated at random and have the same frequency of {A,C,G,T} as the mouse sequence. We ask you to capture this procedure using a function called @@random_align@@, which has the following signature:
to:
As a first step, align the two sequences using your @@align@@ function.
The question we should ask ourselves at this point is whether the score it returned represents a real biological signal or could have arisen by chance. To distinguish between the two hypotheses, run the following experiment: take the human sequence and align it against 20 sequences that were generated at random and have the same frequency of {A,C,G,T} as the mouse sequence. Compare the average score achieved by aligning the human sequence against random sequences to the score obtained by aligning it against the mouse sequence.
We ask you to capture this procedure using a function called @@random_align@@, which has the following signature:
Changed line 33 from:
scores = random_align(sequence1, sequence2, num_trials)
to:
scores = HW4.random_align(sequence1, sequence2, num_trials)
Added lines 28-39:
The file contains two sequences, one from the human genome, and another from the mouse genome.
As a first step, align the two sequences using your align method.
The question we should ask ourselves at this point is whether the score it returned represents a real biological signal or could have arisen by chance. To distinguish between the two hypotheses, run the following experiment: take the human sequence and align it against 20 sequences that were generated at random and have the same frequency of {A,C,G,T} as the mouse sequence. We ask you to capture this procedure using a function called @@random_align@@, which has the following signature:
(:source lang=python:)
scores = random_align(sequence1, sequence2, num_trials)
# sequence1 takes the role of the human sequence from the example
# sequence2 takes the role of the mouse sequence
# num_trials is the number of random sequences to generate and align
(:sourceend:)
In the docstring of your module report on the results of your experiment with the given sequences. Do you think the two sequences are related?
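The experiment described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the five-argument @@random_align@@ signature from the most recent revision of this page, it assumes the score is a total penalty (so lower is better), and it generates each random sequence by shuffling @@sequence2@@, which is one way to keep the same length and @@A,C,G,T@@ composition. The helpers @@alignment_score@@ and @@compare_to_random@@ are hypothetical names, with @@alignment_score@@ standing in for the score returned by your own @@align@@ function.

```python
import random


def alignment_score(s1, s2, mismatch_penalty, gap_penalty):
    """Hypothetical stand-in for the score part of HW4.align.

    Assumes the score is the minimum total penalty (matches cost 0).
    Keeps only two rows of the dynamic programming table.
    """
    prev = [j * gap_penalty for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [i * gap_penalty]
        for j in range(1, len(s2) + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else mismatch_penalty
            curr.append(min(prev[j - 1] + sub,           # match or mismatch
                            prev[j] + gap_penalty,       # gap in s2
                            curr[j - 1] + gap_penalty))  # gap in s1
        prev = curr
    return prev[-1]


def random_align(sequence1, sequence2, mismatch_penalty, gap_penalty, num_trials):
    """Return a list of num_trials scores of sequence1 against shuffled sequence2."""
    letters = list(sequence2)
    scores = []
    for _ in range(num_trials):
        # A permutation of sequence2 has exactly its length and composition.
        random.shuffle(letters)
        scores.append(alignment_score(sequence1, ''.join(letters),
                                      mismatch_penalty, gap_penalty))
    return scores


def compare_to_random(real_score, random_scores):
    """Return (mean random score, fraction of random scores <= real_score).

    Assumes lower scores are better, i.e. the score is a penalty.
    """
    n = float(len(random_scores))
    mean_random = sum(random_scores) / n
    as_good = sum(1 for s in random_scores if s <= real_score)
    return mean_random, as_good / n
```

If the real score is much lower than the average random score, and few or none of the random alignments do as well, that is some evidence the similarity did not arise by chance. With only 20 trials the estimate is coarse.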
Changed line 27 from:
To get a feel for how a bioinformatics analyst might use such a program, consider the following file, which contains two [[path:'../../data/sequences.fasta'|DNA sequences]] in a format known as [[http://en.wikipedia.org/wiki/FASTA_format | FASTA]].
to:
To get a feel for how a bioinformatics analyst might use such a program, consider the following file, which contains two [[Path:../../data/sequences.fasta|DNA sequences]] in a format known as [[http://en.wikipedia.org/wiki/FASTA_format | FASTA]].
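A minimal sketch of reading such a file, assuming the common FASTA layout in which each record starts with a @@>@@ header line followed by one or more sequence lines. The function name @@parse_fasta@@ is hypothetical and not part of the required @@HW4@@ interface, and the header names in the actual data file are not known here. The upper-casing is included only because the assignment assumes upper-case input.

```python
def parse_fasta(lines):
    """Parse an iterable of text lines into a list of (header, sequence) pairs.

    Hypothetical helper: works on an open file object or a list of strings.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records
```

Since the function accepts any iterable of lines, it could be called as @@parse_fasta(open(file_name))@@ on the downloaded file.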
Changed line 27 from:
To get a feel for how a bioinformatics analyst might use such a program, consider the following file, which contains two [[path:../../data/sequences.fasta|DNA sequences]] in a format known as [[http://en.wikipedia.org/wiki/FASTA_format | FASTA]].
to:
To get a feel for how a bioinformatics analyst might use such a program, consider the following file, which contains two [[path:'../../data/sequences.fasta'|DNA sequences]] in a format known as [[http://en.wikipedia.org/wiki/FASTA_format | FASTA]].
Changed lines 15-17 from:
# your align method receives two sequences (strings), which you can assume
# are upper-case strings over the alphabet {A,C,G,T}.
# mismatch_penalty is the penalty assigned to a mismatch between two letters
to:
# your align method receives two sequences (strings), which you can
# assume are upper-case strings over the alphabet {A,C,G,T}.
# mismatch_penalty is the penalty assigned to a mismatch between
# two letters
Changed lines 20-21 from:
# the return value is a tuple whose first element is the optimal alignment, and its
# second element is the score of the alignment.
to:
# the return value is a tuple whose first element is the optimal
# alignment, and its second element is the score of the alignment.
Added lines 26-27:
To get a feel for how a bioinformatics analyst might use such a program, consider the following file, which contains two [[path:../../data/sequences.fasta|DNA sequences]] in a format known as [[http://en.wikipedia.org/wiki/FASTA_format | FASTA]].
Changed lines 7-13 from:
Dijkstra's shortest path algorithm. Your implementation should use a heap-based priority queue to prioritize
Your code should be structured as a single python file called @@HW3.py@@. We will test your code by calling it as follows:
to:
Your code should be structured as a single python file called @@HW4.py@@.
First we describe the interface to the sequence alignment method.
Changed lines 11-20 from:
import HW3
# you should implement a method for loading a directed graph
# called load_directed_dot:
g = HW3.load_directed_dot(file_name)
# find a shortest path between start_node and end_node
path = HW3.dijkstra(g, start_node, end_node)
# note that you may use a dictionary to convert the names of
# start_node and end_node to their internal representation as
# node indices. This is the only exception to our "no dictionaries" rule!
# The return value should be a list of nodes that are on the shortest path.
to:
>>> import HW4
>>> sequence1 = 'ATTAG'
>>> sequence2 = 'ATAG'
>>> alignment,score = HW4.align(sequence1, sequence2, mismatch_penalty, gap_penalty)
# your align method receives two sequences (strings), which you can assume
# are upper-case strings over the alphabet {A,C,G,T}.
# mismatch_penalty is the penalty assigned to a mismatch between two letters
# gap_penalty is the penalty assigned to a gap
# the return value is a tuple whose first element is the optimal alignment, and its
# second element is the score of the alignment.
# the alignment is formatted as a pair of strings:
>>> print alignment
('ATTAG', 'AT-AG')
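One possible way to structure @@align@@ is sketched below. It is not a reference solution. It assumes the score is the total penalty to be minimized, with matching letters costing 0, which this page does not spell out, and it assumes integer penalties so that the equality tests in the traceback are exact. When several alignments are optimal, the traceback may return a different one from the example above.

```python
def align(sequence1, sequence2, mismatch_penalty, gap_penalty):
    """Global alignment by dynamic programming.

    Returns ((aligned1, aligned2), score), where score is the minimum
    total penalty (an assumption about how the score is defined).
    """
    n, m = len(sequence1), len(sequence2)

    def sub(i, j):
        # Cost of pairing sequence1[i-1] with sequence2[j-1].
        return 0 if sequence1[i - 1] == sequence2[j - 1] else mismatch_penalty

    # cost[i][j] = minimum penalty for aligning sequence1[:i] with sequence2[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        cost[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + sub(i, j),
                             cost[i - 1][j] + gap_penalty,
                             cost[i][j - 1] + gap_penalty)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub(i, j):
            top.append(sequence1[i - 1])
            bottom.append(sequence2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + gap_penalty):
            top.append(sequence1[i - 1])
            bottom.append('-')  # gap in sequence2
            i -= 1
        else:
            top.append('-')     # gap in sequence1
            bottom.append(sequence2[j - 1])
            j -= 1
    alignment = (''.join(reversed(top)), ''.join(reversed(bottom)))
    return alignment, cost[n][m]
```

The table has (n+1)(m+1) entries and each is filled in constant time, so the running time is proportional to the product of the two sequence lengths.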
Deleted lines 24-33:
The graph is loaded from a file in @@dot@@ format. In this case we are dealing with a directed graph so an edge is represented as @@node1 -> node2 weight@@, where @@weight@@ is the cost of the edge.
For example, the @@dot@@ formatted file corresponding to the graph in Figure 4.7 (page 139 in the book) is located [[http://www.cs.colostate.edu/~sutton/cs320/figure4_7.dot | here]] and is shown below.
%block text-align=center% %height=300px%http://www.cs.colostate.edu/~sutton/cs320/figure4_7.gif
In this case, the call @@HW3.dijkstra(g,'s','y')@@ would return the list @@['s','u','x','y'].@@
You will need to add a @@change_key@@ method to your heap implementation. Page 65 in the book has a useful hint; as we discussed in class, you will need to @@heapify_up@@ or @@heapify_down@@ after a change key operation, depending on whether the key was increased or decreased.
Added lines 1-35:
!!Programming Assignment 4 - Sequence Alignment
!!!Due: April 15th at 5pm.
In this assignment you will program the sequence alignment algorithm that was presented in class (Section 6.6 in the textbook) and apply it to try to determine whether two given DNA sequences, one from the human genome and another from the mouse genome, are likely to be evolutionarily related.
Dijkstra's shortest path algorithm. Your implementation should use a heap-based priority queue to prioritize the order of processing unexplored nodes in the graph; on a given graph with n nodes and m edges, the running time of your implementation of Dijkstra's algorithm should be O(m log n). We will look at the code to verify that your implementation satisfies this bound. So, as before, you can't use Python dictionaries in your graph class, except during reading the graph from the input file.
Your code should be structured as a single python file called @@HW3.py@@. We will test your code by calling it as follows:
(:source lang=python:)
import HW3
# you should implement a method for loading a directed graph
# called load_directed_dot:
g = HW3.load_directed_dot(file_name)
# find a shortest path between start_node and end_node
path = HW3.dijkstra(g, start_node, end_node)
# note that you may use a dictionary to convert the names of
# start_node and end_node to their internal representation as
# node indices. This is the only exception to our "no dictionaries" rule!
# The return value should be a list of nodes that are on the shortest path.
(:sourceend:)
The graph is loaded from a file in @@dot@@ format. In this case we are dealing with a directed graph so an edge is represented as @@node1 -> node2 weight@@, where @@weight@@ is the cost of the edge.
For example, the @@dot@@ formatted file corresponding to the graph in Figure 4.7 (page 139 in the book) is located [[http://www.cs.colostate.edu/~sutton/cs320/figure4_7.dot | here]] and is shown below.
%block text-align=center% %height=300px%http://www.cs.colostate.edu/~sutton/cs320/figure4_7.gif
In this case, the call @@HW3.dijkstra(g,'s','y')@@ would return the list @@['s','u','x','y'].@@
You will need to add a @@change_key@@ method to your heap implementation. Page 65 in the book has a useful hint; as we discussed in class, you will need to @@heapify_up@@ or @@heapify_down@@ after a change key operation, depending on whether the key was increased or decreased.
