Main.Assignment4 History

March 12, 2013, at 07:26 PM MST by 24.54.128.180 -
Changed lines 44-45 from:
to:

To convert the letters that encode quality scores to integers, use Python's ord function. For the reverse conversion, use chr.
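A minimal sketch of the two conversions. The offset of 64 is an assumption taken from the assignment text, which says that quality score 2 is encoded by ASCII 66 ("B"); the helper names char_to_quality and quality_to_char are hypothetical and not part of the assignment.

```python
# Assumption: quality score q is stored as the character with ASCII code q + 64,
# which matches the assignment's statement that quality 2 is encoded as "B".
PHRED_OFFSET = 64


def char_to_quality(ch):
    """Return the integer quality score encoded by a single character."""
    return ord(ch) - PHRED_OFFSET


def quality_to_char(q):
    """Return the character that encodes the integer quality score q."""
    return chr(q + PHRED_OFFSET)


print(char_to_quality("B"))
print(quality_to_char(3))
```

Other platforms use a different offset, so keeping the offset in one named constant makes it easy to change.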

Added line 48:
Changed line 52 from:
to:

While developing your code use this example.

March 11, 2013, at 10:16 AM MST by 129.82.44.223 -
Changed line 20 from:

wikipedia article

to:

wikipedia article).

March 11, 2013, at 10:15 AM MST by 129.82.44.223 -
Changed line 53 from:

In addition, each function should include a comment in triple quotes that explains what it does, what kind of input it expects, and what it returns (such comments are used as help messages).

to:

In addition, your function should include a comment in triple quotes that explains what it does, what kind of input it expects, and what it returns (such comments are used as help messages).

March 11, 2013, at 10:15 AM MST by 129.82.44.223 -
Changed lines 52-53 from:

Submit the programs using ramct. At the top of each file put a comment that identifies you and the program (use a multi-line comment using triple quotes).

to:

Submit the program using ramct. At the top of each file put a comment that identifies you and the program (use a multi-line comment using triple quotes).

March 11, 2013, at 10:14 AM MST by 129.82.44.223 -
Changed lines 3-4 from:

Due date: March xx.

to:

Due date: March 29th.

Changed lines 7-10 from:

Next generation sequencing data

Submit the programs using ramct. At the top of each file put a comment that identifies you and the program (use a multi-line comment using triple quotes).

to:

Next generation sequencing reads are available in a format called Fastq, which is similar to the Fasta format that you are already familiar with. Before mapping the reads in a Fastq file to a genome they are usually preprocessed, trimming the reads to eliminate reads or parts of reads that are low quality. Quality is determined by quality scores that are provided with the reads as described below. Your task for this assignment is to write a program that processes Fastq files as described below.

Before describing the assignment, here is a short description of the Fastq format (adapted from the Wikipedia article). A FASTQ file uses four lines per sequence:

  • Line 1 begins with a '@' character and is followed by a sequence identifier.
  • Line 2 is the raw sequence letters.
  • Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
  • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

A FASTQ file containing a single sequence might look like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
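The four-line layout can be read one record at a time. The following is a rough sketch, assuming every record occupies exactly four lines (no wrapped sequences); the function name read_fastq_records and the sample record are hypothetical.

```python
import io


def read_fastq_records(handle):
    """Yield (identifier, sequence, quality) tuples from an open FASTQ handle.

    Assumes each record is exactly four lines long.
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near %r" % header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ in %r" % header)
        yield header[1:], sequence, quality


# A made-up two-record file held in memory, for illustration only.
sample = io.StringIO("@read1\nACGT\n+\nhhBB\n@read2\nGG\n+read2\nhh\n")
for identifier, sequence, quality in read_fastq_records(sample):
    print(identifier, sequence, quality)
```

Writing the reader as a generator means a large file never has to be held in memory all at once.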

As mentioned above, the fourth line encodes the quality of each nucleotide in the sequence. The quality score represents the probability p that a base call is incorrect. It is computed according to the equation:

Q = -10 log (p),

where the logarithm is in base 10. The quality score is encoded using ASCII characters, where the ASCII value represents the quality score. The exact mapping from quality scores to ASCII characters varies somewhat between versions of the Illumina platform. In the data we will consider, a quality score of 2 is encoded by ASCII 66, which is "B" (quality scores 0 and 1 are not used), 3 is "C", and so on. The Illumina manual states that "If a read ends with a segment of mostly low quality (Q15 or below), then all of the quality values in the segment are replaced with a value of 2 (encoded as the letter B in Illumina's text-based encoding of quality scores)... This Q2 indicator does not predict a specific error rate, but rather indicates that a specific final portion of the read should not be used in further analyses."
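Inverting the equation gives p = 10^(-Q/10), so a quality score of 20 corresponds to an error probability of 0.01. The sketch below illustrates both directions, assuming the offset of 64 implied by "B" encoding a quality of 2; all function names are hypothetical.

```python
import math

# Assumption: offset of 64, since the text says quality 2 is encoded by ASCII 66.
PHRED_OFFSET = 64


def quality_to_error_probability(q):
    """Invert Q = -10 log10(p) to get p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10.0)


def error_probability_to_quality(p):
    """Compute Q = -10 log10(p) for an error probability p > 0."""
    return -10 * math.log10(p)


def decode_quality_string(quality):
    """Turn a quality line into a list of integer quality scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality]


scores = decode_quality_string("BCh")
probabilities = [quality_to_error_probability(q) for q in scores]
print(scores)
print(probabilities)
```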

Illumina reads have error rates that increase along the read (i.e. error rates are higher at the 3' end of the read). Before mapping sequenced reads, researchers typically process the reads and trim their 3' ends to remove positions whose error probability is above some threshold. Your task for this assignment is to write a function called trim_fastq(fastq_input_file, fastq_output_file, log_file, error_threshold, length_threshold) that processes the given Fastq file (fastq_input_file) and produces an output file whose name is given by the parameter fastq_output_file. When trimming, whenever there is a suffix of the sequence whose error probabilities are above error_threshold, that suffix is trimmed. If the resulting read is shorter than length_threshold, that read is discarded. The parameter log_file is the name of a file in which you will output some statistics on the results of trimming: how many reads were processed, how many reads were discarded, how many reads were trimmed, and the average error probability per position.
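The trimming step for a single read might look like the sketch below. It is one possible reading of the rule, not the required solution: it assumes the suffix to remove is the longest run of trailing positions that each have an error probability above the threshold, and it assumes an encoding offset of 64. The helper name trim_low_quality_suffix is hypothetical; file handling and the log statistics are left out.

```python
# Assumption: offset of 64, since the text says quality 2 is encoded as "B".
PHRED_OFFSET = 64


def trim_low_quality_suffix(sequence, quality, error_threshold):
    """Drop the longest suffix whose positions all exceed error_threshold.

    Returns the trimmed (sequence, quality) pair.
    """
    end = len(sequence)
    while end > 0:
        q = ord(quality[end - 1]) - PHRED_OFFSET
        error_probability = 10 ** (-q / 10.0)
        if error_probability <= error_threshold:
            break  # this position is good enough, so stop trimming here
        end -= 1
    return sequence[:end], quality[:end]


trimmed_sequence, trimmed_quality = trim_low_quality_suffix("ACGTAC", "hhhhBB", 0.05)
print(trimmed_sequence, trimmed_quality)
```

A caller would then compare len(trimmed_sequence) with length_threshold to decide whether to keep the read.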

Submit the programs using ramct. At the top of each file put a comment that identifies you and the program (use a multi-line comment using triple quotes).

March 10, 2013, at 09:27 AM MST by 24.54.128.180 -
Changed lines 3-25 from:

Due date: February 22nd.

String matching with mismatches

The find function we wrote in class determines whether a given string is a substring of another string, and returns the first position where they match. Your task is to write a more general function that allows mismatches to occur. The signature of your function should be:

find_with_mismatches(s, substr, num_mismatches)

You are looking for matches of substr in the string s that have up to num_mismatches mismatches, and you are to return the first index where it occurs, or -1 if it does not. For example: the string CGCT occurs in AGGTCACTAG at index 4 when you allow for a single mismatch. In the context of motif finding this is very useful, since patterns (motifs) in DNA or protein sequences do not always occur in exactly the same way. Searching with mismatches allows us to capture this variability. Further assume that your function is receiving a DNA sequence as input, and that positions that are not A, C, G, or T in the string you are searching in (s) do not constitute matches.
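One straightforward way to meet this specification is to slide substr along s and count mismatches at each start position. The sketch below takes that approach and assumes uppercase input; it is an illustration, not the only acceptable solution.

```python
def find_with_mismatches(s, substr, num_mismatches):
    """Return the first index where substr occurs in s with at most
    num_mismatches mismatches, or -1 if there is no such index.

    Positions of s that are not A, C, G, or T always count as mismatches.
    """
    valid = set("ACGT")
    for start in range(len(s) - len(substr) + 1):
        mismatches = 0
        for offset, expected in enumerate(substr):
            actual = s[start + offset]
            if actual not in valid or actual != expected:
                mismatches += 1
                if mismatches > num_mismatches:
                    break  # too many mismatches at this start position
        else:
            # The inner loop finished without breaking: this window qualifies.
            return start
    return -1


print(find_with_mismatches("AGGTCACTAG", "CGCT", 1))
```

With the example from the text, the window CACT at index 4 differs from CGCT in one position, so the call returns 4.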

Put your function in a file called find_with_mismatches.py.

Reverse complement

Write a function that receives as input a DNA sequence and computes its reverse complement. For example, the reverse complement of AGTCATG is CATGACT. In computing the reverse complement, assume that any character that is not A, C, G, or T is its own complement. Call your function reverse_complement, and put it in a file called reverse_complement.py.
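A compact sketch of this function, assuming uppercase input. The lookup table name COMPLEMENT is hypothetical.

```python
# Complement table for the four standard bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence.

    Any character other than A, C, G, or T is treated as its own complement.
    """
    return "".join(COMPLEMENT.get(base, base) for base in reversed(sequence))


print(reverse_complement("AGTCATG"))
```

Using dict.get with the base itself as the default handles the "its own complement" rule without a separate branch.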

Submit the programs by email to your instructor. At the top of each file put a comment that identifies you and the program (use a multi-line comment using triple quotes).

to:

Due date: March xx.

Processing Fastq files

Next generation sequencing data

Submit the programs using ramct. At the top of each file put a comment that identifies you and the program (use a multi-line comment using triple quotes).

February 19, 2010, at 10:26 PM MST by 71.196.160.210 -
Changed line 11 from:

You are looking for matches of substr in the string s that have up to num_mismatches mismatches, and you are to return the index where it occurs, or -1 if it does not.

to:

You are looking for matches of substr in the string s that have up to num_mismatches mismatches, and you are to return the first index where it occurs, or -1 if it does not.

February 15, 2010, at 12:48 PM MST by 10.84.44.68 -
Changed line 26 from:

In addition, each function should include a useful help message that explains what it does, what kind of input it expects, and what it returns.

to:

In addition, each function should include a comment in triple quotes that explains what it does, what kind of input it expects, and what it returns (such comments are used as help messages).

February 15, 2010, at 12:47 PM MST by 10.84.44.68 -
Added lines 23-26:

Submit the programs by email to your instructor. At the top of each file put a comment that identifies you and the program (use a multi-line comment using triple quotes). In addition, each function should include a useful help message that explains what it does, what kind of input it expects, and what it returns.

February 15, 2010, at 12:39 PM MST by 10.84.44.68 -
Changed lines 3-4 from:

String matching allowing mismatches

to:

Due date: February 22nd.

String matching with mismatches

Changed lines 16-24 from:

Put your function in a file called find_with_mismatches.py.

to:

Put your function in a file called find_with_mismatches.py.

Reverse complement

Write a function that receives as input a DNA sequence and computes its reverse complement. For example, the reverse complement of AGTCATG is CATGACT. In computing the reverse complement, assume that any character that is not A, C, G, or T is its own complement. Call your function reverse_complement, and put it in a file called reverse_complement.py.

February 15, 2010, at 12:31 PM MST by 10.84.44.68 -
Added lines 1-14:

Assignment 4

String matching allowing mismatches

The find function we wrote in class determines whether a given string is a substring of another string, and returns the first position where they match. Your task is to write a more general function that allows mismatches to occur. The signature of your function should be:

find_with_mismatches(s, substr, num_mismatches)

You are looking for matches of substr in the string s that have up to num_mismatches mismatches, and you are to return the index where it occurs, or -1 if it does not. For example: the string CGCT occurs in AGGTCACTAG at index 4 when you allow for a single mismatch. In the context of motif finding this is very useful, since patterns (motifs) in DNA or protein sequences do not always occur in exactly the same way. Searching with mismatches allows us to capture this variability. Further assume that your function is receiving a DNA sequence as input, and that positions that are not A, C, G, or T in the string you are searching in (s) do not constitute matches.

Put your function in a file called find_with_mismatches.py.