Assignment 4
Due date: March 29th.
Processing Fastq files
Next generation sequencing reads are available in a format called Fastq, which is similar to the Fasta format that you are already familiar with. Before mapping the reads in a Fastq file to a genome they are usually preprocessed, trimming the reads to eliminate reads or parts of reads that are low quality. Quality is determined by quality scores that are provided with the reads as described below. Your task for this assignment is to write a program that processes Fastq files as described below.
Before describing the assignment, here is a short description of the Fastq format (adapted from the wikipedia article). A FASTQ file uses four lines per sequence:
- Line 1 begins with a '@' character and is followed by a sequence identifier.
- Line 2 is the raw sequence letters.
- Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
- Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
A FASTQ file containing a single sequence might look like this:
@SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
As mentioned above, the fourth line encodes the quality of each nucleotide in the sequence. The quality score is a representation of the probability that a base is correct. It is computed according to the equation:
Q = -10 log (p),
where the logarithm is in base 10. The quality score is encoded using ASCII characters, where the ASCII value represents the quality score. The exact mapping from quality scores to ASCII characters varies somewhat between versions of the Illumina platform. In the data we will consider quality score of 2 is encoded by ASCII 66, which is "B" (quality scores 0 and 1 are not used), 3 is "C" and so on.
The Illumina manual states that "If a read ends with a segment of mostly low quality (Q15 or below), then all of the quality values in the segment are replaced with a value of 2 (encoded as the letter B in Illumina's text-based encoding of quality scores)... This Q2 indicator does not predict a specific error rate, but rather indicates that a specific final portion of the read should not be used in further analyses."
For the conversion from the letters that encode quality scores to integers use the python ord function. For the converse conversion use chr.
Illumina reads have error rates that increase along the read (i.e. error rates are longer at the 3' end of the read). Before mapping sequenced reads researchers typically process the reads and trim their 3' end such that the probability error is above some threshold.
Your task for this assignment is to write a function called trim_fastq (fastq_input_file, fastq_output_file, log_file, error_threshold, length_threshold) that processes the given fastq file (fastq_input_file), and produces an output file whose name is given by the parameter fastq_output_file.
When trimming, whenever there is a suffix of the sequence whose error probabilities are above the threshold, that suffix is trimmed. If the resulting read has a length less than the parameter length_threshold, that read is discarded.
The last parameter, log_file, is a name of a file in which you will output some statistics on the results of trimming: how many reads were processed, how many reads were discarded, how many reads were trimmed, and the average error probability per position.
While developing your code use this example.
Submit the program using ramct. At the top of each file put a comment that identifies you and the program (use a multi-line comment using triple quotes). In addition, your function should include a comment in triple quotes that explains what it does, what kind of input it expects, and what it returns (such comments are used as help messages).
