Assignment 4

Due date: 10/18 at 10am.

Part 1: Processing Fasta files

Here is a fasta file that contains the sequences of introns in Arabidopsis that are known to be retained. The header line of each sequence contains some information about the intron. For example, the following intron


comes from the gene AT5G11400, and is on the negative strand. The “AT” in the beginning of the identifier is short for Arabidopsis thaliana, and the next digit is the chromosome from which the gene comes (5 in our example). Write a function that that extracts from this file all records that describe introns that are on the positive strand of chromosome 4, i.e. should also be a Fasta file. Your function should have two arguments: the input file name, and the output file name.

Part 2: Processing Fastq files

Next generation sequencing reads are often available in a format called Fastq, which is similar to Fasta. Before mapping the reads in a Fastq file to a genome they are often preprocessed, trimming the reads to eliminate low quality segments. Before describing the task, here is a short description of the Fastq format. A FASTQ file uses four lines p er sequence:

As an example, here are the first four lines from A FASTQ file containing a single sequence might look like this:

@SRR1909019.1 HWI-ST1435:162:HBB58ADXX:1:1101:1322:1969 length=101
+SRR1909019.1 HWI-ST1435:162:HBB58ADXX:1:1101:1322:1969 length=101

As mentioned above, the fourth line encodes the quality of each nucleotide in the sequence. The quality score is a representation of the probability that a base is correct. It is computed according to the equation: $$ Q = -10 log (p), $$ where the logarithm is in base 10. The quality score is encoded using ASCII characters, where the ASCII value represents the quality score. The exact mapping from quality scores to ASCII characters varies somewhat between different versions of the Illumina platform. Illumina reads have error rates that increase along the read (i.e. error rates are longer at the 3' end of the read). Before mapping sequenced reads researchers typically process the reads and trim their 3' end such that the probability error is above some threshold. Sometimes you also need to trim the 5' end due to the presence of adapter sequences.

Your task for this assignment is to write a function called trim_fastq(fastq_input_file, fastq_output_file, first_base, last_base, log_file) that processes the given fastq file (fastq_input_file), and produces an output file whose name is given by the parameter fastq_output_file. In trimming the file first_base is the first base that is to be included on the 5' end, and last_base is the last base that is included on the 3' end. For example, if your reads are length 100 and you want to trim the first 5 and the last 10 nucleotides, you choose 6 as first_base and 90 as last_base. This agrees with how the commonly used trimming program fastx_trimmer is used.

The last parameter, log_file, is a name of a file in which you will output some statistics on the results of trimming: how many reads were processed, and the length of the resulting reads. If the user provides bad inputs for the values of first_base or last_base, the log file should report that with an informative message to the user. Think carefully what cases should be checked for. The idea is that your program would terminate without giving an error message even when given bad inputs. The program should also exit gracefully if the input file does not exist.

Here's a sample fastq file that you can use in writing your program.


Put the two functions in a module called, and follow the template shown in class in writing your code. The “main” segment of the module should be used to test each of the functions. Submit your code via Canvas.