assignments:assignment2

Differences

This shows you the differences between two versions of the page.

assignments:assignment2 [2016/03/28 17:16] asa
assignments:assignment2 [2016/09/01 09:29] (current)
Line 1:
 ====== Assignment 2 ======
  
-Due date: 12/at 9pm
+Due date: 4/11 at 9pm
  
 === Part 1: Processing Fasta files ===
  
-Fasta file manipulation. You are given a file in Fasta format. Write a function that converts the sequences in the file to uppercase. Your function should have two arguments: the input file name, and the output file name.
+Fasta file manipulation. You are given a file in Fasta format. Write a function that converts the sequences in the file to uppercase (headers should remain as they are). Your function should have two arguments: the input file name, and the output file name.
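One way to read the Part 1 requirement is sketched below. The function name ''uppercase_fasta'' is a hypothetical choice, not one given in the assignment, and the sketch assumes that header lines are exactly the lines that start with ''>''.

```python
def uppercase_fasta(input_file_name, output_file_name):
    """Copy a Fasta file, uppercasing sequence lines and leaving headers as-is.

    Hypothetical helper: the assignment only fixes the two arguments,
    not the function name.
    """
    with open(input_file_name) as infile, open(output_file_name, "w") as outfile:
        for line in infile:
            if line.startswith(">"):
                # Header line: write it back unchanged.
                outfile.write(line)
            else:
                # Sequence line: convert every base to uppercase.
                outfile.write(line.upper())
```

Reading line by line keeps memory use independent of the file size, which matters for genome-scale Fasta files.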
  
 === Part 2: Processing Fastq files ===
  
-Next generation sequencing reads are available in a format called Fastq, which is similar to the Fasta format that you are already familiar with. Before mapping the reads in a Fastq file to a genome they are usually preprocessed, trimming the reads to eliminate reads or parts of reads that are low quality.
+Next generation sequencing reads are available in a format called **Fastq**, which is similar to the Fasta format that you are already familiar with. Before mapping the reads in a Fastq file to a genome they are usually preprocessed, trimming the reads to eliminate reads or parts of reads that are low quality.
 Before describing the task, here is a short description of the Fastq format. A FASTQ file uses four lines per sequence:
  
Line 26:
 Q = -10 log (p),
 where the logarithm is in base 10. The quality score is encoded using ASCII characters, where the ASCII value represents the quality score. The exact mapping from quality scores to ASCII characters varies somewhat between versions of the Illumina platform.
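As a small illustration of the formula above, the sketch below decodes one quality character into its error probability by inverting Q = -10 log (p). The function name and the default offset of 33 are assumptions: 33 is the offset used by the Sanger encoding and by Illumina 1.8 and later, while some older Illumina versions use 64, so the offset is left as a parameter.

```python
def phred_to_error_probability(quality_char, offset=33):
    """Return the error probability p encoded by one Fastq quality character.

    Hypothetical helper. `offset` is an assumption about the encoding:
    33 for Sanger / Illumina 1.8+, 64 for some older Illumina versions.
    """
    # The ASCII value minus the offset is the quality score Q.
    q = ord(quality_char) - offset
    # Invert Q = -10 * log10(p) to get p = 10 ** (-Q / 10).
    return 10 ** (-q / 10)
```

With the default offset, the character ''I'' (ASCII 73) corresponds to Q = 40, an error probability of one in ten thousand.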
-Illumina reads have error rates that increase along the read (i.e. error rates are higher at the 3' end of the read). Before mapping sequenced reads, researchers typically process the reads and trim their 3' end so that the error probability stays below some threshold. Sometimes you also need to trim the beginning due to the presence of adapter sequences.
+Illumina reads have error rates that increase along the read (i.e. error rates are higher at the 3' end of the read). Before mapping sequenced reads, researchers typically process the reads and trim their 3' end so that the error probability stays below some threshold. Sometimes you also need to trim the 5' end due to the presence of adapter sequences.
  
 Your task for this assignment is to write a function called ''trim_fastq(fastq_input_file, fastq_output_file, first_base, last_base, log_file)'' that processes the given fastq file (''fastq_input_file''), and produces an output file whose name is given by the parameter ''fastq_output_file''.
 In trimming the file, ''first_base'' is the first base that is to be included on the 5' end, and ''last_base'' is the last base that is included on the 3' end.
 For example, if your reads are length 100 and you want to trim the first 5 and the last 10 nucleotides, you choose 6 as ''first_base'' and 90 as ''last_base''. This agrees with how the commonly used trimming program ''fastx_trimmer'' is used.
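Because ''first_base'' and ''last_base'' are 1-based and inclusive while Python slices are 0-based and exclude the end, the conversion is easy to get wrong by one. The sketch below shows only that index arithmetic for a single read; ''trim_read'' is a hypothetical helper, not part of the required interface, and it assumes both bounds have already been validated.

```python
def trim_read(sequence, quality, first_base, last_base):
    """Keep bases first_base..last_base (1-based, inclusive) of one read.

    Hypothetical helper; assumes 1 <= first_base <= last_base.
    The quality string is trimmed identically so it stays aligned
    with the sequence.
    """
    # 1-based inclusive start -> 0-based slice start.
    start = first_base - 1
    # A 1-based inclusive end equals the 0-based exclusive slice end.
    return sequence[start:last_base], quality[start:last_base]
```

For the example above (reads of length 100, ''first_base'' 6, ''last_base'' 90), the slice keeps 90 - 6 + 1 = 85 bases.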
+The last parameter, ''log_file'', is the name of a file to which you will write some statistics on the results of trimming: how many reads were processed, and the length of the resulting reads.
+If the user provides bad inputs for the values of ''first_base'' or ''last_base'', the log file should report that with an informative message to the user. Think carefully about which cases should be checked for. The idea is that your program should terminate without an error even when given bad inputs.
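A possible starting point for the input checks is sketched below. The function name ''check_trim_bounds'' and the particular cases it covers are assumptions; the assignment deliberately leaves the full list of cases open, and checks that depend on the actual read length (for example a ''last_base'' beyond the end of a read) are not covered here.

```python
def check_trim_bounds(first_base, last_base):
    """Return a list of problems with the trimming bounds (empty if none).

    Hypothetical helper: returning messages instead of raising lets the
    caller write them to the log file and exit without an error.
    """
    problems = []
    for name, value in (("first_base", first_base), ("last_base", last_base)):
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{name} must be an integer, got {value!r}")
    if problems:
        # The comparisons below are only meaningful for integers.
        return problems
    if first_base < 1:
        problems.append(f"first_base must be at least 1, got {first_base}")
    if last_base < first_base:
        problems.append(
            f"last_base ({last_base}) must not be smaller than "
            f"first_base ({first_base})"
        )
    return problems
```

A caller could write each returned message to the log file and return early when the list is non-empty, so that no exception reaches the user.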
+
+
+===== Submission =====
+
+Put the two functions in a module called ''assignment2.py'', and follow the [[wiki:template|template]] shown in class in writing your code. The "main" segment of the module should be used to test each of the functions.
+
+
+Submit your code via assignment P2 in Canvas.
  
-The last parameter, log_file, is a name of a file in which you will output some statistics on the results of trimming: how many reads were processed, ​ 
assignments/assignment2.txt · Last modified: 2016/09/01 09:29 (external edit)