CS548 project - Short-read spliced alignment benchmarking
==========================================================

Simulated data format description (produced by Jeremy Hyrkas)

The MAQ simulator was used to generate the simulated short reads. The simulator was provided with the sequences of all known Arabidopsis thaliana splice forms (extracted from the TAIR 10 genome annotations). Roughly three million short reads of length 75 were generated. The mutation rate and confidence scores were based on real short-read data from Dr. ASN Reddy's lab.

The file is in FASTQ format. The headers record the genomic coordinates where a read originated and how the read should be aligned. A header has the format:

    @Chr#_pos_CIGAR

where Chr# is the chromosome number, pos is the genomic coordinate where the read starts, and CIGAR is the CIGAR-style string that describes the alignment. For example:

    @Chr1_5046_75M
    @Chr3_10475_12M200N63M

If a read came from a region of the transcript that contains no splice junctions, the CIGAR string is 75M (75 nucleotides matched). If the read crosses one or more splice junctions, the CIGAR string reflects them. For example, the CIGAR string 15M46N60M represents a read that crosses one splice junction, with 15 nucleotides on one side and 60 on the other, separated by an intron of length 46. The CIGAR string 13M34N50M46N12M represents a read that crosses two splice junctions, where the 50 matches in the middle of the read cover a whole exon.

When aligning the reads, if the program you are using does not produce SAM format files, please convert the alignments to this format. In the SAM format specification, the first column of an alignment is the name of the read from the original FASTQ file. The third column is the name of the sequence to which the read aligned. The fourth column is the position at which the read mapped, and the sixth column is the CIGAR string for the mapping.

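The header format described above can be unpacked with a few lines of Python. This is only a sketch under stated assumptions: the names `parse_header`, `parse_cigar`, `read_length` and `intron_lengths` are hypothetical helpers, not part of the project, and the pattern assumes that headers contain nothing beyond `Chr#_pos_CIGAR` and that the CIGAR strings use only the M and N operations shown in the examples.

```python
import re
from typing import List, NamedTuple, Tuple

# Assumption: a header is exactly "@Chr#_pos_CIGAR" with only M and N operations.
HEADER_RE = re.compile(r"^@?(Chr[0-9A-Za-z]+)_(\d+)_((?:\d+[MN])+)$")
CIGAR_OP_RE = re.compile(r"(\d+)([MN])")


class TrueAlignment(NamedTuple):
    chrom: str   # e.g. "Chr3"
    pos: int     # genomic coordinate where the read starts
    cigar: str   # e.g. "12M200N63M"


def parse_header(header: str) -> TrueAlignment:
    """Split a simulated read header into chromosome, position and CIGAR."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unexpected header format: {header!r}")
    chrom, pos, cigar = match.groups()
    return TrueAlignment(chrom, int(pos), cigar)


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '15M46N60M' into [(15, 'M'), (46, 'N'), (60, 'M')]."""
    return [(int(length), op) for length, op in CIGAR_OP_RE.findall(cigar)]


def read_length(cigar: str) -> int:
    """Number of read bases: the sum of the M operations."""
    return sum(length for length, op in parse_cigar(cigar) if op == "M")


def intron_lengths(cigar: str) -> List[int]:
    """Lengths of the introns (N operations) the read spans, in order."""
    return [length for length, op in parse_cigar(cigar) if op == "N"]


if __name__ == "__main__":
    truth = parse_header("@Chr3_10475_12M200N63M")
    print(truth.chrom, truth.pos, truth.cigar)
    print(read_length(truth.cigar), intron_lengths(truth.cigar))
```

Summing the M operations is a quick sanity check: every CIGAR string in this data set should add up to the read length of 75.
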
To determine whether a simulated read was mapped correctly, you can simply pull the correct chromosome number, position, and CIGAR string from the header and compare them to the values reported in the alignment. For example, consider the following alignments from MapSplice:

    Chr1_5029_67M78N8M 0 Chr1 5029 255 67M78N8M (other columns here)

is a correctly aligned read, whereas the following read was aligned incorrectly:

    Chr4_2718888_24M456N51M 16 Chr4 1279033 86 51M228N24M (other columns here)

The new set of simulated reads (reads1, reads2) consists of paired-end reads. These reads were generated using the SimSeq tool. An error profile was generated from the alignment of short-read data from rice. Using this error profile, short reads were generated from transcripts of the known splice forms of Arabidopsis. From these reads, the correct alignments to the Arabidopsis genome were calculated. The two reads of a pair share the same read name, differing only in the final character (e.g. at_sim_1/1 and at_sim_1/2). The true_alignments.txt file contains the correct chromosome, position, and CIGAR string for each read when aligned to the Arabidopsis genome.
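
The header-based check described above for the first (single-end) data set compares chromosome, position, and CIGAR string. It could be sketched in Python as follows. The function names `truth_from_read_name`, `is_correct` and `score_sam_lines` are hypothetical, and the sketch assumes tab-delimited SAM records whose read names carry nothing beyond `Chr#_pos_CIGAR`. It does not cover the paired-end set, whose true alignments are stored in true_alignments.txt rather than in the read names.

```python
from typing import Iterable, Tuple


def truth_from_read_name(name: str) -> Tuple[str, int, str]:
    """Recover (chromosome, position, CIGAR) from a name like 'Chr1_5029_67M78N8M'."""
    chrom, pos, cigar = name.lstrip("@").rsplit("_", 2)
    return chrom, int(pos), cigar


def is_correct(sam_line: str) -> bool:
    """Compare SAM columns 3, 4 and 6 with the values encoded in column 1."""
    fields = sam_line.rstrip("\n").split("\t")
    qname, rname, pos, cigar = fields[0], fields[2], fields[3], fields[5]
    return truth_from_read_name(qname) == (rname, int(pos), cigar)


def score_sam_lines(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (correct, total) over alignment records, skipping '@' header lines."""
    correct = total = 0
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # SAM header line or blank line, not an alignment record
        total += 1
        if is_correct(line):
            correct += 1
    return correct, total


if __name__ == "__main__":
    # The two MapSplice records quoted above, with tabs between columns.
    good = "Chr1_5029_67M78N8M\t0\tChr1\t5029\t255\t67M78N8M"
    bad = "Chr4_2718888_24M456N51M\t16\tChr4\t1279033\t86\t51M228N24M"
    print(is_correct(good), is_correct(bad))
    print(score_sam_lines(["@HD\tVN:1.0", good, bad]))
```

This exact-match test is deliberately strict: an aligner that reports soft clipping or an otherwise equivalent CIGAR string would be counted as wrong, and unmapped reads are counted as incorrect, so a more tolerant comparison may be appropriate depending on what is being benchmarked.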