Projects
Projects this semester will focus on tools for spliced alignment of short read data. Each student will choose one of the following programs to look at:
- MapSplice (Jeremy)
K. Wang et al. MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucl. Acids Res. (2010) 38(18): e178.
- Gsnap (Zhisheng)
T.D. Wu and S. Nacu. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics (2010) 26: 873-881.
- palmapper (Fayyaz)
De Bona, F. et al., Optimal spliced alignments of short sequence reads. ECCB08/Bioinformatics, 24 (16):i174, 2008.
- TopHat (Jeremy)
Trapnell C, Pachter L, and Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (2009) 25 (9): 1105-1111.
- SpliceMap (Mo)
Kin Fai Au, Hui Jiang, Lan Lin, Yi Xing, and Wing Hung Wong. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Research (2010).
- SOAPsplice (Arpita)
Huang S, Zhang J, Li R, Zhang W, He Z, Lam T-W, Peng Z and Yiu S-M. SOAPsplice: genome-wide ab initio detection of splice junctions from RNA-Seq data. Frontiers in Genomic Assay Technology (2010) 2:46.
- BWA (Nathan)
Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010;26(5):589-95.
- RUM (Indika)
Gregory R. Grant, Michael H. Farkas, Angel D. Pizarro, Nicholas F. Lahens, Jonathan Schug, Brian P. Brunk, Christian J. Stoeckert, John B. Hogenesch, and Eric A. Pierce. Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics (2011) 27(18): 2518-2528.
During the course of the project you will:
- Present the method in class.
- Apply the program to simulated short-read data that we will provide.
- Write a report that describes your experience with the program, and present your findings to the class. Both will be due during the last week of classes.
Data
We ask you to apply your chosen program to the following two datasets:
- Simulated short-read data from Arabidopsis thaliana. Here's a readme.
- Short-read data generated by our collaborator in the biology department. The data are available from the NCBI short-read archive under GEO accession GSE32318. Note that the data consist of two replicates, which you need to align separately. The link for downloading the data is at the bottom of the page, labeled as the supplementary file download.
New data
- We have more simulated data: 14 million paired-end reads. There are two files: set1 and set2. Start by aligning them individually, then see whether using them as paired-end data improves accuracy. Here's a file that provides the CIGAR strings for the reads.
- For testing purposes, here are two files that provide a list of splice junctions generated from EST alignments and from curated gene models. [ EST junctions ], [ annotated junctions ].
- For comparison, here are the alignments produced by TopHat for the first set of reads and the second set of reads, and by MapSplice for the first set of reads and for the second set of reads.
To align the datasets you will need the sequence of the Arabidopsis genome, which you can download from the TAIR website. You will need the sequences for chromosomes 1-5.
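To compare your alignments against the provided CIGAR strings and junction lists, it may help to reduce each alignment to the introns it implies. The following Python sketch shows one way to do this. It assumes alignments are given as a chromosome name, a 1-based leftmost position, and a SAM-style CIGAR string in which `N` marks a skipped reference region. The function names and the `(chrom, first, last)` junction convention are illustrative assumptions, not the format of the files linked above, so adjust the coordinates to match those files before comparing.

```python
import re

# CIGAR operations that advance along the reference genome (SAM convention).
REF_CONSUMING = set("MDN=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_cigar(chrom, start, cigar):
    """Return the introns implied by one alignment as (chrom, first, last).

    `start` is the 1-based leftmost reference position of the alignment;
    `first` and `last` are the 1-based first and last bases of each intron.
    (Hypothetical helper; the coordinate convention is an assumption.)
    """
    junctions = []
    pos = start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":  # skipped region on the reference, i.e. an intron
            junctions.append((chrom, pos, pos + length - 1))
        if op in REF_CONSUMING:
            pos += length
    return junctions


def junction_accuracy(predicted, reference):
    """Precision and recall of predicted junctions against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall
```

For example, `junctions_from_cigar("Chr1", 1000, "30M200N45M")` yields a single intron spanning positions 1030 to 1229 on `Chr1`. Collecting junctions as a set means each junction is counted once regardless of how many reads support it; if you also want to assess read-level accuracy, you would compare per-read positions and CIGAR strings instead.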
Presentation schedule:
- Tuesday 10/11 Jeremy
- Thursday 10/13 Fayyaz and Mo
- Tuesday 10/18 Nathan and Arpita
- Thursday 10/20 Zhisheng and Indika
