A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. Draft genomes can contain a large number of these errors, which add to the cost and time of many scientific initiatives, including Genome 10K, the i5K project, and 1001 Genomes. We develop a method that enhances the quality of draft genomes by identifying and removing misassembly errors using paired short-read sequence data and optical mapping data.
We apply our method to various assemblies of the loblolly pine and Francisella tularensis genomes. Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in an assembly of Francisella tularensis, and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in the assemblies of loblolly pine.

Team

Related Publications

Misassembly Detection using Paired-End Sequence Reads and Optical Mapping Data
by Martin D. Muggli, Simon J. Puglisi, Roy Ronen, and Christina Boucher. In submission.

Downloads

misSEQuel Application
Identifies misassembled contigs based on paired-end read alignment as well as optical map alignment.
Intermediate data from experiments
ART simulated Pine reads

Requirements

misSEQuel requires the following:
  • Java
  • TWIN (which must be in your PATH)
  • Python3 and Biopython
  • BWA: the prep.pl script uses BWA (0.6.1 or greater) to align reads to contigs. By default, it looks for the bwa executable in $PATH, unless a path is passed as a parameter using '-b [PATH-TO-BWA]'.
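
Before running misSEQuel, it can help to confirm that the requirements listed above are actually visible from your environment. The following is a minimal sketch of such a check, not part of the misSEQuel distribution. The executable names are assumptions: in particular, "twin" is a hypothetical name for the TWIN binary and may differ in your installation.

```python
import importlib.util
import shutil

# Assumed executable names. "twin" is a hypothetical name for the TWIN binary.
REQUIRED_EXECUTABLES = ["java", "python3", "bwa", "twin"]


def missing_requirements(executables=REQUIRED_EXECUTABLES, modules=("Bio",)):
    """Return the executables and Python modules that cannot be found.

    Executables are looked up in PATH; modules are looked up without
    importing them ("Bio" is the import name of Biopython).
    """
    missing = [exe for exe in executables if shutil.which(exe) is None]
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    if problems:
        print("Missing requirements: " + ", ".join(problems))
    else:
        print("All requirements found.")
```

This check only confirms that each tool can be located; it does not verify versions, so the BWA version requirement still has to be checked by hand.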

Manual

Comprehensive Front End
misSEQuel builds upon a variety of tools. missequel.py, in the root of the distribution, is a front end to these tools. A typical invocation would be:
 
python3 misSEQuel/missequel.py --outdir misSEQuel_out --contigs contigs.fasta --opt_map ecoli_XhoI_om  --opt_map ecoli_Swai_om --enzyme XhoI --enzyme SwaI --is_prokaryote --reads1 mc.orig.1.fq --reads2 mc.orig.2.fq

Optical map files (in SOMA 'match' format) and their corresponding enzyme names are assumed to be given in the same order (i.e., the first optical map file corresponds to the first enzyme name, and so on).
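
Because the pairing between optical maps and enzymes depends only on argument order, a mismatch is easy to introduce when several maps are used. The sketch below is an illustration, not code from the distribution. It checks that the two lists have the same length and builds an argument list using only the flags from the invocation above. The function names are hypothetical.

```python
def pair_maps_with_enzymes(opt_maps, enzymes):
    """Pair each optical map file with the enzyme name at the same position."""
    if len(opt_maps) != len(enzymes):
        raise ValueError(
            "got %d optical map(s) but %d enzyme name(s)"
            % (len(opt_maps), len(enzymes))
        )
    return list(zip(opt_maps, enzymes))


def build_frontend_args(outdir, contigs, reads1, reads2, opt_maps, enzymes,
                        is_prokaryote=False):
    """Build an argument list for missequel.py from paired maps and enzymes."""
    pairs = pair_maps_with_enzymes(opt_maps, enzymes)
    args = ["python3", "misSEQuel/missequel.py",
            "--outdir", outdir, "--contigs", contigs]
    # Emit maps and enzymes in the same order so they stay paired.
    for opt_map, _enzyme in pairs:
        args += ["--opt_map", opt_map]
    for _opt_map, enzyme in pairs:
        args += ["--enzyme", enzyme]
    if is_prokaryote:
        args.append("--is_prokaryote")
    args += ["--reads1", reads1, "--reads2", reads2]
    return args


if __name__ == "__main__":
    print(" ".join(build_frontend_args(
        "misSEQuel_out", "contigs.fasta", "mc.orig.1.fq", "mc.orig.2.fq",
        ["ecoli_XhoI_om", "ecoli_Swai_om"], ["XhoI", "SwaI"],
        is_prokaryote=True)))
```

The resulting list can be passed to subprocess.run; building it as a list rather than a single string avoids shell quoting problems with file names.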

The names of the options can be accessed via the --help option as follows:

$ python3 misSEQuel/missequel.py --help
Usage: missequel.py [options]

Options:
  -h, --help         show this help message and exit
  --reads1=R1
  --reads2=R2
  --contigs=CONTIGS
  --opt_map=OPT_MAP
  --outdir=OUTDIR
  --enzyme=ENZYMES
  --is_prokaryote
  --verbose

Included Tool Manual

The following tools can be invoked via the missequel.py front end mentioned above. For more specific use cases, their options are described below.
Preprocessing step
Use missequel_prep.pl to preprocess the data by aligning the paired-end reads to the contigs from assembly (using BWA), and creating a directory (prep) that will be input to misSEQuel.
  • missequel_prep.pl [OPTIONS...] -m -r1 [1.fq] -r2 [2.fq] -c [contigs.fa]

    INPUT:
    -r1 FILE paired-end (1) reads from sequencing (FASTQ/FASTA)
    -r2 FILE paired-end (2) reads from sequencing (FASTQ/FASTA)
    -s FILE alignments of all reads to all contigs from assembly (SAM)
    -c FILE contigs from assembly (multi-FASTA)

    OPTIONS
    -o PATH name of output directory [prep]
    -l INT do not refine contigs smaller than INT bases [0]
    -t INT threads to use in BWA alignment [4]
    -h show this helpful help
misSEQuel
misSEQuel processes all the contigs using misSEQuel.jar.
  • java -Xmx12g -jar misSEQuel.jar [OPTIONS..] -miss -A [PREP DIR] -i [INT] -p [INT]

    INPUT:
    -i INT external insert size of paired-end reads (from 'prep.log')
    -A DIR batch mode, on prep-directory DIR (ignores -c,-ap,-as)

    OPTIONS
    -p INT max threads [1]
    -u INT min threads [1]
    -o DIR output directory
    -C FILE config file, for paths to BLAT and blat_wrapper.pl (see README)
    -k INT k-mer size [50]
    -d INT max positional error (Delta) [25]
    -r report changes (slow) for all input-contigs [<30kb]
    -g FILE evaluate refinement using reference genome
    -D debug mode
    -h show this helpful help
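
When the two included tools are run by hand rather than through the front end, the preprocessing step and the misSEQuel step are chained: the prep directory written by the first is the -A input of the second. The sketch below shows one way to assemble both commands, using only the flags documented above. The function names are hypothetical, and the insert size of 500 is a placeholder; the real value should be taken from 'prep.log', whose format is not described here.

```python
import shlex


def prep_command(reads1, reads2, contigs, outdir="prep", threads=4):
    """Argument list for the preprocessing step (missequel_prep.pl)."""
    return [
        "missequel_prep.pl",
        "-o", outdir,
        "-t", str(threads),
        "-m",
        "-r1", reads1,
        "-r2", reads2,
        "-c", contigs,
    ]


def missequel_command(prep_dir, insert_size, threads=1,
                      jar="misSEQuel.jar", heap="12g"):
    """Argument list for the misSEQuel step, run in batch mode on prep_dir."""
    if insert_size <= 0:
        raise ValueError("insert size must be a positive integer")
    return [
        "java", "-Xmx" + heap, "-jar", jar,
        "-miss",
        "-A", prep_dir,
        "-i", str(insert_size),
        "-p", str(threads),
    ]


if __name__ == "__main__":
    # 500 is a placeholder insert size; read the real one from prep.log.
    for cmd in (prep_command("1.fq", "2.fq", "contigs.fa"),
                missequel_command("prep", 500, threads=4)):
        print(shlex.join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) so that a failure in preprocessing stops the pipeline before the second step starts.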
Output
misSEQuel writes all misassembly information to the following files:
  • misassemblies.txt A list of the IDs of all contigs that are deemed to be misassembled.
  • breakpoints.txt A list of the misassembly breakpoints for each contig.
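
A common next step is to separate the flagged contigs from the rest of the assembly. The sketch below illustrates this under two assumptions that are not confirmed by the documentation above: that misassemblies.txt holds one contig ID per line, and that a contig's ID is the first word of its FASTA header. The function names are hypothetical, and the format of breakpoints.txt is not assumed or parsed.

```python
def read_ids(lines):
    """Collect contig IDs, assuming one ID per non-empty line."""
    return {line.strip() for line in lines if line.strip()}


def split_contigs(fasta_lines, flagged_ids):
    """Split FASTA records into (flagged, clean) dicts of ID -> sequence."""
    flagged, clean = {}, {}
    target, contig_id = None, None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the contig ID is the first word of the header.
            fields = line[1:].split()
            contig_id = fields[0] if fields else ""
            target = flagged if contig_id in flagged_ids else clean
            target[contig_id] = ""
        elif target is not None:
            # Sequence lines are concatenated onto the current record.
            target[contig_id] += line
    return flagged, clean


if __name__ == "__main__":
    ids = read_ids(["contig_2\n"])
    fasta = [">contig_1 len=4", "ACGT", ">contig_2", "GG", "CC"]
    flagged, clean = split_contigs(fasta, ids)
    print(sorted(flagged), sorted(clean))
```

With real data, both functions can be given open file handles, for example read_ids(open("misassemblies.txt")), since they only iterate over lines.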

Help

misSEQuel is freely available software for academic use. For nonacademic use, please contact the authors.
Send your questions or comments to sequel [dot] help [at] gmail [dot] com.