{
 "metadata": {
  "name": "",
  "signature": "sha256:09cb2864e5ced4d1ac217df508c42b8f76a9459e34e7fd6190bc0bb89422b3d4"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "TWIN Tutorial"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this tutorial, we'll take a reference genome (*E. coli*), simulate a set of contigs and an optical map, and then use TWIN to align the contigs to the optical map.  While TWIN scales to large genomes, this tutorial will focus on clarity, at times sacrificing efficiency.\n",
      "\n",
      "After most code cells, some portion of the data produced by that cell is displayed to make the steps easier to follow.\n",
      "\n",
      "The tutorial uses only a small subset of Python, which should keep it accessible to everyone.  Two Python features may not be transparent, however.  The first is the list comprehension, which generates a list in which each element is the value of an expression.  The syntax is:\n",
      "\n",
      "    [expression FOR variable IN iterable]  # where uppercase denotes keywords\n",
      "\n",
      "So this Python code:\n",
      "\n",
      "    [ x + x for x in range(10) ]\n",
      "\n",
      "is an expression which, when evaluated, will produce:\n",
      "\n",
      "    [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]\n",
      "\n",
      "The second feature is tuple unpacking in assignment: (x,y,z) = (1,2,3) assigns one integer to each of the variables on the left, in order.  The parentheses are optional in many cases.\n",
      "\n",
      "First, we'll import some useful modules:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import Bio\n",
      "import Bio.Restriction\n",
      "import Bio.SeqIO\n",
      "import Bio.SearchIO\n",
      "import random\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "/s/chopin/l/grad/muggli/py3env/lib/python3.3/site-packages/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.\n",
        "  BiopythonExperimentalWarning)\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ref_genome = \"./ecoli_ref.fa\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
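    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a quick warm-up for the two Python features described above, here is a small sketch that combines them.  The names `lengths`, `kept`, and `pairs` are made up for this illustration and are not used elsewhere in the tutorial.  A list comprehension may also end with an `if` clause that filters elements, and tuple unpacking works in the header of a `for` loop as well as in assignment:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "lengths = [500, 1200, 90, 4000]\n",
      "\n",
      "# a list comprehension with a trailing 'if' keeps only the elements that pass the test\n",
      "kept = [n for n in lengths if n >= 700]\n",
      "print(kept)\n",
      "\n",
      "# tuple unpacking in assignment\n",
      "first, second = kept\n",
      "print(first, second)\n",
      "\n",
      "# tuple unpacking in a for loop: enumerate() yields (index, value) pairs\n",
      "pairs = [(i, n) for i, n in enumerate(kept)]\n",
      "for i, n in pairs:\n",
      "    print(i, n)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },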
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Creating an Optical Map Input File"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, we'll simulate the optical map.  \n",
      "\n",
      "In practice, since *E. coli* has a circular chromosome, we would want to concatenate two copies of the reference genome.  This would allow contigs that span the point where the circle is broken to form a linear sequence to align successfully.  (In this scenario, any alignment whose left end falls to the right of the original genome length would duplicate another alignment, since the reference sequence was duplicated.)  *To keep things simple, we'll ignore the circularity of E. coli.*\n",
      "\n",
      "We can use Biopython's restriction module to digest the reference genome with the SwaI (https://www.neb.com/products/r0604-swai) enzyme:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open(ref_genome, \"r\") as ref_handle:\n",
      "    for ref_seq in Bio.SeqIO.parse(ref_handle, \"fasta\"): # iterate through all (in this case one) of the seqs in the fasta file\n",
      "        ref_frag_seqs = Bio.Restriction.SwaI.catalyze(ref_seq.seq) # split that sequence into a tuple of sequences\n",
      "    \n",
      "print(ref_frag_seqs[:2]) # print the first two fragments"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(Seq('AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAG...TTT', SingleLetterAlphabet()), Seq('AAATTAAAATCCATCTTTCAACCTCTTGATATTTTGGGGGTTAATTAATCTTTC...TTT', SingleLetterAlphabet()))\n"
       ]
      }
     ],
     "prompt_number": 4
    },
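    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The doubling trick for circular chromosomes described above can be sketched on a toy string.  This is only an illustration and is not used by the rest of the tutorial: `toy_genome`, `doubled`, `spanning_contig`, and `dup_start` are hypothetical names, and a plain Python string stands in for the reference sequence."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "toy_genome = 'AACCGGTT'            # pretend this is a circular chromosome, cut open between T and A\n",
      "doubled = toy_genome + toy_genome  # two copies back to back\n",
      "\n",
      "# a contig that spans the point where the circle was cut\n",
      "spanning_contig = 'TTAA'\n",
      "\n",
      "print(toy_genome.find(spanning_contig))  # -1 means not found in the linear sequence\n",
      "print(doubled.find(spanning_contig))     # found once the sequence is doubled\n",
      "\n",
      "# a match that starts at or beyond len(toy_genome) duplicates an earlier match\n",
      "dup_start = doubled.find('CCGG', len(toy_genome))\n",
      "print(dup_start, dup_start - len(toy_genome))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },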
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we'll add and use a function to simulate Gaussian noise and remove small fragments:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "random.seed(8675309) # initialize this to a \"random\" number so this notebook is reproducible\n",
      "\n",
      "def noise(lst, small_cutoff, stddev):\n",
      "    return [int(random.gauss(0, stddev)) + s for s in lst if s >= small_cutoff]\n",
      "\n",
      "ref_frag_lengths = [len(ref_frag) for ref_frag in ref_frag_seqs]\n",
      "noisy_frags = noise(ref_frag_lengths, small_cutoff=700, stddev=150)\n",
      "\n",
      "print(noisy_frags)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[40226, 81532, 122840, 1194, 50221, 19636, 27276, 57108, 33243, 2605, 107188, 25540, 15351, 42488, 81284, 141588, 50132, 28308, 57921, 10670, 50153, 17596, 115609, 30380, 8532, 31972, 6119, 33543, 21278, 46756, 31057, 26520, 15478, 69668, 88918, 5889, 2709, 18869, 3227, 9980, 5752, 7733, 36140, 29378, 9736, 90791, 7930, 94844, 2194, 39616, 9094, 34753, 5255, 98464, 101786, 51787, 17789, 29444, 77084, 40071, 31578, 19094, 4456, 4226, 1239, 28256, 203819, 57944, 21107, 65100, 7656, 175914, 75772, 15125, 99637, 29570, 11774, 45392, 41521, 184441, 111685, 9905, 104362, 25395, 17475, 13946, 61524, 8200, 733, 12159, 36378, 3281, 28732, 73220, 8003, 2200, 6392, 8888, 14466, 87525, 41859, 13379, 15020, 63761, 34057, 71288, 55396, 761, 7451, 55237, 40286, 39667, 22059]\n"
       ]
      }
     ],
     "prompt_number": 5
    },
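    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To see what `noise` does without disturbing the seeded global random state used by the rest of this notebook, the sketch below repeats the same logic with a private `random.Random` instance.  The function name `noise_with_rng`, the seed, and the toy fragment sizes are assumptions made for this illustration."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import random\n",
      "\n",
      "def noise_with_rng(lst, small_cutoff, stddev, rng):\n",
      "    # same logic as noise() above, but drawing from the supplied generator\n",
      "    return [int(rng.gauss(0, stddev)) + s for s in lst if s >= small_cutoff]\n",
      "\n",
      "toy_frags = [500, 10000, 699, 700, 25000]\n",
      "toy_rng = random.Random(12345)  # private generator; the global random state is untouched\n",
      "toy_noisy = noise_with_rng(toy_frags, small_cutoff=700, stddev=150, rng=toy_rng)\n",
      "\n",
      "# fragments shorter than the cutoff (500 and 699) are dropped; the rest are perturbed\n",
      "print(len(toy_frags), len(toy_noisy))\n",
      "print(toy_noisy)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },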
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Then we'll write these noisy fragments to a file in SOMAv2's match input format (one fragment per line, each line containing a tab-delimited (size, stddev) pair):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "\n",
      "def write_frags(fname, frags):\n",
      "    with open(fname, \"w\") as om:\n",
      "        for frag in frags:\n",
      "            # SOMA format uses frag sizes in kb; we don't have a per-frag stddev, so use 0.0\n",
      "            line = str(frag/1000.0) + \"\\t\" + str(0.0) + \"\\n\"\n",
      "            om.write(line) \n",
      "\n",
      "write_frags(\"ecoli_optmap\", noisy_frags)\n",
      "    "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
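    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For reference, the tab-delimited layout written above is easy to read back.  The sketch below parses text in that layout into `(size_kb, stddev)` tuples.  `read_frags` is a hypothetical helper (it is not part of TWIN or SOMA), and an in-memory `io.StringIO` object stands in for the file so that the example does not touch `ecoli_optmap`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import io\n",
      "\n",
      "def read_frags(handle):\n",
      "    # each non-empty line holds a fragment size in kb and a stddev, separated by a tab\n",
      "    frags = []\n",
      "    for line in handle:\n",
      "        line = line.strip()\n",
      "        if not line:\n",
      "            continue\n",
      "        size_kb, stddev = line.split('\\t')\n",
      "        frags.append((float(size_kb), float(stddev)))\n",
      "    return frags\n",
      "\n",
      "sample = io.StringIO('40.226\\t0.0\\n81.532\\t0.0\\n122.84\\t0.0\\n')\n",
      "sample_frags = read_frags(sample)\n",
      "print(sample_frags)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },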
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "! ls -l ecoli_optmap"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "-rw------- 1 muggli grad 1207 Dec  3 17:18 ecoli_optmap\r\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can use the included om2bytes.py script to create a binary file of 32-bit integers (TWIN uses the sdsl-lite library, which builds the FM-index directly from the binary file).  \n",
      "\n",
      "The bin size parameter is used to quantize the data.  This should not be necessary in most cases, so we'll use a bin size of 1 bp to effectively turn it off. (Quantizing the fragments results in similar suffixes in the optical map being adjacent in a suffix array.  This in turn allows TWIN to match many suffixes concurrently instead of exploring each candidate as a separate branch in the backtracking search described in the paper.)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "! ~/twin/om2bytes.py ecoli_optmap ecoli_optmap.bin 1\n",
      "\n",
      "! ls -l ecoli_optmap*"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "-rw------- 1 muggli grad 1207 Dec  3 17:18 ecoli_optmap\r\n",
        "-rw------- 1 muggli grad  452 Dec  3 17:18 ecoli_optmap.bin\r\n"
       ]
      }
     ],
     "prompt_number": 8
    },
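    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The following sketch shows one way quantization and 32-bit packing could work.  It is not the code in om2bytes.py: the rounding rule in `quantize` and the little-endian unsigned layout in `pack_frags` are assumptions made for illustration, and the actual script may differ.  At 4 bytes per integer, the 452-byte `ecoli_optmap.bin` listed above corresponds to 113 values, one per fragment in `noisy_frags`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import struct\n",
      "\n",
      "def quantize(frag, bin_size):\n",
      "    # round a fragment length to the nearest multiple of bin_size (assumed rounding rule)\n",
      "    return int(round(frag / float(bin_size))) * bin_size\n",
      "\n",
      "def pack_frags(frags, bin_size):\n",
      "    # pack the quantized lengths as little-endian unsigned 32-bit integers (assumed layout)\n",
      "    values = [quantize(f, bin_size) for f in frags]\n",
      "    return struct.pack('<%dI' % len(values), *values)\n",
      "\n",
      "toy_map = [40226, 81532, 122840]\n",
      "print([quantize(f, 1) for f in toy_map])    # a bin size of 1 leaves the lengths unchanged\n",
      "print([quantize(f, 100) for f in toy_map])  # larger bins collapse nearby lengths together\n",
      "print(len(pack_frags(toy_map, 1)))          # 4 bytes per fragment"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },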
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Creating a File of In Silico Digested Contigs"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Since this tutorial assumes just a reference genome, we'll simulate some contigs from the reference sequence.  First, we'll choose 9 uniformly distributed random endpoints across the genome:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "interior_end_points = [random.randint(0, len(ref_seq)) for i in range(9)]\n",
      "end_points = [0] + interior_end_points + [len(ref_seq)]\n",
      "end_points.sort()\n",
      "\n",
      "print(end_points)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[0, 1186908, 2097533, 2202679, 2665150, 2841819, 2844982, 3929561, 3955024, 4393536, 4639675]\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we'll take consecutive pairs of endpoints (by zipping together this list and a suffix of this list) and write out the corresponding substrings of the reference sequence as a new FASTA file of these simulated (\"fake\") contigs:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "contig_loci = {} # bookkeeping: true start of each contig, to see how closely TWIN's loci (based on accumulated restriction fragment lengths) match\n",
      "\n",
      "with open(\"fake_contigs.fa\", \"w\") as contigs_handle:\n",
      "    for interval_num, (start, end) in enumerate(zip(end_points, end_points[1:])):\n",
      "        subseq = ref_seq[start:end]\n",
      "        if random.randint(0,1) == 0:  # intended: reverse complement with 50% probability\n",
      "            # NOTE: reverse_complement() returns a new record and leaves subseq unchanged, so this call has no effect;\n",
      "            # every contig keeps its forward orientation, as the output of this cell reflects\n",
      "            subseq.reverse_complement()\n",
      "        print(len(subseq))\n",
      "        subseq.id = \"fake_contig_\" + str(interval_num)\n",
      "        contig_loci[subseq.id] = start\n",
      "        Bio.SeqIO.write(subseq, contigs_handle, \"fasta\")\n",
      "        \n",
      "\n",
      "! grep -n \"^>\" fake_contigs.fa\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1186908\n",
        "910625"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "105146\n",
        "462471\n",
        "176669\n",
        "3163"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1084579\n",
        "25463"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "438512\n",
        "246139\n",
        "1:>fake_contig_0 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.\r\n",
        "19784:>fake_contig_1 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.\r\n",
        "34963:>fake_contig_2 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.\r\n",
        "36717:>fake_contig_3 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.\r\n",
        "44426:>fake_contig_4 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.\r\n",
        "47372:>fake_contig_5 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.\r\n",
        "47426:>fake_contig_6 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.\r\n",
        "65504:>fake_contig_7 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.\r\n",
        "65930:>fake_contig_8 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.\r\n",
        "73240:>fake_contig_9 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.\r\n"
       ]
      }
     ],
     "prompt_number": 10
    },
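    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The zip idiom used above pairs each endpoint with its successor.  As a check, the sketch below applies it to the endpoints printed earlier and recovers the contig lengths shown in the output.  `printed_end_points`, `intervals`, and `contig_lengths` are names introduced only for this illustration."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# endpoints copied from the output of the earlier cell\n",
      "printed_end_points = [0, 1186908, 2097533, 2202679, 2665150, 2841819,\n",
      "                      2844982, 3929561, 3955024, 4393536, 4639675]\n",
      "\n",
      "# zip(lst, lst[1:]) yields (lst[0], lst[1]), (lst[1], lst[2]), ...\n",
      "intervals = list(zip(printed_end_points, printed_end_points[1:]))\n",
      "print(intervals[:2])\n",
      "\n",
      "contig_lengths = [hi - lo for lo, hi in intervals]\n",
      "print(contig_lengths)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },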
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open(\"fake_contigs.fa\") as f:\n",
      "    for s in Bio.SeqIO.parse( f, \"fasta\"):\n",
      "        print(len(s))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1186908\n",
        "910625\n",
        "105146\n",
        "462471\n",
        "176669\n",
        "3163\n",
        "1084579"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "25463\n",
        "438512\n",
        "246139\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we'll digest the contigs with the included digest.py:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "! ~/twin/digest.py fake_contigs.fa SwaI  fake_contigs.silico\n",
      "\n",
      "! ls -l fake_contigs*"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "-rw------- 1 muggli grad 4717958 Dec  3 17:18 fake_contigs.fa\r\n",
        "-rw------- 1 muggli grad     995 Dec  3 17:18 fake_contigs.silico\r\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "*n.b.* You can use fake_contigs.silico and ecoli_optmap (not .bin!) as input to SOMAv2's match executable."
     ]
    },
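    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To make the idea of in silico digestion concrete, here is a minimal pure-Python sketch.  It is not the implementation of digest.py, whose internals and output format are not shown in this tutorial.  It assumes SwaI's recognition sequence ATTTAAAT with a blunt cut in the middle (ATTT^AAAT); `toy_digest` and `toy_contig` are hypothetical names."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "SWAI_SITE = 'ATTTAAAT'  # SwaI recognition sequence\n",
      "CUT_OFFSET = 4          # blunt cut between ATTT and AAAT\n",
      "\n",
      "def toy_digest(seq, site=SWAI_SITE, cut_offset=CUT_OFFSET):\n",
      "    # return the lengths of the fragments produced by cutting seq at every site\n",
      "    cuts = []\n",
      "    pos = seq.find(site)\n",
      "    while pos != -1:\n",
      "        cuts.append(pos + cut_offset)\n",
      "        pos = seq.find(site, pos + 1)\n",
      "    bounds = [0] + cuts + [len(seq)]\n",
      "    return [right - left for left, right in zip(bounds, bounds[1:])]\n",
      "\n",
      "toy_contig = 'GGGATTTAAATCCCCATTTAAATTT'\n",
      "print(toy_digest(toy_contig))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },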
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Running twin"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What follows is the raw output.  twin2psl.py can be used to convert this to pattern space layout (.psl) format.  The alignment statistics are output in Python's dictionary literal format, which is suitable for use with Python's eval().\n",
      "\n",
      "The optical map fragments in the alignment pattern appear in the reverse of the order you would expect, because they come off the top of a stack maintained during the backtracking backward search.  Backward alignments will appear to match up with the query contig, as they are reversed before the search begins and then reversed again when they are peeled off the stack.  Asterisks mark *in silico* contig fragments for which there is no corresponding fragment in the optical map."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "40372 - 40226"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "146"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "! ~/twin/twin --opt_map ecoli_optmap.bin --silico fake_contigs.silico --emit_alignment_pattern |tee twin.stdout"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "F value was set to 4.\r\n",
        "search radius value was set to 1000.\r\n",
        "FM-Index indexed 113 elements.\r\n",
        "Constructing suffix array...\r\n",
        "Attempting to mmap 448 elements from ecoli_optmap.bin\r\n",
        "loaded 452 bytes from optical map for sa purposes\r\n",
        "constructing iBWT LUT...\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Done constructing iBWT LUT.\r\n",
        "Loading contigs...\r\n",
        "Done loading contigs.\r\n",
        "Processing 8 in-silico digested contigs...\r\n",
        "processing contig 0\r\n",
        "Matching contig fake_contig_1:(ignored 23313) 8494 32178 6147 381 33256 21211 46904 31191 26548 15499 69671 88912 5684 2995 18866 3209 9946 5696 7890 36218 29300 9849 90849 8071 94857 2354 39876 9124 35005 5183 (ignored 91868) \r\n",
        "Alignment pattern: 5255\t34753\t9094\t39616\t2194\t94844\t7930\t90791\t9736\t29378\t36140\t7733\t5752\t9980\t3227\t18869\t2709\t5889\t88918\t69668\t15478\t26520\t31057\t46756\t21278\t33543\t*\t6119\t31972\t8532\t\r\n",
        "Alignment stats: {'fragnum' : 24, 'locus' : 1210089, 'fval' : 1.54994, 'chi_square_sum' : 23.8616, 'deviation_sum' : 2980, 'num_matched_frags' : 29, 'chi_square_cdf' : 0.264274}\r\n",
        "backward alignment:\r\n",
        " threshold: 10000\r\n",
        "Matching contig fake_contig_0:(ignored 40372) 81429 122993 1449 50262 19426 27340 57193 33258 2668 106869 25515 15428 42593 81181 141556 50372 28516 58088 10436 50207 17283 115590 (ignored 6822) \r\n",
        "Alignment pattern: 115609\t17596\t50153\t10670\t57921\t28308\t50132\t141588\t81284\t42488\t15351\t25540\t107188\t2605\t33243\t57108\t27276\t19636\t50221\t1194\t122840\t81532\t\r\n",
        "Alignment stats: {'fragnum' : 1, 'locus' : 40226, 'fval' : 0.240206, 'chi_square_sum' : 25.603, 'deviation_sum' : 2885, 'num_matched_frags' : 22, 'chi_square_cdf' : 0.730882}\r\n",
        "backward alignment:\r\n",
        " threshold: 10000\r\n",
        "Matching contig fake_contig_6:(ignored 13113) 7446 175804 75774 15015 99569 29752 11632 45454 41590 184238 111696 9620 104361 25244 17760 13830 61572 8186 798 12243 (ignored 19882) \r\n",
        "Alignment pattern: 12159\t733\t8200\t61524\t13946\t17475\t25395\t104362\t9905\t111685\t184441\t41521\t45392\t11774\t29570\t99637\t15125\t75772\t175914\t7656\t\r\n",
        "Alignment stats: {'fragnum' : 70, 'locus' : 2857063, 'fval' : 0.897409, 'chi_square_sum' : 17.2729, 'deviation_sum' : 2218, 'num_matched_frags' : 20, 'chi_square_cdf' : 0.364811}\r\n",
        "backward alignment:\r\n",
        " threshold: 10000\r\n",
        "Matching contig fake_contig_3:(ignored 3070) 51897 17730 29482 76754 39907 31549 19134 472 4454 4352 1175 28145 (ignored 154350) \r\n",
        "Alignment pattern: 28256\t1239\t4226\t4456\t*\t19094\t31578\t40071\t77084\t29444\t17789\t51787\t\r\n",
        "Alignment stats: {'fragnum' : 55, 'locus' : 2204070, 'fval' : 0.894484, 'chi_square_sum' : 8.33596, 'deviation_sum' : 1073, 'num_matched_frags' : 11, 'chi_square_cdf' : 0.317069}\r\n",
        "backward alignment:\r\n",
        " threshold: 10000\r\n",
        "Matching contig fake_contig_8:(ignored 23155) 73260 7831 2129 6393 8817 14379 87592 41932 13256 15058 63691 34277 (ignored 46742) \r\n",
        "Alignment pattern: 34057\t63761\t15020\t13379\t41859\t87525\t14466\t8888\t6392\t2200\t8003\t73220\t\r\n",
        "Alignment stats: {'fragnum' : 93, 'locus' : 3977640, 'fval' : 0.298298, 'chi_square_sum' : 5.71231, 'deviation_sum' : 1033, 'num_matched_frags' : 12, 'chi_square_cdf' : 0.0701158}\r\n",
        "backward alignment:\r\n",
        " threshold: 10000\r\n",
        "Matching contig fake_contig_9:(ignored 24292) 55651 663 891 7295 55369 40318 39581 (ignored 22079) \r\n",
        "Alignment pattern: 39667\t40286\t55237\t7451\t*\t761\t55396\t\r\n",
        "Alignment stats: {'fragnum' : 106, 'locus' : 4417698, 'fval' : 0.215011, 'chi_square_sum' : 5.54707, 'deviation_sum' : 759, 'num_matched_frags' : 6, 'chi_square_cdf' : 0.524214}\r\n",
        "Alignment pattern: 39667\t40286\t55237\t7451\t761\t*\t55396\t\r\n",
        "Alignment stats: {'fragnum' : 106, 'locus' : 4417698, 'fval' : 0.835548, 'chi_square_sum' : 5.87133, 'deviation_sum' : 791, 'num_matched_frags' : 6, 'chi_square_cdf' : 0.562243}\r\n",
        "backward alignment:\r\n",
        " threshold: 10000\r\n",
        "Matching contig fake_contig_4:(ignored 49435) 57747 21022 (ignored 48465) \r\n",
        "Alignment pattern: 21107\t57944\t\r\n",
        "Alignment stats: {'fragnum' : 67, 'locus' : 2712913, 'fval' : 1.32936, 'chi_square_sum' : 2.04596, 'deviation_sum' : 282, 'num_matched_frags' : 2, 'chi_square_cdf' : 0.640477}\r\n",
        "backward alignment:\r\n",
        " threshold: 10000\r\n",
        "Matching contig fake_contig_7:(ignored 16464) 3498 (ignored 5501) \r\n",
        "Alignment pattern: 3227\t\r\n",
        "Alignment stats: {'fragnum' : 38, 'locus' : 1617397, 'fval' : 1.80667, 'chi_square_sum' : 3.26404, 'deviation_sum' : 271, 'num_matched_frags' : 1, 'chi_square_cdf' : 0.929186}\r\n",
        "Alignment pattern: 3281\t\r\n",
        "Alignment stats: {'fragnum' : 91, 'locus' : 3945627, 'fval' : 1.44667, 'chi_square_sum' : 2.09284, 'deviation_sum' : 217, 'num_matched_frags' : 1, 'chi_square_cdf' : 0.85201}\r\n",
        "backward alignment:\r\n",
        "Alignment pattern: 3227\t\r\n",
        "Alignment stats: {'fragnum' : 38, 'locus' : 1617397, 'fval' : 1.80667, 'chi_square_sum' : 3.26404, 'deviation_sum' : 271, 'num_matched_frags' : 1, 'chi_square_cdf' : 0.929186}\r\n",
        "Alignment pattern: 3281\t\r\n",
        "Alignment stats: {'fragnum' : 91, 'locus' : 3945627, 'fval' : 1.44667, 'chi_square_sum' : 2.09284, 'deviation_sum' : 217, 'num_matched_frags' : 1, 'chi_square_cdf' : 0.85201}\r\n",
        " threshold: 10000\r\n",
        "placed somewhere: 8 total alignments: 12\r\n",
        "attempts: 8 skipped:0\r\n"
       ]
      }
     ],
     "prompt_number": 20
    },
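    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a sketch of how the raw output might be consumed directly, the cell below parses one `Alignment pattern` / `Alignment stats` pair copied from the output above (the alignment for fake_contig_4).  It uses `ast.literal_eval`, a safer standard-library alternative to `eval()` for dictionary literals.  The variable names are made up for this illustration, and the sketch does not handle the `*` placeholders.  For this alignment, the sum of absolute differences between the contig fragments and the reversed pattern equals the reported `deviation_sum`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import ast\n",
      "\n",
      "pattern_line = 'Alignment pattern: 21107\\t57944\\t'\n",
      "stats_line = (\"Alignment stats: {'fragnum' : 67, 'locus' : 2712913, 'fval' : 1.32936, \"\n",
      "              \"'chi_square_sum' : 2.04596, 'deviation_sum' : 282, \"\n",
      "              \"'num_matched_frags' : 2, 'chi_square_cdf' : 0.640477}\")\n",
      "contig_frags = [57747, 21022]  # interior fragments of fake_contig_4 from its 'Matching contig' line\n",
      "\n",
      "# optical map fragments are listed in reverse order, so flip them back\n",
      "pattern = [int(tok) for tok in pattern_line.split(':', 1)[1].split()]\n",
      "pattern.reverse()\n",
      "\n",
      "# everything after the first colon is a Python dictionary literal\n",
      "stats = ast.literal_eval(stats_line.split(':', 1)[1].strip())\n",
      "\n",
      "deviations = [abs(c - o) for c, o in zip(contig_frags, pattern)]\n",
      "print(pattern, stats['locus'])\n",
      "print(sum(deviations), stats['deviation_sum'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },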
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we can use twin2psl to convert this data to BLAT's Pattern Space Layout (.psl) format.  It takes three arguments:  \n",
      "1. TWIN's output\n",
      "2. The binary optical map file\n",
      "3. The desired output file."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "! ~/twin/twin2psl.py twin.stdout ecoli_optmap.bin contig_alignments.psl"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "/s/chopin/l/grad/muggli/.local/lib/python2.7/site-packages/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.\r\n",
        "  BiopythonExperimentalWarning)\r\n"
       ]
      }
     ],
     "prompt_number": 21
    },
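    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Comparing the numbers printed so far suggests how an alignment's `locus` relates to where the contig itself begins: subtracting the length of the ignored leading contig fragment from `locus` gives an estimated start for the whole contig on the map.  The sketch below does this for three alignments copied from the TWIN output above and compares the estimates with the true start positions from `end_points`.  This relationship is inferred from the output shown in this notebook rather than taken from TWIN's documentation, and `toy_alignments` is a name introduced for this illustration."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# (contig id, locus, ignored leading fragment, true start), copied from the outputs above\n",
      "toy_alignments = [\n",
      "    ('fake_contig_0', 40226, 40372, 0),\n",
      "    ('fake_contig_1', 1210089, 23313, 1186908),\n",
      "    ('fake_contig_6', 2857063, 13113, 2844982),\n",
      "]\n",
      "\n",
      "for name, locus, ignored, true_start in toy_alignments:\n",
      "    estimated_start = locus - ignored\n",
      "    # print the estimate, the true start, and the difference between them\n",
      "    print(name, estimated_start, true_start, estimated_start - true_start)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },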
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we'll explore these results using Biopython's PSL processing facilities.  Biopython groups individual alignments that share a target name into hits, and groups the hits that share a query name into a query result.  To traverse the individual alignments, we need three nested loops:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open(\"contig_alignments.psl\") as psl:\n",
      "    qresults = Bio.SearchIO.parse(psl, 'blat-psl')\n",
      "\n",
      "    for qresult in qresults: # A qresult groups together all PSL lines with the same Q_NAME\n",
      "        print(\"*\", qresult.id)\n",
      "        for hit in qresult.hits: # A hit is a group of PSL lines that all share the same T_NAME\n",
      "            print(\"    *\", hit.id)\n",
      "            for hsp in hit.hsps: # A High Scoring Pair (HSP) represents a line in a PSL\n",
      "                print(\"        *\", hsp.hit_start_all[0]) # take the position of the first (in our case only) match interval\n",
      "                print(\"        -\", contig_loci[qresult.id])\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "* fake_contig_1\n",
        "    * ecoli_optmap.bin\n",
        "        * 1186776\n",
        "        - 1186908\n",
        "* fake_contig_0\n",
        "    * ecoli_optmap.bin\n",
        "        * -146\n",
        "        - 0\n",
        "* fake_contig_6\n",
        "    * ecoli_optmap.bin\n",
        "        * 2843950\n",
        "        - 2844982\n",
        "* fake_contig_3\n",
        "    * ecoli_optmap.bin\n",
        "        * 2201000\n",
        "        - 2202679\n",
        "* fake_contig_8\n",
        "    * ecoli_optmap.bin\n",
        "        * 3954485\n",
        "        - 3955024\n",
        "* fake_contig_9\n",
        "    * ecoli_optmap.bin\n",
        "        * 4393406\n",
        "        - 4393536\n",
        "        * 4393406\n",
        "        - 4393536\n",
        "* fake_contig_4\n",
        "    * ecoli_optmap.bin\n",
        "        * 2663478\n",
        "        - 2665150\n",
        "* fake_contig_7\n",
        "    * ecoli_optmap.bin\n",
        "        * 1600933\n",
        "        - 3929561\n",
        "        * 3929163\n",
        "        - 3929561\n",
        "        * 1600933\n",
        "        - 3929561\n",
        "        * 3929163\n",
        "        - 3929561\n"
       ]
      }
     ],
     "prompt_number": 22
    }
   ],
   "metadata": {}
  }
 ]
}