Main.Biopython-motif History

Hide minor edits - Show changes to markup

April 22, 2010, at 12:45 PM MST by 10.84.44.65 -
Added lines 1-125:

(:source lang=python:)

""" Using the motif module of biopython Note that the python tutorial does not contain information about the Motif package. Documentation is available in the directory Doc/cookbook/motif of the biopython distribution. """

from Bio import Motif

  1. Let's create an instance of the E2F1 motif (downloaded from the
  2. jaspar database):

motif=Motif.read(open("MA0024.1.pfm"), "jaspar-pfm")

  1. the format method displays the motif in a variety of formats:

print motif.format('transfac')

  1. You can also get a "logo" image that represents the motif:

motif.weblogo('e2f1.png')

  1. The length method does what we would expect:

print len(motif)

  1. The jaspar format for a motif specifies a matrix of counts: how many times
  2. each nucleotide occurs in each position of the motif.
  3. This information is stored in the 'counts' instance variable of a motif.
  4. Let's see what design choice the biopython folks did for storing this
  5. information:

motif.counts

  1. The matrix of counts is converted into a matrix of frequencies.
  2. This is called a position weight matrix (PWM).
  3. The PWM associated with a motif can be accessed as:

motif.pwm()

  1. Notice that the pwd assigns a non-zero frequency to every nucleotide at every
  2. position, even if none have been observed. Those are the result of what are
  3. called "pseuodo-counts" which will be explained next in the context of scoring
  4. a sequence with respect to the motif.
  5. The consensus sequence is the mostly like sequence represented by the motif.
  6. It can be generated by:

motif.consensus()

  1. Let's address the question of how well does a sequence matches the motif.
  2. We'll do that by assigning a score that is a sum of lod-odds ratios.
  3. A motif instance has a variable associated with it called _log_oddds:

motif._log_odds

  1. The maximum score that can be achieved for the motif

motif.max_score()

  1. By default the background is assumed to be uniform:

motif.background

def compute_background(file_name) :

    from Bio import SeqIO
    background = {'A':0.0, 'C':0.0, 'G':0.0, 'T':0.0}
    count = 0.0
    for seq_record in SeqIO.parse(file_name, "fasta") :
        seq = seq_record.seq
        for nucl in seq :
            if nucl in background :
                background[nucl] += 1
            count += 1
    for nucl in background :
        background[nucl] /= count
    return background
  1. Let's see that the max score changes when we modify the background:

motif.max_score() motif.background = compute_background('human_promoter500.fasta') motif._log_odds_is_current = False motif.max_score()

  1. Let's look for matches of the motif in DNA sequences

from Bio.Seq import Seq seq = Seq("AATTTCGCGCAA") hits = motif.search_pwm(seq, threshold = 0.7 * max_score)

  1. Note that the search_pwm method returns an iterator.
  2. This iterator returns tuples that contain the position and score of all the hits:

for pos, score in hits :

    print pos, score
  1. Let's now search for motif hits in the promoter regions of human genes

from Bio import Motif motif=Motif.read(open("MA0024.1.pfm"), "jaspar-pfm") max_score = motif.max_score() from Bio import SeqIO for seq_record in SeqIO.parse('human_promoter500.fasta', "fasta") :

    seq = seq_record.seq
    id = seq_record.id
    search_results = motif.search_pwm(seq, threshold=0.7 * max_score)
    for pos, score in search_results :
        print id, pos, score, '+'
  1. We need to understand better what it means when there is a hit
  2. on the negative strand

from Bio import Motif motif=Motif.read(open("MA0024.1.pfm"), "jaspar-pfm") max_score = motif.max_score() from Bio import SeqIO for seq_record in SeqIO.parse('human_promoter500.fasta', "fasta") :

    seq = seq_record.seq
    id = seq_record.id
    rc = seq.reverse_complement()
    search_results = motif.search_pwm(seq, threshold=0.7 * max_score)
    for pos, score in search_results :
        if pos > 0 :
            print id, pos, score, '+'
            print seq[pos:pos + len(motif)]
        else :
            print id, pos, score, '-'
            print seq[-pos : -pos + len(motif)]
    search_results = motif.search_pwm(rc, threshold=0.7 * max_score, both=False)
    for pos, score in search_results :
        print id, pos, score, '-'
        print rc[pos:pos + len(motif)]

(:sourceend:)