Question

Algorithm description for matchPWM (Biostrings)

3

Entering edit mode

Yue Li ▴ 370

@yue-li-5245

Last seen 8.3 years ago

USA

Dear List,

I wonder if you could kindly point me to a clear description of the algorithm that matchPWM implementation is based on, preferably a peer-reviewed publication.

I apologize in advance for not being able to see the obvious answer here. But the reference (Wasserman and Sandelin, 2004 Nat. Gen. Rev.) in the matchPWM documentation seems to serve only a general concept of PWM function itself rather than the matchPWM.

The closest paper I could find is MATCH in NAR. Am I correct assume it is the same algorithm for matchPWM?

Thanks much,
Yue

1. Wasserman, W. W. & Sandelin, A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5, 276–287 (2004).
2. Kel, A. E. MATCHTM: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 31, 3576–3579 (2003).

biostrings matchPWM • 3.3k views

ADD COMMENT • link updated 9.6 years ago by Hervé Pagès 16k • written 9.6 years ago by Yue Li ▴ 370

score 4 · Answer 1 · 2014-09-17

Hi Yue,

The algorithm used by matchPWM() is the most naive/straightforward algo you can think of and thus is not based on any peer-reviewed publication. It walks the subject (DNA sequence) and for each position on the subject it computes the score obtained by positioning the pwm (Position Weight Matrix) at this position. If the score is equal or greater than the threshold specified via the min.score argument then a match is reported at that position. The only place where it tries to be a little bit clever is by bailing out early when computing the score at a given position, that is, when there is no chance that completing the score calculation will give a result that is greater or equal to the threshold. This tends to speed up the algo a little bit.

Hope this clarifies,

H.

score 1 · Answer 2 · 2014-09-17

Exactly.

Each pwm has a "maximal theoretical score", which is the score that will be obtained when the pwm is positioned on top of a subsequence that maximizes the score. For example this pwm:

  pwm <- rbind(A=c(0, 1, 3, 9, 0),
               C=c(8, 4, 0, 0, 5),
               G=c(2, 7, 0, 1, 5),
               T=c(0, 0, 7, 4, 1))

has a "maximal theoretical score" that will be obtained for subsequences CGTAC or CGTAG.

You can get the "maximal theoretical score" for pwm with:

  > maxScore(pwm)
  [1] 36

You can use PWMscoreStartingAt() to position the pwm at arbitrary positions on a sequence:

  > PWMscoreStartingAt(pwm, "CGTACAAACGTAG", starting.at=1:9)
  [1] 36  5 10 16 26  9  3  8 36

When the min.score arg of matchPWM() is specified as a percentage, it's taken as a percentage of this "maximal theoretical score".

Cheers,

H.

score 0 · Answer 3 · 2014-09-17

0

Entering edit mode

Yue Li ▴ 370

@yue-li-5245

Last seen 8.3 years ago

USA

Hi Herve,

Thanks a lot for the quick reply. Since the cutoff is defined as percentage, the actual score for a subsequence is just the sum of the scores corresponding to each nucleotide in that sequence divided by maximum of the total scores (for an optimal subsequence)?

Thanks,

Yue

ADD COMMENT • link 9.6 years ago Yue Li ▴ 370