Algorithm description for matchPWM (Biostrings)
3
3
Entering edit mode
Yue Li ▴ 370
@yue-li-5245
Last seen 8.9 years ago
USA

Dear List,

I wonder if you could kindly point me to a clear description of the algorithm that matchPWM implementation is based on, preferably a peer-reviewed publication.

I apologize in advance for not being able to see the obvious answer here. But the reference (Wasserman and Sandelin, 2004 Nat. Gen. Rev.) in the matchPWM documentation seems to serve only a general concept of PWM function itself rather than the matchPWM.

The closest paper I could find is MATCH in NAR. Am I correct assume it is the same algorithm for matchPWM?

Thanks much,
Yue

1.    Wasserman, W. W. & Sandelin, A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5, 276–287 (2004).
2.    Kel, A. E. MATCHTM: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 31, 3576–3579 (2003).

biostrings matchPWM • 3.5k views
ADD COMMENT
4
Entering edit mode
@herve-pages-1542
Last seen 17 hours ago
Seattle, WA, United States

Hi Yue,

The algorithm used by matchPWM() is the most naive/straightforward algo you can think of and thus is not based on any peer-reviewed publication. It walks the subject (DNA sequence) and for each position on the subject it computes the score obtained by positioning the pwm (Position Weight Matrix) at this position. If the score is equal or greater than the threshold specified via the min.score argument then a match is reported at that position. The only place where it tries to be a little bit clever is by bailing out early when computing the score at a given position, that is, when there is no chance that completing the score calculation will give a result that is greater or equal to the threshold. This tends to speed up the algo a little bit.

Hope this clarifies,

H.

 

ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 17 hours ago
Seattle, WA, United States

Exactly.

Each pwm has a "maximal theoretical score", which is the score that will be obtained when the pwm is positioned on top of a subsequence that maximizes the score. For example this pwm:

  pwm <- rbind(A=c(0, 1, 3, 9, 0),
               C=c(8, 4, 0, 0, 5),
               G=c(2, 7, 0, 1, 5),
               T=c(0, 0, 7, 4, 1))

has a "maximal theoretical score" that will be obtained for subsequences CGTAC or CGTAG.

You can get the "maximal theoretical score" for pwm with:

  > maxScore(pwm)
  [1] 36

 

You can use PWMscoreStartingAt() to position the pwm at arbitrary positions on a sequence:

  > PWMscoreStartingAt(pwm, "CGTACAAACGTAG", starting.at=1:9)
  [1] 36  5 10 16 26  9  3  8 36

 

When the min.score arg of matchPWM() is specified as a percentage, it's taken as a percentage of this "maximal theoretical score".

Cheers,

H.

 

 

ADD COMMENT
0
Entering edit mode
Yue Li ▴ 370
@yue-li-5245
Last seen 8.9 years ago
USA

Hi Herve,

Thanks a lot for the quick reply. Since the cutoff is defined as percentage, the actual score for a subsequence is just the sum of the scores corresponding to each nucleotide in that sequence divided by maximum of the total scores (for an optimal subsequence)?

Thanks,

Yue

 

ADD COMMENT

Login before adding your answer.

Traffic: 761 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6