PWMmatch: position weight matrix or position frequency matrix?

0

Entering edit mode

Zuzanna Makowska ▴ 10

@zuzanna-makowska-4493

Last seen 10.6 years ago

Dear List, I have a question regarding the matchPWM function of Biostrings package. The help page for the function states that it requires a position weight matrix as an input. At the same time I found a post on the list giving a following example of the use of the function: (quoting: [BioC] matching transcription factor binding sites Herve Pages hpages at fhcrc.org Sat Apr 19 02:41:03 CEST 2008) Suppose 'pwm' contains a Position Weight Matrix, let's say: pwm <- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7), C=c( 1, 0, 1, 0, 1, 18, 0, 2), G=c(17, 0, 0, 0, 1, 0, 0, 3), T=c( 1, 20, 0, 0, 0, 1, 0, 8)) Note that this is just a standard integer matrix with the 4 DNA base letters as row names (having these row names is mandatory). m <- matchPWM(pwm, chr1, min.score="90%") It seems to me that the matrix in this example is a position frequency matrix and not a position weight matrix (the difference between the two is explained nicely in: Applied Bioinformatics for the identification of regulatory elements; WW Wasserman&A Sandelin, Nat Rev Genet, 2004). Could somebody clarify what is the expected input for this function? Thanks in advance, Zuzanna Makowska [[alternative HTML version deleted]]

Transcription Transcription • 2.6k views

ADD COMMENT • link updated 14.1 years ago by Hervé Pagès 16k • written 14.1 years ago by Zuzanna Makowska ▴ 10

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.2 years ago

United States

Hi Zuzanna, The input for the matchPWM function can be either a position weight matrix created with the PWM function or a matrix of frequencies. The connection between the two is the type argument in the PWM function. The default for the type argument is "log2probratio" which understandably gives slightly different results than those obtained when using a frequency matrix as input. If type="prob", the results are the same. See ?PWM for details. We use the PWM function with type="prob" to create a position weight matrix, library(Biostrings) sub <- DNAString("AATCGCGTAGGCGTACCGAT") sset <- DNAStringSet(c("AGTT", "ATGC", "AACG", "AATG"), start=c(1,1,1)) m1 <- matchPWM(PWM(sset, type="prob"), sub) Equivalently we can input a frequency matrix created as described on the man page, mset <- rbind(A=c(4, 2, 0,0), C=c(0,0,1,1), G=c(0,1,1,2), T=c(0,1,2,1)) m2 <- matchPWM(mset, sub) identical(m1, m2) Valerie > sessionInfo() R version 2.12.0 Patched (2010-10-18 r53360) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.18.2 [3] GenomicFeatures_1.2.3 GenomicRanges_1.2.2 [5] Biostrings_2.18.2 IRanges_1.8.8 loaded via a namespace (and not attached): [1] Biobase_2.10.0 biomaRt_2.6.0 DBI_0.2-5 RCurl_1.5-0 [5] RSQLite_0.9-4 rtracklayer_1.10.6 tools_2.12.0 XML_3.2-0 On 02/17/2011 06:59 AM, Zuzanna Makowska wrote: > Dear List, > > I have a question regarding the matchPWM function of Biostrings package. > > The help page for the function states that it requires a position weight matrix as an input. At the same time I found a post on the list giving a following example of the use of the function: > > (quoting: > [BioC] matching transcription factor binding sites > > Herve Pages hpages at fhcrc.org > > Sat Apr 19 02:41:03 CEST 2008) > > Suppose 'pwm' contains a Position Weight Matrix, let's say: > > pwm <- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7), > C=c( 1, 0, 1, 0, 1, 18, 0, 2), > G=c(17, 0, 0, 0, 1, 0, 0, 3), > T=c( 1, 20, 0, 0, 0, 1, 0, 8)) > > Note that this is just a standard integer matrix with the 4 DNA base letters > as row names (having these row names is mandatory). > m <- matchPWM(pwm, chr1, min.score="90%") > > It seems to me that the matrix in this example is a position frequency matrix and not a position weight matrix (the difference between the two is explained nicely in: Applied Bioinformatics for the identification of regulatory elements; WW Wasserman&A Sandelin, Nat Rev Genet, 2004). > > Could somebody clarify what is the expected input for this function? > > Thanks in advance, > > Zuzanna Makowska > > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 14.1 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 4 hours ago

Seattle, WA, United States

Hi Zuzanna, On 02/17/2011 06:59 AM, Zuzanna Makowska wrote: > Dear List, > > I have a question regarding the matchPWM function of Biostrings package. > > The help page for the function states that it requires a position weight matrix as an input. At the same time I found a post on the list giving a following example of the use of the function: > > (quoting: > [BioC] matching transcription factor binding sites > > Herve Pages hpages at fhcrc.org > > Sat Apr 19 02:41:03 CEST 2008) > > Suppose 'pwm' contains a Position Weight Matrix, let's say: > > pwm<- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7), > C=c( 1, 0, 1, 0, 1, 18, 0, 2), > G=c(17, 0, 0, 0, 1, 0, 0, 3), > T=c( 1, 20, 0, 0, 0, 1, 0, 8)) > > Note that this is just a standard integer matrix with the 4 DNA base letters > as row names (having these row names is mandatory). > m<- matchPWM(pwm, chr1, min.score="90%") > > It seems to me that the matrix in this example is a position frequency matrix and not a position weight matrix (the difference between the two is explained nicely in: Applied Bioinformatics for the identification of regulatory elements; WW Wasserman&A Sandelin, Nat Rev Genet, 2004). Thanks for the pointer to Wasserman & Sandelin's paper. I confirm that the matchPWM() function expects the input to be a position *weight* matrix. What makes the 'pwm' object above maybe look like a position *frequency* matrix is because, unlike in Wasserman's paper, it contains non-negative integer weights. Furthermore, all the columns sum to the same value: > colSums(pwm) [1] 20 20 20 20 20 20 20 20 I understand that this is indeed misleading. But 'pwm' could also be something like: pwm <- rbind(A=c(0.06, -0.02, 0.30), C=c(0.00, 0.17, 0.00), G=c(0.03, 0.05, 0.12), T=c(0.22, -0.01, 0.08)) It is really treated by the matchPWM() function as a position-specific scoring matrices. You can check this by computing the score for a few given positions: > PWMscoreStartingAt(pwm, DNAString("TTCAA"), starting.at=1:3) [1] 0.21 0.69 0.28 Then, as you can see, matchPWM() returns the match corresponding to the position that produces the highest score: > matchPWM(pwm, DNAString("TTCAA")) Views on a 5-letter DNAString subject subject: TTCAA views: start end width [1] 2 4 3 [TCA] Finally note that the Biostrings package doesn't provide a tool to convert a position frequency matrix (that can be obtained with consensusMatrix) into a position weight matrix. Hope this helps, H. > > Could somebody clarify what is the expected input for this function? > > Thanks in advance, > > Zuzanna Makowska > > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 14.1 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Zuzanna, On 02/22/2011 10:59 AM, Hervé Pagès wrote: [...] > Finally note that the Biostrings package doesn't provide a tool > to convert a position frequency matrix (that can be obtained with > consensusMatrix) into a position weight matrix. More on this and to clarify the role of the PWM() function mentioned by Val. PWM() can be used on a set of short sequences to compute the associated Position Weight Matrix using the Wasserman & Sandelin's approach. As its name suggests, PWM() will always return a PWM, not a PFM. The 'type' argument controls the type of Position Weight Matrix that is returned. The 'prior.params' argument controls the Dirichlet conjugate prior. By this argument is set to c(A=0.25, C=0.25, G=0.25, T=0.25). In the example given by Val, PWM(sset, type="prob") returns a PWM that is just the PFM divided by a constant (this constant being the number of short sequences in the input). So, in that particular case, matchPWM() will give the same result whether you pass it the PFM or the PWM obtained with PWM(sset , type="prob"). (Multiplying the PWM by a constant doesn't affect the output of matchPWM). But this is only a particular situation. It's not true in general that PWM( , type="prob") will return a PWM that is just the PFM divided by a constant. For example it would not be the case anymore if you were using a 'prior.params' vector that contains values that are not all the same. Internally PWM() proceeds in 2 steps: 1. Computes the PFM of the input (i.e. of the set of short sequences). It uses consensusMatrix() for this. 2. Converts this PFM into a PWM using the Wasserman & Sandelin's approach. So far it was not possible for the user to use PWM() to do just 2. I've just added this capability to the devel version of Biostrings (version 2.19.11, will become available via biocLite() in the next 12 hours or so). So now you can do: library(Biostrings) data(HNF4alpha) pfm <- consensusMatrix(HNF4alpha) pwm <- PWM(pfm) which is equivalent to doing PWM(HNF4alpha). This means you can use PWM() on a PFM. You don't need to have access to the short sequences that were used to generate this PFM anymore. Also, having this conversion from PFM to PWM isolated allows the user to have a closer look at it. This is explained in the man page of the updated Biostrings package. Hope this helps. Cheers, H. > sessionInfo() R version 2.13.0 Under development (unstable) (2011-01-08 r53945) Platform: i686-pc-linux-gnu (32-bit) locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8 [7] LC_PAPER=en_US.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.19.11 IRanges_1.9.25 loaded via a namespace (and not attached): [1] Biobase_2.11.8 tools_2.13.0 > > Hope this helps, > H. > >> >> Could somebody clarify what is the expected input for this function? >> >> Thanks in advance, >> >> Zuzanna Makowska >> >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 14.1 years ago Hervé Pagès 16k

0

Entering edit mode

3 little additions... (see below) On 02/24/2011 12:14 PM, Hervé Pagès wrote: > Hi Zuzanna, > > On 02/22/2011 10:59 AM, Hervé Pagès wrote: > [...] >> Finally note that the Biostrings package doesn't provide a tool >> to convert a position frequency matrix (that can be obtained with >> consensusMatrix) into a position weight matrix. > > More on this and to clarify the role of the PWM() function mentioned > by Val. > > PWM() can be used on a set of short sequences to compute the associated > Position Weight Matrix using the Wasserman & Sandelin's approach. > As its name suggests, PWM() will always return a PWM, not a PFM. > The 'type' argument controls the type of Position Weight Matrix that > is returned. > The 'prior.params' argument controls the Dirichlet conjugate prior. > By this argument is set to c(A=0.25, C=0.25, G=0.25, T=0.25). ^^^ by default > > In the example given by Val, PWM(sset, type="prob") returns a PWM > that is just the PFM divided by a constant (this constant being > the number of short sequences in the input). Not true that this constant is the number of short sequences in the input. However, it doesn't matter what this constant is... > So, in that particular > case, matchPWM() will give the same result whether you pass it the > PFM or the PWM obtained with PWM(sset , type="prob"). (Multiplying > the PWM by a constant doesn't affect the output of matchPWM). > > But this is only a particular situation. It's not true in general > that PWM( , type="prob") will return a PWM that is just the > PFM divided by a constant. For example it would not be the case > anymore if you were using a 'prior.params' vector that contains > values that are not all the same. Just to be more concrete about this. By just adding 1 sequence to 'sset', things look very different: > sset <- DNAStringSet(c("AGTT", "ATGC", "AACG", "AATG", "CCAA")) > consensusMatrix(sset)[DNA_BASES, ] [,1] [,2] [,3] [,4] A 4 2 1 1 C 1 1 1 1 G 0 1 1 2 T 0 1 2 1 > PWM(sset, type="prob") [,1] [,2] [,3] [,4] A 0.46428571 0.17857143 0.03571429 0.03571429 C 0.03571429 0.03571429 0.03571429 0.03571429 G -0.10714286 0.03571429 0.03571429 0.17857143 T -0.10714286 0.03571429 0.17857143 0.03571429 Note the negative weights! Cheers, H. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 14.1 years ago Hervé Pagès 16k

Login before adding your answer.