Creat PWM from Alignment with Gaps
1
1
Entering edit mode
@herve-pages-1542
Last seen 1 hour ago
Seattle, WA, United States
Hi Lax, On 09/25/2013 09:52 AM, Lakshmanan Iyer wrote: > In Bioconductor, Is it possible to create a PWM from DNA sequence alignment > with gaps to be used to scan sequences? > > When I try it as suggested in "Biostrings" it gives the following error: > >> pwm <- PWM(pfm) > Error in .normargPfm(x) : > invalid PFM 'x': IUPAC ambiguity letters are represented > > The help for "PWM" does say "with no IUPAC ambiguity letters" and Gaps are > ambiguity letters. Gaps in alignments are not represented with ambiguity letters but with the "-" letter: > library(BSgenome.Dmelanogaster.UCSC.dm3) > pairwiseAlignment("ACCACTATCCATTACGGGCCCAGT", unmasked(Dmelanogasterchr2L), type="global-local") Global-Local PairwiseAlignmentsSingleSubject (1 of 1) pattern: [1] ACCACTATCCATTACGGGCCCAGT subject: [1216159] ACCACTAT-CATTACGG--CCAGT score: 9.616879 So let's say you want to compute the consensus matrix from a collection of DNA sequences that contain alignment gaps: > dna <- DNAStringSet(c("CTAATGG--CCAGT", "AT-CATTG-CCAGA", "CTATCAACGG--AG")) > cm <- consensusMatrix(dna) > dim(cm) [1] 17 14 > rownames(cm) [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" If you try to pass this matrix directly to PWM(), it will be rejected with the error you reported because PWM() only accepts a position frequency matrix with 4 rows: 1 for each DNA base letter. > PWM(cm) Error in .normargPfm(x) : invalid PFM 'x': IUPAC ambiguity letters are represented I realize the error message is mis-leading (because "-" is not an ambiguity letter) and I'm going to modify it to make it a little bit more informative. To work around this, one might want to subset 'cm' to keep only the A, C, G, and T rows before passing it to PWM(). Unfortunately that doesn't work either because the Wasserman & Sandelin algo used internally by PWM() to compute a Position Weight Matrix from a Position Frequency Matrix expects the latter to have all its columns sum up to the same value: > PWM(cm[DNA_BASES, ]) Error in .normargPfm(x) : invalid PFM 'x': all columns in 'x' must sum to the same value. If the PFM was obtained by calling consensusMatrix() on a DNAStringSet object, please make sure that this object is rectangular (i.e. has a constant width). To me this sounds like a legitimate situation for bypassing the pfm-to-pwm transformation and use the pfm directly as a pwm: pwm <- cm[DNA_BASES, ] matchPWM(pwm, subject) Hope this helps, H. > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] Biostrings_2.28.0 IRanges_1.18.3 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] stats4_3.0.1 tools_3.0.1 > > -best > -Lax > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 Alignment Cancer Alignment Cancer • 2.0k views ADD COMMENT 0 Entering edit mode @lakshmanan-iyer-1829 Last seen 6.1 years ago United States Thanks Herve That work around worked.... -best -Lax On Tue, Oct 1, 2013 at 2:53 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Lax, > > > On 09/25/2013 09:52 AM, Lakshmanan Iyer wrote: > >> In Bioconductor, Is it possible to create a PWM from DNA sequence >> alignment >> with gaps to be used to scan sequences? >> >> When I try it as suggested in "Biostrings" it gives the following error: >> >> pwm <- PWM(pfm) >>> >> Error in .normargPfm(x) : >> invalid PFM 'x': IUPAC ambiguity letters are represented >> >> The help for "PWM" does say "with no IUPAC ambiguity letters" and Gaps are >> ambiguity letters. >> > > Gaps in alignments are not represented with ambiguity letters but with > the "-" letter: > > > library(BSgenome.**Dmelanogaster.UCSC.dm3) > > pairwiseAlignment("**ACCACTATCCATTACGGGCCCAGT", > unmasked(Dmelanogasterchr2L), type="global-local") > Global-Local PairwiseAlignmentsSingleSubjec**t (1 of 1) > pattern: [1] ACCACTATCCATTACGGGCCCAGT > subject: [1216159] ACCACTAT-CATTACGG--CCAGT > score: 9.616879 > > So let's say you want to compute the consensus matrix from a > collection of DNA sequences that contain alignment gaps: > > > dna <- DNAStringSet(c("CTAATGG--**CCAGT", > "AT-CATTG-CCAGA", > "CTATCAACGG--AG")) > > cm <- consensusMatrix(dna) > > dim(cm) > [1] 17 14 > > rownames(cm) > [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" > > If you try to pass this matrix directly to PWM(), it will be rejected > with the error you reported because PWM() only accepts a position > frequency matrix with 4 rows: 1 for each DNA base letter. > > > PWM(cm) > > Error in .normargPfm(x) : > invalid PFM 'x': IUPAC ambiguity letters are represented > > I realize the error message is mis-leading (because "-" is not an > ambiguity letter) and I'm going to modify it to make it a little bit > more informative. > > To work around this, one might want to subset 'cm' to keep only the > A, C, G, and T rows before passing it to PWM(). Unfortunately that > doesn't work either because the Wasserman & Sandelin algo used > internally by PWM() to compute a Position Weight Matrix from a Position > Frequency Matrix expects the latter to have all its columns sum up to > the same value: > > > PWM(cm[DNA_BASES, ]) > Error in .normargPfm(x) : > invalid PFM 'x': all columns in 'x' must sum to the same value. > If the PFM was obtained by calling consensusMatrix() on a DNAStringSet > object, please make sure that this object is rectangular (i.e. has a > constant width). > > To me this sounds like a legitimate situation for bypassing the > pfm-to-pwm transformation and use the pfm directly as a pwm: > > pwm <- cm[DNA_BASES, ] > matchPWM(pwm, subject) > > Hope this helps, > H. > > >> sessionInfo() >>> >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] Biostrings_2.28.0 IRanges_1.18.3 BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] stats4_3.0.1 tools_3.0.1 >> >> -best >> -Lax >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]