Creat PWM from Alignment with Gaps
1
1
Entering edit mode
@herve-pages-1542
Last seen 4 days ago
Seattle, WA, United States
Hi Lax, On 09/25/2013 09:52 AM, Lakshmanan Iyer wrote: > In Bioconductor, Is it possible to create a PWM from DNA sequence alignment > with gaps to be used to scan sequences? > > When I try it as suggested in "Biostrings" it gives the following error: > >> pwm <- PWM(pfm) > Error in .normargPfm(x) : > invalid PFM 'x': IUPAC ambiguity letters are represented > > The help for "PWM" does say "with no IUPAC ambiguity letters" and Gaps are > ambiguity letters. Gaps in alignments are not represented with ambiguity letters but with the "-" letter: > library(BSgenome.Dmelanogaster.UCSC.dm3) > pairwiseAlignment("ACCACTATCCATTACGGGCCCAGT", unmasked(Dmelanogaster$chr2L), type="global-local") Global-Local PairwiseAlignmentsSingleSubject (1 of 1) pattern: [1] ACCACTATCCATTACGGGCCCAGT subject: [1216159] ACCACTAT-CATTACGG--CCAGT score: 9.616879 So let's say you want to compute the consensus matrix from a collection of DNA sequences that contain alignment gaps: > dna <- DNAStringSet(c("CTAATGG--CCAGT", "AT-CATTG-CCAGA", "CTATCAACGG--AG")) > cm <- consensusMatrix(dna) > dim(cm) [1] 17 14 > rownames(cm) [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" If you try to pass this matrix directly to PWM(), it will be rejected with the error you reported because PWM() only accepts a position frequency matrix with 4 rows: 1 for each DNA base letter. > PWM(cm) Error in .normargPfm(x) : invalid PFM 'x': IUPAC ambiguity letters are represented I realize the error message is mis-leading (because "-" is not an ambiguity letter) and I'm going to modify it to make it a little bit more informative. To work around this, one might want to subset 'cm' to keep only the A, C, G, and T rows before passing it to PWM(). Unfortunately that doesn't work either because the Wasserman & Sandelin algo used internally by PWM() to compute a Position Weight Matrix from a Position Frequency Matrix expects the latter to have all its columns sum up to the same value: > PWM(cm[DNA_BASES, ]) Error in .normargPfm(x) : invalid PFM 'x': all columns in 'x' must sum to the same value. If the PFM was obtained by calling consensusMatrix() on a DNAStringSet object, please make sure that this object is rectangular (i.e. has a constant width). To me this sounds like a legitimate situation for bypassing the pfm-to-pwm transformation and use the pfm directly as a pwm: pwm <- cm[DNA_BASES, ] matchPWM(pwm, subject) Hope this helps, H. > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] Biostrings_2.28.0 IRanges_1.18.3 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] stats4_3.0.1 tools_3.0.1 > > -best > -Lax > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
Alignment Cancer Alignment Cancer • 3.1k views
ADD COMMENT
0
Entering edit mode
@lakshmanan-iyer-1829
Last seen 9.3 years ago
United States
Thanks Herve That work around worked.... -best -Lax On Tue, Oct 1, 2013 at 2:53 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Lax, > > > On 09/25/2013 09:52 AM, Lakshmanan Iyer wrote: > >> In Bioconductor, Is it possible to create a PWM from DNA sequence >> alignment >> with gaps to be used to scan sequences? >> >> When I try it as suggested in "Biostrings" it gives the following error: >> >> pwm <- PWM(pfm) >>> >> Error in .normargPfm(x) : >> invalid PFM 'x': IUPAC ambiguity letters are represented >> >> The help for "PWM" does say "with no IUPAC ambiguity letters" and Gaps are >> ambiguity letters. >> > > Gaps in alignments are not represented with ambiguity letters but with > the "-" letter: > > > library(BSgenome.**Dmelanogaster.UCSC.dm3) > > pairwiseAlignment("**ACCACTATCCATTACGGGCCCAGT", > unmasked(Dmelanogaster$chr2L), type="global-local") > Global-Local PairwiseAlignmentsSingleSubjec**t (1 of 1) > pattern: [1] ACCACTATCCATTACGGGCCCAGT > subject: [1216159] ACCACTAT-CATTACGG--CCAGT > score: 9.616879 > > So let's say you want to compute the consensus matrix from a > collection of DNA sequences that contain alignment gaps: > > > dna <- DNAStringSet(c("CTAATGG--**CCAGT", > "AT-CATTG-CCAGA", > "CTATCAACGG--AG")) > > cm <- consensusMatrix(dna) > > dim(cm) > [1] 17 14 > > rownames(cm) > [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" > > If you try to pass this matrix directly to PWM(), it will be rejected > with the error you reported because PWM() only accepts a position > frequency matrix with 4 rows: 1 for each DNA base letter. > > > PWM(cm) > > Error in .normargPfm(x) : > invalid PFM 'x': IUPAC ambiguity letters are represented > > I realize the error message is mis-leading (because "-" is not an > ambiguity letter) and I'm going to modify it to make it a little bit > more informative. > > To work around this, one might want to subset 'cm' to keep only the > A, C, G, and T rows before passing it to PWM(). Unfortunately that > doesn't work either because the Wasserman & Sandelin algo used > internally by PWM() to compute a Position Weight Matrix from a Position > Frequency Matrix expects the latter to have all its columns sum up to > the same value: > > > PWM(cm[DNA_BASES, ]) > Error in .normargPfm(x) : > invalid PFM 'x': all columns in 'x' must sum to the same value. > If the PFM was obtained by calling consensusMatrix() on a DNAStringSet > object, please make sure that this object is rectangular (i.e. has a > constant width). > > To me this sounds like a legitimate situation for bypassing the > pfm-to-pwm transformation and use the pfm directly as a pwm: > > pwm <- cm[DNA_BASES, ] > matchPWM(pwm, subject) > > Hope this helps, > H. > > >> sessionInfo() >>> >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] Biostrings_2.28.0 IRanges_1.18.3 BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] stats4_3.0.1 tools_3.0.1 >> >> -best >> -Lax >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD COMMENT

Login before adding your answer.

Traffic: 686 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6