consensusString Function

0

Entering edit mode

Erik Wright ▴ 210

@erik-wright-4003

Last seen 9.6 years ago

Hello, I am trying to get a consensus string for a DNAStringSet, but I am getting an error. The documentation for consensusString says the argument "x" is either a consensus matrix or an XStringSet. So this should work, right?: > myDNAStringSet <- DNAStringSet(c("NNNN","ACTG")) > consensusString(myDNAStringSet) Error in .local(x, ...) : 'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)] Specifying a threshold in the arguments doesn't seem to make a difference. Thanks!, Erik

• 2.1k views

ADD COMMENT • link updated 14.1 years ago by Patrick Aboyoun ★ 1.6k • written 14.1 years ago by Erik Wright ▴ 210

0

Entering edit mode

Heidi Dvinge ★ 2.0k

@heidi-dvinge-2195

Last seen 9.6 years ago

Hello Erik, unfortunately I'm not familiar with the Biostrings package, so I can't tell you why this doesn't work, but until someone else can answer, this seems to be a work-around. Apparently, consensusString doesn't handle Ns. > test <- DNAStringSet(c("AANN","ACTG")) > consensusString(test) Error in .local(x, ...) : 'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)] If there are no Ns things are okay though. > test2 <- DNAStringSet(c("AAAA","ACTG")) > consensusString(test2) [1] "AMWR" However, Ns seem acceptable if the consensus matrix is calculated first, although they might result in ?s where no consensus could be found. > test3 <- consensusMatrix(test) > consensusString(test3) [1] "A???" HTH \Heidi > Hello, > > I am trying to get a consensus string for a DNAStringSet, but I am getting > an error. The documentation for consensusString says the argument "x" is > either a consensus matrix or an XStringSet. So this should work, right?: >> myDNAStringSet <- DNAStringSet(c("NNNN","ACTG")) >> consensusString(myDNAStringSet) > Error in .local(x, ...) : > 'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)] > > Specifying a threshold in the arguments doesn't seem to make a difference. > > Thanks!, > Erik > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 14.1 years ago Heidi Dvinge ★ 2.0k

0

Entering edit mode

Hi Erik, Herv'e please always provide the output of sessionInfo(), and a complete reproducible example (you let Heidi and the others guess that you're talking about the Biostrings package). With a more recent version of Biostrings, I get: library("Biostrings") consensusString( DNAStringSet(c("AAAA","ACTG")) ) # [1] "AMWR" consensusString( DNAStringSet(c("AAAB","ACTG")) ) # Error in FUN(newX[, i], ...) : 'ambiguityMap' is missing some combinations of row names And going into the debugger where the error is caused, i.e. within the function "consensusLetter", the expression i <- paste(all_letters[col >= threshold], collapse = "") is evaluated where Browse[1]> all_letters [1] "A" "C" "G" "T" "B" ## length is 5 Browse[1]> col [1] 1 0 0 0 ## length is 4 Browse[1]> i [1] "AB" ## recycling rule was applied which seems unintended and with some more insight will probably lead to the fixing of a bug. Best wishes Wolfgang > sessionInfo() R version 2.12.0 Under development (unstable) (2010-04-06 r51617) x86_64-unknown-linux-gnu locale: [1] LC_CTYPE=C LC_NUMERIC=C LC_TIME=C [4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=it_IT.UTF-8 [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices datasets utils methods base other attached packages: [1] Biostrings_2.15.26 IRanges_1.5.74 fortunes_1.3-7 loaded via a namespace (and not attached): [1] Biobase_2.7.5 Heidi Dvinge ha scritto: > Hello Erik, > > unfortunately I'm not familiar with the Biostrings package, so I can't > tell you why this doesn't work, but until someone else can answer, this > seems to be a work-around. > > Apparently, consensusString doesn't handle Ns. > >> test <- DNAStringSet(c("AANN","ACTG")) >> consensusString(test) > Error in .local(x, ...) : > 'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)] > > If there are no Ns things are okay though. > >> test2 <- DNAStringSet(c("AAAA","ACTG")) >> consensusString(test2) > [1] "AMWR" > > However, Ns seem acceptable if the consensus matrix is calculated first, > although they might result in ?s where no consensus could be found. > >> test3 <- consensusMatrix(test) >> consensusString(test3) > [1] "A???" > > HTH > \Heidi > > >> Hello, >> >> I am trying to get a consensus string for a DNAStringSet, but I am getting >> an error. The documentation for consensusString says the argument "x" is >> either a consensus matrix or an XStringSet. So this should work, right?: >>> myDNAStringSet <- DNAStringSet(c("NNNN","ACTG")) >>> consensusString(myDNAStringSet) >> Error in .local(x, ...) : >> 'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)] >> >> Specifying a threshold in the arguments doesn't seem to make a difference. >> >> Thanks!, >> Erik >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Best wishes Wolfgang -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber/contact

ADD REPLY • link 14.1 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Erik, Heidi, and Wolfgang, To bring this thread full circle, Biostrings::consensusString didn't support ambiguity letters in input strings for BioC <= 2.5 (R <= 2.10). This support has been added to BioC 2.6 (R 2.11), but as Wolfgang mentioned there was a bug in the code. This bug has now been fixed in BioC 2.6 and will be available for download from bioconductor.org within 36 hours. Here are some examples of consensusString operating on DNAStringSet objects containing ambiguity letters: > consensusString(DNAStringSet(c("NNNN","ACTG"))) [1] "ACTG" > consensusString(DNAStringSet(c("AANN","ACTG"))) [1] "AMTG" > consensusString(DNAStringSet(c("ACAG","ACAR"))) [1] "ACAR" > consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG"))) [1] "ACAG" > sessionInfo() R version 2.11.0 alpha (2010-04-04 r51591) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.15.27 IRanges_1.5.74 loaded via a namespace (and not attached): [1] Biobase_2.7.5 On 4/6/10 2:36 PM, Wolfgang Huber wrote: > Hi Erik, Herv'e > > please always provide the output of sessionInfo(), and a complete > reproducible example (you let Heidi and the others guess that you're > talking about the Biostrings package). > > With a more recent version of Biostrings, I get: > > library("Biostrings") > consensusString( DNAStringSet(c("AAAA","ACTG")) ) > # [1] "AMWR" > > consensusString( DNAStringSet(c("AAAB","ACTG")) ) > # Error in FUN(newX[, i], ...) : > 'ambiguityMap' is missing some combinations of row names > > And going into the debugger where the error is caused, i.e. within the > function "consensusLetter", the expression > > i <- paste(all_letters[col >= threshold], collapse = "") > > is evaluated where > Browse[1]> all_letters > [1] "A" "C" "G" "T" "B" ## length is 5 > Browse[1]> col > [1] 1 0 0 0 ## length is 4 > Browse[1]> i > [1] "AB" ## recycling rule was applied > > which seems unintended and with some more insight will probably lead > to the fixing of a bug. > > Best wishes > Wolfgang > > > > sessionInfo() > R version 2.12.0 Under development (unstable) (2010-04-06 r51617) > x86_64-unknown-linux-gnu > > locale: > [1] LC_CTYPE=C LC_NUMERIC=C LC_TIME=C > [4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=it_IT.UTF-8 > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices datasets utils methods base > > other attached packages: > [1] Biostrings_2.15.26 IRanges_1.5.74 fortunes_1.3-7 > > loaded via a namespace (and not attached): > [1] Biobase_2.7.5 > > Heidi Dvinge ha scritto: >> Hello Erik, >> >> unfortunately I'm not familiar with the Biostrings package, so I can't >> tell you why this doesn't work, but until someone else can answer, this >> seems to be a work-around. >> >> Apparently, consensusString doesn't handle Ns. >> >>> test <- DNAStringSet(c("AANN","ACTG")) >>> consensusString(test) >> Error in .local(x, ...) : >> 'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)] >> >> If there are no Ns things are okay though. >> >>> test2 <- DNAStringSet(c("AAAA","ACTG")) >>> consensusString(test2) >> [1] "AMWR" >> >> However, Ns seem acceptable if the consensus matrix is calculated first, >> although they might result in ?s where no consensus could be found. >> >>> test3 <- consensusMatrix(test) >>> consensusString(test3) >> [1] "A???" >> >> HTH >> \Heidi >> >> >>> Hello, >>> >>> I am trying to get a consensus string for a DNAStringSet, but I am >>> getting >>> an error. The documentation for consensusString says the argument >>> "x" is >>> either a consensus matrix or an XStringSet. So this should work, >>> right?: >>>> myDNAStringSet <- DNAStringSet(c("NNNN","ACTG")) >>>> consensusString(myDNAStringSet) >>> Error in .local(x, ...) : >>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)] >>> >>> Specifying a threshold in the arguments doesn't seem to make a >>> difference. >>> >>> Thanks!, >>> Erik >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 14.1 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

Hi Patrick, Thanks for closing the loop on this issue. I have a question about the outputs you show below: If two strings are ACAG and ACAR where R can be A or G, then it makes sense to me that the consensus is ACAR. Why then is an N treated differently than an R? If two strings are NNNN and ACTG, where N can be A, C, T, or G, then based on the prior example the output should be NNNN, but the output you show is ACTG. If there are three sequences, ACAG, ACAR, and ACAG, and the threshold is set at the default 25% (for a DNAStringSet) then why is the output ACAG? If an R is a C or G, and the other two codons in the final position are G's then the sum is two G's and one C. One in three (33.33...%) being C is greater than the default threshold of 25%, so I would expect the consensus character to be R. Therefore the output should be ACAR. Thanks for helping me understand the output. Smiles, Erik On Apr 6, 2010, at 5:29 PM, Patrick Aboyoun wrote: > Erik, Heidi, and Wolfgang, > To bring this thread full circle, Biostrings::consensusString didn't support ambiguity letters in input strings for BioC <= 2.5 (R <= 2.10). This support has been added to BioC 2.6 (R 2.11), but as Wolfgang mentioned there was a bug in the code. This bug has now been fixed in BioC 2.6 and will be available for download from bioconductor.org within 36 hours. Here are some examples of consensusString operating on DNAStringSet objects containing ambiguity letters: > > > consensusString(DNAStringSet(c("NNNN","ACTG"))) > [1] "ACTG" > > consensusString(DNAStringSet(c("AANN","ACTG"))) > [1] "AMTG" > > consensusString(DNAStringSet(c("ACAG","ACAR"))) > [1] "ACAR" > > consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG"))) > [1] "ACAG" > > sessionInfo() > R version 2.11.0 alpha (2010-04-04 r51591) > i386-apple-darwin9.8.0 > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Biostrings_2.15.27 IRanges_1.5.74 > > loaded via a namespace (and not attached): > [1] Biobase_2.7.5 > > > > On 4/6/10 2:36 PM, Wolfgang Huber wrote: >> Hi Erik, Herv'e >> >> please always provide the output of sessionInfo(), and a complete reproducible example (you let Heidi and the others guess that you're talking about the Biostrings package). >> >> With a more recent version of Biostrings, I get: >> >> library("Biostrings") >> consensusString( DNAStringSet(c("AAAA","ACTG")) ) >> # [1] "AMWR" >> >> consensusString( DNAStringSet(c("AAAB","ACTG")) ) >> # Error in FUN(newX[, i], ...) : >> 'ambiguityMap' is missing some combinations of row names >> >> And going into the debugger where the error is caused, i.e. within the function "consensusLetter", the expression >> >> i <- paste(all_letters[col >= threshold], collapse = "") >> >> is evaluated where >> Browse[1]> all_letters >> [1] "A" "C" "G" "T" "B" ## length is 5 >> Browse[1]> col >> [1] 1 0 0 0 ## length is 4 >> Browse[1]> i >> [1] "AB" ## recycling rule was applied >> >> which seems unintended and with some more insight will probably lead to the fixing of a bug. >> >> Best wishes >> Wolfgang >> >> >> > sessionInfo() >> R version 2.12.0 Under development (unstable) (2010-04-06 r51617) >> x86_64-unknown-linux-gnu >> >> locale: >> [1] LC_CTYPE=C LC_NUMERIC=C LC_TIME=C >> [4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=it_IT.UTF-8 >> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices datasets utils methods base >> >> other attached packages: >> [1] Biostrings_2.15.26 IRanges_1.5.74 fortunes_1.3-7 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.7.5 >> >> Heidi Dvinge ha scritto: >>> Hello Erik, >>> >>> unfortunately I'm not familiar with the Biostrings package, so I can't >>> tell you why this doesn't work, but until someone else can answer, this >>> seems to be a work-around. >>> >>> Apparently, consensusString doesn't handle Ns. >>> >>>> test <- DNAStringSet(c("AANN","ACTG")) >>>> consensusString(test) >>> Error in .local(x, ...) : >>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)] >>> >>> If there are no Ns things are okay though. >>> >>>> test2 <- DNAStringSet(c("AAAA","ACTG")) >>>> consensusString(test2) >>> [1] "AMWR" >>> >>> However, Ns seem acceptable if the consensus matrix is calculated first, >>> although they might result in ?s where no consensus could be found. >>> >>>> test3 <- consensusMatrix(test) >>>> consensusString(test3) >>> [1] "A???" >>> >>> HTH >>> \Heidi >>> >>> >>>> Hello, >>>> >>>> I am trying to get a consensus string for a DNAStringSet, but I am getting >>>> an error. The documentation for consensusString says the argument "x" is >>>> either a consensus matrix or an XStringSet. So this should work, right?: >>>>> myDNAStringSet <- DNAStringSet(c("NNNN","ACTG")) >>>>> consensusString(myDNAStringSet) >>>> Error in .local(x, ...) : >>>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)] >>>> >>>> Specifying a threshold in the arguments doesn't seem to make a difference. >>>> >>>> Thanks!, >>>> Erik >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.1 years ago Erik Wright ▴ 210

0

Entering edit mode

Erik, The consensusString weights each input string equally and assigns an equal probability to each of the base letters represented by an ambiguity letter. So to use the examples you mentioned the pseudo arithmetic results in G + R => R 0.5 G + 0.5 R = 0.5 G + 0.5 (0.5 A + 0.5 G) = 0.5 G + 0.25 A + 0.25 G = 0.75 G + 0.25 A => R G + R + G => G 2/3 G + 1/3 R = 2/3 G + 1/3 (1/2 A + 1/2 G) = 2/3 G + 1/6 A + 1/6 G = 5/6 G + 1/6 A => G A + N => A 0.5 A + 0.5 N = 0.5 A + 0.5 (0.25 A + 0.25 C + 0.25 G + 0.25 T) = 0.625 A + 0.125 C + 0.125 G + 0.125 T => A There a little room for growth in the consensusString function. For example, it could accept a non-negative vector of weights so the input strings are not all equally weighted. Anything more complicated than that, however, and I would recommend representing each string in PWM form and then performing weighted/unweighted averages across each of the cells in the PWMs. Patrick On 4/7/10 9:06 AM, Erik Wright wrote: > Hi Patrick, > > Thanks for closing the loop on this issue. I have a question about the outputs you show below: > > If two strings are ACAG and ACAR where R can be A or G, then it makes sense to me that the consensus is ACAR. > > Why then is an N treated differently than an R? If two strings are NNNN and ACTG, where N can be A, C, T, or G, then based on the prior example the output should be NNNN, but the output you show is ACTG. > > If there are three sequences, ACAG, ACAR, and ACAG, and the threshold is set at the default 25% (for a DNAStringSet) then why is the output ACAG? If an R is a C or G, and the other two codons in the final position are G's then the sum is two G's and one C. One in three (33.33...%) being C is greater than the default threshold of 25%, so I would expect the consensus character to be R. Therefore the output should be ACAR. > > Thanks for helping me understand the output. > > Smiles, > Erik > > > > On Apr 6, 2010, at 5:29 PM, Patrick Aboyoun wrote: > > >> Erik, Heidi, and Wolfgang, >> To bring this thread full circle, Biostrings::consensusString didn't support ambiguity letters in input strings for BioC<= 2.5 (R<= 2.10). This support has been added to BioC 2.6 (R 2.11), but as Wolfgang mentioned there was a bug in the code. This bug has now been fixed in BioC 2.6 and will be available for download from bioconductor.org within 36 hours. Here are some examples of consensusString operating on DNAStringSet objects containing ambiguity letters: >> >> >>> consensusString(DNAStringSet(c("NNNN","ACTG"))) >>> >> [1] "ACTG" >> >>> consensusString(DNAStringSet(c("AANN","ACTG"))) >>> >> [1] "AMTG" >> >>> consensusString(DNAStringSet(c("ACAG","ACAR"))) >>> >> [1] "ACAR" >> >>> consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG"))) >>> >> [1] "ACAG" >> >>> sessionInfo() >>> >> R version 2.11.0 alpha (2010-04-04 r51591) >> i386-apple-darwin9.8.0 >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Biostrings_2.15.27 IRanges_1.5.74 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.7.5 >> >> >> >> On 4/6/10 2:36 PM, Wolfgang Huber wrote: >> >>> Hi Erik, Herv'e >>> >>> please always provide the output of sessionInfo(), and a complete reproducible example (you let Heidi and the others guess that you're talking about the Biostrings package). >>> >>> With a more recent version of Biostrings, I get: >>> >>> library("Biostrings") >>> consensusString( DNAStringSet(c("AAAA","ACTG")) ) >>> # [1] "AMWR" >>> >>> consensusString( DNAStringSet(c("AAAB","ACTG")) ) >>> # Error in FUN(newX[, i], ...) : >>> 'ambiguityMap' is missing some combinations of row names >>> >>> And going into the debugger where the error is caused, i.e. within the function "consensusLetter", the expression >>> >>> i<- paste(all_letters[col>= threshold], collapse = "") >>> >>> is evaluated where >>> Browse[1]> all_letters >>> [1] "A" "C" "G" "T" "B" ## length is 5 >>> Browse[1]> col >>> [1] 1 0 0 0 ## length is 4 >>> Browse[1]> i >>> [1] "AB" ## recycling rule was applied >>> >>> which seems unintended and with some more insight will probably lead to the fixing of a bug. >>> >>> Best wishes >>> Wolfgang >>> >>> >>> >>>> sessionInfo() >>>> >>> R version 2.12.0 Under development (unstable) (2010-04-06 r51617) >>> x86_64-unknown-linux-gnu >>> >>> locale: >>> [1] LC_CTYPE=C LC_NUMERIC=C LC_TIME=C >>> [4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=it_IT.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices datasets utils methods base >>> >>> other attached packages: >>> [1] Biostrings_2.15.26 IRanges_1.5.74 fortunes_1.3-7 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.7.5 >>> >>> Heidi Dvinge ha scritto: >>> >>>> Hello Erik, >>>> >>>> unfortunately I'm not familiar with the Biostrings package, so I can't >>>> tell you why this doesn't work, but until someone else can answer, this >>>> seems to be a work-around. >>>> >>>> Apparently, consensusString doesn't handle Ns. >>>> >>>> >>>>> test<- DNAStringSet(c("AANN","ACTG")) >>>>> consensusString(test) >>>>> >>>> Error in .local(x, ...) : >>>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x)> 0)] >>>> >>>> If there are no Ns things are okay though. >>>> >>>> >>>>> test2<- DNAStringSet(c("AAAA","ACTG")) >>>>> consensusString(test2) >>>>> >>>> [1] "AMWR" >>>> >>>> However, Ns seem acceptable if the consensus matrix is calculated first, >>>> although they might result in ?s where no consensus could be found. >>>> >>>> >>>>> test3<- consensusMatrix(test) >>>>> consensusString(test3) >>>>> >>>> [1] "A???" >>>> >>>> HTH >>>> \Heidi >>>> >>>> >>>> >>>>> Hello, >>>>> >>>>> I am trying to get a consensus string for a DNAStringSet, but I am getting >>>>> an error. The documentation for consensusString says the argument "x" is >>>>> either a consensus matrix or an XStringSet. So this should work, right?: >>>>> >>>>>> myDNAStringSet<- DNAStringSet(c("NNNN","ACTG")) >>>>>> consensusString(myDNAStringSet) >>>>>> >>>>> Error in .local(x, ...) : >>>>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x)> 0)] >>>>> >>>>> Specifying a threshold in the arguments doesn't seem to make a difference. >>>>> >>>>> Thanks!, >>>>> Erik >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >

ADD REPLY • link 14.1 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 9.6 years ago

United States

Erik, I should point out that the threshold parameter supplies some flexibility in how unambiguous a base needs to be in the consensus: > consensusString(DNAStringSet(c("A", "N")), threshold = 1e-6) [1] "N" > consensusString(DNAStringSet(c("G", "R", "G")), threshold = 1e-6) [1] "R" > sessionInfo() R version 2.11.0 alpha (2010-04-04 r51591) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.15.27 IRanges_1.5.74 GEOquery_2.11.3 [4] Biobase_2.7.5 RCurl_1.3-1 bitops_1.0-4.1 [7] SRAdb_1.1.10 RSQLite_0.8-4 DBI_0.2-5 [10] codetoolsBioC_0.0.14 loaded via a namespace (and not attached): [1] codetools_0.2-2 Patrick On 4/7/10 9:41 AM, Patrick Aboyoun wrote: > Erik, > The consensusString weights each input string equally and assigns an > equal probability to each of the base letters represented by an > ambiguity letter. So to use the examples you mentioned the pseudo > arithmetic results in > > G + R => R > 0.5 G + 0.5 R = 0.5 G + 0.5 (0.5 A + 0.5 G) = 0.5 G + 0.25 A + 0.25 G > = 0.75 G + 0.25 A => R > > G + R + G => G > 2/3 G + 1/3 R = 2/3 G + 1/3 (1/2 A + 1/2 G) = 2/3 G + 1/6 A + 1/6 G = > 5/6 G + 1/6 A => G > > A + N => A > 0.5 A + 0.5 N = 0.5 A + 0.5 (0.25 A + 0.25 C + 0.25 G + 0.25 T) = > 0.625 A + 0.125 C + 0.125 G + 0.125 T => A > > There a little room for growth in the consensusString function. For > example, it could accept a non-negative vector of weights so the input > strings are not all equally weighted. Anything more complicated than > that, however, and I would recommend representing each string in PWM > form and then performing weighted/unweighted averages across each of > the cells in the PWMs. > > > Patrick > > > On 4/7/10 9:06 AM, Erik Wright wrote: >> Hi Patrick, >> >> Thanks for closing the loop on this issue. I have a question about >> the outputs you show below: >> >> If two strings are ACAG and ACAR where R can be A or G, then it makes >> sense to me that the consensus is ACAR. >> >> Why then is an N treated differently than an R? If two strings are >> NNNN and ACTG, where N can be A, C, T, or G, then based on the prior >> example the output should be NNNN, but the output you show is ACTG. >> >> If there are three sequences, ACAG, ACAR, and ACAG, and the threshold >> is set at the default 25% (for a DNAStringSet) then why is the output >> ACAG? If an R is a C or G, and the other two codons in the final >> position are G's then the sum is two G's and one C. One in three >> (33.33...%) being C is greater than the default threshold of 25%, so >> I would expect the consensus character to be R. Therefore the output >> should be ACAR. >> >> Thanks for helping me understand the output. >> >> Smiles, >> Erik >> >> >> >> On Apr 6, 2010, at 5:29 PM, Patrick Aboyoun wrote: >> >>> Erik, Heidi, and Wolfgang, >>> To bring this thread full circle, Biostrings::consensusString didn't >>> support ambiguity letters in input strings for BioC<= 2.5 (R<= >>> 2.10). This support has been added to BioC 2.6 (R 2.11), but as >>> Wolfgang mentioned there was a bug in the code. This bug has now >>> been fixed in BioC 2.6 and will be available for download from >>> bioconductor.org within 36 hours. Here are some examples of >>> consensusString operating on DNAStringSet objects containing >>> ambiguity letters: >>> >>>> consensusString(DNAStringSet(c("NNNN","ACTG"))) >>> [1] "ACTG" >>>> consensusString(DNAStringSet(c("AANN","ACTG"))) >>> [1] "AMTG" >>>> consensusString(DNAStringSet(c("ACAG","ACAR"))) >>> [1] "ACAR" >>>> consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG"))) >>> [1] "ACAG" >>>> sessionInfo() >>> R version 2.11.0 alpha (2010-04-04 r51591) >>> i386-apple-darwin9.8.0 >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] Biostrings_2.15.27 IRanges_1.5.74 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.7.5 >>> >>> >>> >>> On 4/6/10 2:36 PM, Wolfgang Huber wrote: >>>> Hi Erik, Herv'e >>>> >>>> please always provide the output of sessionInfo(), and a complete >>>> reproducible example (you let Heidi and the others guess that >>>> you're talking about the Biostrings package). >>>> >>>> With a more recent version of Biostrings, I get: >>>> >>>> library("Biostrings") >>>> consensusString( DNAStringSet(c("AAAA","ACTG")) ) >>>> # [1] "AMWR" >>>> >>>> consensusString( DNAStringSet(c("AAAB","ACTG")) ) >>>> # Error in FUN(newX[, i], ...) : >>>> 'ambiguityMap' is missing some combinations of row names >>>> >>>> And going into the debugger where the error is caused, i.e. within >>>> the function "consensusLetter", the expression >>>> >>>> i<- paste(all_letters[col>= threshold], collapse = "") >>>> >>>> is evaluated where >>>> Browse[1]> all_letters >>>> [1] "A" "C" "G" "T" "B" ## length is 5 >>>> Browse[1]> col >>>> [1] 1 0 0 0 ## length is 4 >>>> Browse[1]> i >>>> [1] "AB" ## recycling rule was applied >>>> >>>> which seems unintended and with some more insight will probably >>>> lead to the fixing of a bug. >>>> >>>> Best wishes >>>> Wolfgang >>>> >>>> >>>>> sessionInfo() >>>> R version 2.12.0 Under development (unstable) (2010-04-06 r51617) >>>> x86_64-unknown-linux-gnu >>>> >>>> locale: >>>> [1] LC_CTYPE=C LC_NUMERIC=C LC_TIME=C >>>> [4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=it_IT.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C >>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=C >>>> LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices datasets utils methods base >>>> >>>> other attached packages: >>>> [1] Biostrings_2.15.26 IRanges_1.5.74 fortunes_1.3-7 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] Biobase_2.7.5 >>>> >>>> Heidi Dvinge ha scritto: >>>>> Hello Erik, >>>>> >>>>> unfortunately I'm not familiar with the Biostrings package, so I >>>>> can't >>>>> tell you why this doesn't work, but until someone else can answer, >>>>> this >>>>> seems to be a work-around. >>>>> >>>>> Apparently, consensusString doesn't handle Ns. >>>>> >>>>>> test<- DNAStringSet(c("AANN","ACTG")) >>>>>> consensusString(test) >>>>> Error in .local(x, ...) : >>>>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x)> 0)] >>>>> >>>>> If there are no Ns things are okay though. >>>>> >>>>>> test2<- DNAStringSet(c("AAAA","ACTG")) >>>>>> consensusString(test2) >>>>> [1] "AMWR" >>>>> >>>>> However, Ns seem acceptable if the consensus matrix is calculated >>>>> first, >>>>> although they might result in ?s where no consensus could be found. >>>>> >>>>>> test3<- consensusMatrix(test) >>>>>> consensusString(test3) >>>>> [1] "A???" >>>>> >>>>> HTH >>>>> \Heidi >>>>> >>>>> >>>>>> Hello, >>>>>> >>>>>> I am trying to get a consensus string for a DNAStringSet, but I >>>>>> am getting >>>>>> an error. The documentation for consensusString says the >>>>>> argument "x" is >>>>>> either a consensus matrix or an XStringSet. So this should work, >>>>>> right?: >>>>>>> myDNAStringSet<- DNAStringSet(c("NNNN","ACTG")) >>>>>>> consensusString(myDNAStringSet) >>>>>> Error in .local(x, ...) : >>>>>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x)> 0)] >>>>>> >>>>>> Specifying a threshold in the arguments doesn't seem to make a >>>>>> difference. >>>>>> >>>>>> Thanks!, >>>>>> Erik >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at stat.math.ethz.ch >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.1 years ago Patrick Aboyoun ★ 1.6k

Login before adding your answer.