Fisher-Yates algorithm for DNA shuffling ?

0

Entering edit mode

Robert Castelo ★ 3.4k

@rcastelo

Last seen 5 months ago

Barcelona/Universitat Pompeu Fabra

dear list, i'm interested in using some proper DNA shuffling procedure like the Fisher-Yates algorithm: R. A. Fisher and F. Yates, Example 12, Statistical Tables, London, 1938. Richard Durstenfeld, Algorithm 235: Random permutation, CACM 7(7):420, July 1964. i've been googling bioconductor and the r-project for this, and particularly looking at the Biostrings package, but I've failed to find anything. just in case i'm missing some implementation of this DNA shuffling in R, i'd like to ask the list whether anybody knows of that procedure being implemented in R or, even better, in a bioconductor package such that would work with DNAString objects. the reason for that is that i would like to simulate random DNA strings preserving nucleotide composition in order to calculate empirical p-values for motif scanning. if it is not yet implemented i would like to suggest the package maintainers of Biostrings or any other biological sequence infrastructure package in Bioconductor to have it implemented in C for having maximum speed with this kind of simulations. just in case it eases my request here goes some simple R code implementing the Fisher-Yates shuffling algorithm: fy <- function(seq) { seq <- unlist(strsplit(seq, "")) n <- length(seq) i <- n while (i > 0) { j <- floor(runif(1)*i+1) if (i != j) { tmp <- seq[i] seq[i] <- seq[j] seq[j] <- tmp } i <- i - 1 } paste(seq, collapse="") } and this is an example of how to use it: # generate some random DNA sequence of length 3 seq <- paste(sample(c("A","C","G","T"), size=3, replace=TRUE), collapse="") seq [1] "ATG" # shuffle the sequence 10,000 times and store all the permutations shuffledseqs <- c() for (i in 1:10000) shuffledseqs <- c(shuffledseqs, fy(seq)) # verify that indeed the permutations occur uniformly at random base::table(shuffledseqs)/10000 shuffledseqs AGT ATG GAT GTA TAG TGA 0.1693 0.1682 0.1678 0.1629 0.1610 0.1708 1/6 [1] 0.1666667 thanks! robert.

Biostrings Biostrings • 3.1k views

ADD COMMENT • link updated 16.6 years ago by Hervé Pagès 16k • written 16.6 years ago by Robert Castelo ★ 3.4k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 days ago

Seattle, WA, United States

Hi Robert, Robert Castelo wrote: > dear list, > > i'm interested in using some proper DNA shuffling procedure like the > Fisher-Yates algorithm: > > R. A. Fisher and F. Yates, Example 12, Statistical Tables, London, 1938. > > Richard Durstenfeld, Algorithm 235: Random permutation, CACM 7(7):420, > July 1964. The Fisher & Yates or Durstenfeld algos are just particular implementations of a ramdom permutation generator. R has the sample() function to generate a random permutation. The man page doesn't specify what implementation is used but does that really matter? > > i've been googling bioconductor and the r-project for this, and > particularly looking at the Biostrings package, but I've failed to find > anything. just in case i'm missing some implementation of this DNA > shuffling in R, i'd like to ask the list whether anybody knows of that > procedure being implemented in R or, even better, in a bioconductor > package such that would work with DNAString objects. > > the reason for that is that i would like to simulate random DNA strings > preserving nucleotide composition in order to calculate empirical > p-values for motif scanning. Generate a random DNA string (of length 1000) with specific nucleotide probabilities: x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE, prob=c(0.2, 0.55, 0.1, 0.15)), collapse="") x <- DNAString(x) Then: > alphabetFrequency(x, baseOnly=TRUE) A C G T other 214 531 110 145 0 If you really want to shuffle a given DNAString object x0 in order to preserve its nucleotide composition: x1 <- x0[sample(length(x0))] For example, shuffling the 'x' obtained above gives: x1 <- x[sample(length(x))] > alphabetFrequency(x1, baseOnly=TRUE) A C G T other 214 531 110 145 0 Performance: > system.time(for (i in 1:10000) {y <- x[sample(length(x))]}) user system elapsed 5.760 0.008 5.764 Everything is already here and probably fast enough. No need to re-implement anything. Cheers, H. > > if it is not yet implemented i would like to suggest the package > maintainers of Biostrings or any other biological sequence > infrastructure package in Bioconductor to have it implemented in C for > having maximum speed with this kind of simulations. just in case it > eases my request here goes some simple R code implementing the > Fisher-Yates shuffling algorithm: > > fy <- function(seq) { > seq <- unlist(strsplit(seq, "")) > n <- length(seq) > i <- n > while (i > 0) { > j <- floor(runif(1)*i+1) > if (i != j) { > tmp <- seq[i] > seq[i] <- seq[j] > seq[j] <- tmp > } > i <- i - 1 > } > paste(seq, collapse="") > } > > and this is an example of how to use it: > > # generate some random DNA sequence of length 3 > seq <- paste(sample(c("A","C","G","T"), size=3, replace=TRUE), collapse="") > seq > [1] "ATG" > > # shuffle the sequence 10,000 times and store all the permutations > shuffledseqs <- c() > for (i in 1:10000) shuffledseqs <- c(shuffledseqs, fy(seq)) > > # verify that indeed the permutations occur uniformly at random > base::table(shuffledseqs)/10000 > shuffledseqs > AGT ATG GAT GTA TAG TGA > 0.1693 0.1682 0.1678 0.1629 0.1610 0.1708 > 1/6 > [1] 0.1666667 > > > thanks! > robert. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 16.6 years ago Hervé Pagès 16k

0

Entering edit mode

2009/6/23 HervÃ© PagÃ¨s <hpages@fhcrc.org> > Hi Robert, > > Robert Castelo wrote: > >> dear list, >> >> i'm interested in using some proper DNA shuffling procedure like the >> Fisher-Yates algorithm: >> >> R. A. Fisher and F. Yates, Example 12, Statistical Tables, London, 1938. >> >> Richard Durstenfeld, Algorithm 235: Random permutation, CACM 7(7):420, >> July 1964. >> > > The Fisher & Yates or Durstenfeld algos are just particular implementations > of a ramdom permutation generator. > R has the sample() function to generate a random permutation. The man page > doesn't specify what implementation is used but does that really matter? > > >> i've been googling bioconductor and the r-project for this, and >> particularly looking at the Biostrings package, but I've failed to find >> anything. just in case i'm missing some implementation of this DNA >> shuffling in R, i'd like to ask the list whether anybody knows of that >> procedure being implemented in R or, even better, in a bioconductor >> package such that would work with DNAString objects. >> >> the reason for that is that i would like to simulate random DNA strings >> preserving nucleotide composition in order to calculate empirical >> p-values for motif scanning. >> > I guess it is obvious, but note that using random sequences based on single nucleotide frequencies may give you a more significant p-value (that is, it may be anti-conservative) than if you model the background more robustly using several nucleotides and taking into account genome differences in base content. There are a fair number of papers describing HMMs or other models that attempt to capture the underlying background model for motif scanning. Sean > > Generate a random DNA string (of length 1000) with specific > nucleotide probabilities: > > x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE, > prob=c(0.2, 0.55, 0.1, 0.15)), collapse="") > x <- DNAString(x) > > Then: > > > alphabetFrequency(x, baseOnly=TRUE) > A C G T other > 214 531 110 145 0 > > If you really want to shuffle a given DNAString object x0 in order to > preserve its nucleotide composition: > > x1 <- x0[sample(length(x0))] > > For example, shuffling the 'x' obtained above gives: > > x1 <- x[sample(length(x))] > > > alphabetFrequency(x1, baseOnly=TRUE) > A C G T other > 214 531 110 145 0 > > Performance: > > > system.time(for (i in 1:10000) {y <- x[sample(length(x))]}) > user system elapsed > 5.760 0.008 5.764 > > Everything is already here and probably fast enough. No need to > re-implement anything. > > Cheers, > > H. > > > >> if it is not yet implemented i would like to suggest the package >> maintainers of Biostrings or any other biological sequence >> infrastructure package in Bioconductor to have it implemented in C for >> having maximum speed with this kind of simulations. just in case it >> eases my request here goes some simple R code implementing the >> Fisher-Yates shuffling algorithm: >> >> fy <- function(seq) { >> seq <- unlist(strsplit(seq, "")) >> n <- length(seq) >> i <- n >> while (i > 0) { >> j <- floor(runif(1)*i+1) >> if (i != j) { >> tmp <- seq[i] >> seq[i] <- seq[j] >> seq[j] <- tmp >> } >> i <- i - 1 >> } >> paste(seq, collapse="") >> } >> >> and this is an example of how to use it: >> >> # generate some random DNA sequence of length 3 >> seq <- paste(sample(c("A","C","G","T"), size=3, replace=TRUE), >> collapse="") >> seq >> [1] "ATG" >> >> # shuffle the sequence 10,000 times and store all the permutations >> shuffledseqs <- c() >> for (i in 1:10000) shuffledseqs <- c(shuffledseqs, fy(seq)) >> >> # verify that indeed the permutations occur uniformly at random >> base::table(shuffledseqs)/10000 >> shuffledseqs >> AGT ATG GAT GTA TAG TGA 0.1693 0.1682 0.1678 0.1629 >> 0.1610 0.1708 1/6 >> [1] 0.1666667 >> >> >> thanks! >> robert. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > HervÃ© PagÃ¨s > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 16.6 years ago Sean Davis 21k

0

Entering edit mode

hi Sean, On Tue, 2009-06-23 at 19:46 -0400, Sean Davis wrote: > > > 2009/6/23 Hervé Pagès <hpages at="" fhcrc.org=""> > Hi Robert, > > > Robert Castelo wrote: > dear list, > > i'm interested in using some proper DNA shuffling > procedure like the > Fisher-Yates algorithm: > > R. A. Fisher and F. Yates, Example 12, Statistical > Tables, London, 1938. > > Richard Durstenfeld, Algorithm 235: Random > permutation, CACM 7(7):420, > July 1964. > > > The Fisher & Yates or Durstenfeld algos are just particular > implementations > of a ramdom permutation generator. > R has the sample() function to generate a random permutation. > The man page > doesn't specify what implementation is used but does that > really matter? > > > > i've been googling bioconductor and the r-project for > this, and > particularly looking at the Biostrings package, but > I've failed to find > anything. just in case i'm missing some implementation > of this DNA > shuffling in R, i'd like to ask the list whether > anybody knows of that > procedure being implemented in R or, even better, in a > bioconductor > package such that would work with DNAString objects. > > the reason for that is that i would like to simulate > random DNA strings > preserving nucleotide composition in order to > calculate empirical > p-values for motif scanning. > > I guess it is obvious, but note that using random sequences based on > single nucleotide frequencies may give you a more significant p-value > (that is, it may be anti-conservative) than if you model the > background more robustly using several nucleotides and taking into > account genome differences in base content. There are a fair number > of papers describing HMMs or other models that attempt to capture the > underlying background model for motif scanning. thanks for pointing that out, i'd like definitely to try one of those approaches. so far, i'm shuffling dinucleotides and my p-values haven't changed much (a little more conservative, but not much more, the bulk of more significant ones is below more or less the same level). i assume because you haven't say it, that there's none of those more sophisticated possibly-HMM-based approaches implemented in bioconductor, right? i think it could be nice to have something like that next-to/within-in the matchPattern() or countPattern() functions from Biostrings, but this is just one opinion of an end-user and i don't pretend to start a discussion on software design and requirements. i've googled a bit and found lots of links to papers including keywords like "HMM", "background" and "motif scanning" but it's difficult for me to tell which one looks more promising to have some algorithm properly described or, even better, some available implementation. do you have any particular suggestion? thanks! robert. > Sean > > > > Generate a random DNA string (of length 1000) with specific > nucleotide probabilities: > > x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE, > prob=c(0.2, 0.55, 0.1, 0.15)), collapse="") > x <- DNAString(x) > > Then: > > > alphabetFrequency(x, baseOnly=TRUE) > A C G T other > 214 531 110 145 0 > > If you really want to shuffle a given DNAString object x0 in > order to > preserve its nucleotide composition: > > x1 <- x0[sample(length(x0))] > > For example, shuffling the 'x' obtained above gives: > > x1 <- x[sample(length(x))] > > > alphabetFrequency(x1, baseOnly=TRUE) > A C G T other > 214 531 110 145 0 > > Performance: > > > system.time(for (i in 1:10000) {y <- x[sample(length(x))]}) > user system elapsed > 5.760 0.008 5.764 > > Everything is already here and probably fast enough. No need > to > re-implement anything. > > Cheers, > > > H. > > > > if it is not yet implemented i would like to suggest > the package > maintainers of Biostrings or any other biological > sequence > infrastructure package in Bioconductor to have it > implemented in C for > having maximum speed with this kind of simulations. > just in case it > eases my request here goes some simple R code > implementing the > Fisher-Yates shuffling algorithm: > > fy <- function(seq) { > seq <- unlist(strsplit(seq, "")) > n <- length(seq) > i <- n > while (i > 0) { > j <- floor(runif(1)*i+1) > if (i != j) { > tmp <- seq[i] > seq[i] <- seq[j] > seq[j] <- tmp > } > i <- i - 1 > } > paste(seq, collapse="") > } > > and this is an example of how to use it: > > # generate some random DNA sequence of length 3 > seq <- paste(sample(c("A","C","G","T"), size=3, > replace=TRUE), collapse="") > seq > [1] "ATG" > > # shuffle the sequence 10,000 times and store all the > permutations > shuffledseqs <- c() > for (i in 1:10000) shuffledseqs <- c(shuffledseqs, > fy(seq)) > > # verify that indeed the permutations occur uniformly > at random > base::table(shuffledseqs)/10000 > shuffledseqs > AGT ATG GAT GTA TAG TGA 0.1693 0.1682 > 0.1678 0.1629 0.1610 0.1708 1/6 > [1] 0.1666667 > > > thanks! > robert. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 16.6 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Hi Robert, I sometimes permute the order of the motif and scan the original DNA sequence with that (for lots of permutations of the motif). This may actually correspond to a permutation of the DNA sequence at the single nucleotide level. Of course you could use HMM approaches at higher levls for the motif as well. I'm not sure how shuffling the motif plus scanning (multiple times) compares computationally to permuting the DNA sequence once. If possible you might want to consider looking at conservation of the motif matches. Given appropriate evolutionary models for the sequence this might represent a more solid basis for getting to the significance of motifs. Never did this myself though ... Ludo On Thu, Jun 25, 2009 at 05:14:33PM +0200, Robert Castelo wrote: > hi Sean, > > On Tue, 2009-06-23 at 19:46 -0400, Sean Davis wrote: > > > > > > 2009/6/23 Hervé Pagès <hpages at="" fhcrc.org=""> > > Hi Robert, > > > > > > Robert Castelo wrote: > > dear list, > > > > i'm interested in using some proper DNA shuffling > > procedure like the > > Fisher-Yates algorithm: > > > > R. A. Fisher and F. Yates, Example 12, Statistical > > Tables, London, 1938. > > > > Richard Durstenfeld, Algorithm 235: Random > > permutation, CACM 7(7):420, > > July 1964. > > > > > > The Fisher & Yates or Durstenfeld algos are just particular > > implementations > > of a ramdom permutation generator. > > R has the sample() function to generate a random permutation. > > The man page > > doesn't specify what implementation is used but does that > > really matter? > > > > > > > > i've been googling bioconductor and the r-project for > > this, and > > particularly looking at the Biostrings package, but > > I've failed to find > > anything. just in case i'm missing some implementation > > of this DNA > > shuffling in R, i'd like to ask the list whether > > anybody knows of that > > procedure being implemented in R or, even better, in a > > bioconductor > > package such that would work with DNAString objects. > > > > the reason for that is that i would like to simulate > > random DNA strings > > preserving nucleotide composition in order to > > calculate empirical > > p-values for motif scanning. > > > > I guess it is obvious, but note that using random sequences based on > > single nucleotide frequencies may give you a more significant p-value > > (that is, it may be anti-conservative) than if you model the > > background more robustly using several nucleotides and taking into > > account genome differences in base content. There are a fair number > > of papers describing HMMs or other models that attempt to capture the > > underlying background model for motif scanning. > > thanks for pointing that out, i'd like definitely to try one of those > approaches. so far, i'm shuffling dinucleotides and my p-values haven't > changed much (a little more conservative, but not much more, the bulk of > more significant ones is below more or less the same level). > > i assume because you haven't say it, that there's none of those more > sophisticated possibly-HMM-based approaches implemented in bioconductor, > right? i think it could be nice to have something like that > next-to/within-in the matchPattern() or countPattern() functions from > Biostrings, but this is just one opinion of an end-user and i don't > pretend to start a discussion on software design and requirements. > > i've googled a bit and found lots of links to papers including keywords > like "HMM", "background" and "motif scanning" but it's difficult for me > to tell which one looks more promising to have some algorithm properly > described or, even better, some available implementation. do you have > any particular suggestion? > > thanks! > robert. > > > Sean > > > > > > > > Generate a random DNA string (of length 1000) with specific > > nucleotide probabilities: > > > > x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE, > > prob=c(0.2, 0.55, 0.1, 0.15)), collapse="") > > x <- DNAString(x) > > > > Then: > > > > > alphabetFrequency(x, baseOnly=TRUE) > > A C G T other > > 214 531 110 145 0 > > > > If you really want to shuffle a given DNAString object x0 in > > order to > > preserve its nucleotide composition: > > > > x1 <- x0[sample(length(x0))] > > > > For example, shuffling the 'x' obtained above gives: > > > > x1 <- x[sample(length(x))] > > > > > alphabetFrequency(x1, baseOnly=TRUE) > > A C G T other > > 214 531 110 145 0 > > > > Performance: > > > > > system.time(for (i in 1:10000) {y <- x[sample(length(x))]}) > > user system elapsed > > 5.760 0.008 5.764 > > > > Everything is already here and probably fast enough. No need > > to > > re-implement anything. > > > > Cheers, > > > > > > H. > > > > > > > > if it is not yet implemented i would like to suggest > > the package > > maintainers of Biostrings or any other biological > > sequence > > infrastructure package in Bioconductor to have it > > implemented in C for > > having maximum speed with this kind of simulations. > > just in case it > > eases my request here goes some simple R code > > implementing the > > Fisher-Yates shuffling algorithm: > > > > fy <- function(seq) { > > seq <- unlist(strsplit(seq, "")) > > n <- length(seq) > > i <- n > > while (i > 0) { > > j <- floor(runif(1)*i+1) > > if (i != j) { > > tmp <- seq[i] > > seq[i] <- seq[j] > > seq[j] <- tmp > > } > > i <- i - 1 > > } > > paste(seq, collapse="") > > } > > > > and this is an example of how to use it: > > > > # generate some random DNA sequence of length 3 > > seq <- paste(sample(c("A","C","G","T"), size=3, > > replace=TRUE), collapse="") > > seq > > [1] "ATG" > > > > # shuffle the sequence 10,000 times and store all the > > permutations > > shuffledseqs <- c() > > for (i in 1:10000) shuffledseqs <- c(shuffledseqs, > > fy(seq)) > > > > # verify that indeed the permutations occur uniformly > > at random > > base::table(shuffledseqs)/10000 > > shuffledseqs > > AGT ATG GAT GTA TAG TGA 0.1693 0.1682 > > 0.1678 0.1629 0.1610 0.1708 1/6 > > [1] 0.1666667 > > > > > > thanks! > > robert. > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > -- > > Hervé Pagès > > > > Program in Computational Biology > > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N, M2-B876 > > P.O. Box 19024 > > Seattle, WA 98109-1024 > > > > E-mail: hpages at fhcrc.org > > Phone: (206) 667-5791 > > Fax: (206) 667-1319 > > > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 16.6 years ago Ludo Pagie ▴ 20

0

Entering edit mode

hi Herv?, On Tue, 2009-06-23 at 16:05 -0700, Hervé Pagès wrote: > Hi Robert, > > Robert Castelo wrote: > > dear list, > > > > i'm interested in using some proper DNA shuffling procedure like the > > Fisher-Yates algorithm: > > > > R. A. Fisher and F. Yates, Example 12, Statistical Tables, London, 1938. > > > > Richard Durstenfeld, Algorithm 235: Random permutation, CACM 7(7):420, > > July 1964. > > The Fisher & Yates or Durstenfeld algos are just particular implementations > of a ramdom permutation generator. > R has the sample() function to generate a random permutation. The man page > doesn't specify what implementation is used but does that really matter? ups, you're right, i was so focused in finding something specific for DNA in Biostrings that i forgot to try to use the sample() function, how embarrassing.. X-p thanks! robert. > > > > i've been googling bioconductor and the r-project for this, and > > particularly looking at the Biostrings package, but I've failed to find > > anything. just in case i'm missing some implementation of this DNA > > shuffling in R, i'd like to ask the list whether anybody knows of that > > procedure being implemented in R or, even better, in a bioconductor > > package such that would work with DNAString objects. > > > > the reason for that is that i would like to simulate random DNA strings > > preserving nucleotide composition in order to calculate empirical > > p-values for motif scanning. > > Generate a random DNA string (of length 1000) with specific > nucleotide probabilities: > > x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE, > prob=c(0.2, 0.55, 0.1, 0.15)), collapse="") > x <- DNAString(x) > > Then: > > > alphabetFrequency(x, baseOnly=TRUE) > A C G T other > 214 531 110 145 0 > > If you really want to shuffle a given DNAString object x0 in order to > preserve its nucleotide composition: > > x1 <- x0[sample(length(x0))] > > For example, shuffling the 'x' obtained above gives: > > x1 <- x[sample(length(x))] > > > alphabetFrequency(x1, baseOnly=TRUE) > A C G T other > 214 531 110 145 0 > > Performance: > > > system.time(for (i in 1:10000) {y <- x[sample(length(x))]}) > user system elapsed > 5.760 0.008 5.764 > > Everything is already here and probably fast enough. No need to > re-implement anything. > > Cheers, > H. > > > > > > if it is not yet implemented i would like to suggest the package > > maintainers of Biostrings or any other biological sequence > > infrastructure package in Bioconductor to have it implemented in C for > > having maximum speed with this kind of simulations. just in case it > > eases my request here goes some simple R code implementing the > > Fisher-Yates shuffling algorithm: > > > > fy <- function(seq) { > > seq <- unlist(strsplit(seq, "")) > > n <- length(seq) > > i <- n > > while (i > 0) { > > j <- floor(runif(1)*i+1) > > if (i != j) { > > tmp <- seq[i] > > seq[i] <- seq[j] > > seq[j] <- tmp > > } > > i <- i - 1 > > } > > paste(seq, collapse="") > > } > > > > and this is an example of how to use it: > > > > # generate some random DNA sequence of length 3 > > seq <- paste(sample(c("A","C","G","T"), size=3, replace=TRUE), collapse="") > > seq > > [1] "ATG" > > > > # shuffle the sequence 10,000 times and store all the permutations > > shuffledseqs <- c() > > for (i in 1:10000) shuffledseqs <- c(shuffledseqs, fy(seq)) > > > > # verify that indeed the permutations occur uniformly at random > > base::table(shuffledseqs)/10000 > > shuffledseqs > > AGT ATG GAT GTA TAG TGA > > 0.1693 0.1682 0.1678 0.1629 0.1610 0.1708 > > 1/6 > > [1] 0.1666667 > > > > > > thanks! > > robert. > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 16.6 years ago Robert Castelo ★ 3.4k

Login before adding your answer.