DNASuffixArray
1
0
Entering edit mode
@gabriel-valiente-3007
Last seen 9.6 years ago
I post it again, in the hope Herve Pages is already back. Thanks, Gabriel > The guy who is most on top of this is undoubtedly Herve Pages from > FHCRC, since he wrote it. He is on vacation until Sep. 1st. Of > course, there may be other people with details on this. > > Kasper > > On Aug 29, 2008, at 10:35 AM, Gabriel Valiente wrote: > >> Dear all, >> >> there was a DNASuffixArray function call to C code (biostring.c) >> in package >> >> Biostrings 1.1.1 (BioConductor 1.6) >> Biostrings 1.4.0 (BioConductor 1.7) >> Biostrings 2.0.3 (BioConductor 1.8) >> Biostrings 2.2.1 (BioConductor 1.9) >> >> but then, still a function call but no C code anymore, in package >> >> Biostrings 2.4.8 (BioConductor 2.0) >> Biostrings 2.6.6 (BioConductor 2.1) >> Biostrings 2.8.18 (BioConductor 2.2) >> Biostrings 2.9.66 (BioConductor 2.3) >> >> Can someone please briefly explain what happened to >> DNASuffixArray? Beside being interested in using it, I would also >> like to properly cite this implementation in a forthcoming book >> and thus, need to be sure about any details. Thanks, >> >> Gabriel >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/ >> gmane.science.biology.informatics.conductor
Biostrings Biostrings • 680 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 23 hours ago
Seattle, WA, United States
Hi Gabriel, Yes the DNASuffixArray() function has been put aside during the transition from Biostrings 1 to Biostrings 2. When I took over the package as the main maintainer in 2005, I redesigned the whole class hierarchy and ended up having to rewrite almost everything to make things work with the new infrastructure. Biostrings 1 had a single R file (String.R) that I kept around in Biostrings 2 (in the Biostrings1/ folder): it serves as a reference for what functionalities were available in Biostrings 1 and what still needs to be ported to the 2 series. DNASuffixArray() and LongestCommonPrefix() are in fact the only functionalities that are still missing. I never took the time to put them back in the 2 series, sorry. I guess my focus was on other functionalities: the development of the package is mostly driven by people's most urgent needs and nobody asked about DNASuffixArray() so far. Maybe because there are other packages from CRAN or the Omegahat projects that are dedicated to Suffix Trees and Suffix Arrays. Anyway, with today's Biostrings much revised code, you can build the suffix array of a DNA sequence in one line: sort(DNAStringSet(views(x0, start=seq_len(nchar(x0))))) Surprisingly this is even faster than DNASuffixArray() from Biostrings 1 which was entirely implemented in C (it used a fancy LSD-first sorting algo). With a 10M-bp random sequence: x0 <- DNAString(paste(sample(c("A", "C", "G", "T"), 10000000, replace=TRUE), collapse="")) With R-2.7.1 + Biostrings 1.4.0: sa1 <- DNASuffixArray(x0) ==> Takes: 327 sec. on a laptop with 1GB / 242 sec. on a 64-bit Linux server With R-2.8 + Biostrings 2.9.69: sa2 <- sort(DNAStringSet(views(x0, start=seq_len(nchar(x0))))) ==> Takes: 77 sec. on a laptop with 1GB / 24.5 sec. on a 64-bit Linux server So depending on the hardware, the Biostrings 2 solution is between 4 and 10 times faster than DNASuffixArray(). Note that: o The "sort" method for XStringSet objects was just added yesterday to the devel version of Biostrings (2.9.69). It uses qsort() from the standard C lib. Other related methods ("duplicated", "unique", "order", etc) will follow soon. o The order defined between DNA sequences has changed between Biostrings 1 and Biostrings 2: in Biostrings 1: T < G < C < A in Biostrings 2: A < C < G < T so elements in 'sa1' and 'sa2' are ordered differently. o The sort-based solution is not limited to DNA sequences (DNAString objects): it also works with BString, RNAString and AAString objects. I'll put DNASuffixArray() and LongestCommonPrefix() back in the package in the next days. Then all Biostrings 1 functionalities will be back and I will finally be able to remove the Biostrings1/ folder. Cheers, H. Gabriel Valiente wrote: > I post it again, in the hope Herve Pages is already back. Thanks, > > Gabriel > >> The guy who is most on top of this is undoubtedly Herve Pages from >> FHCRC, since he wrote it. He is on vacation until Sep. 1st. Of course, >> there may be other people with details on this. >> >> Kasper >> >> On Aug 29, 2008, at 10:35 AM, Gabriel Valiente wrote: >> >>> Dear all, >>> >>> there was a DNASuffixArray function call to C code (biostring.c) in >>> package >>> >>> Biostrings 1.1.1 (BioConductor 1.6) >>> Biostrings 1.4.0 (BioConductor 1.7) >>> Biostrings 2.0.3 (BioConductor 1.8) >>> Biostrings 2.2.1 (BioConductor 1.9) >>> >>> but then, still a function call but no C code anymore, in package >>> >>> Biostrings 2.4.8 (BioConductor 2.0) >>> Biostrings 2.6.6 (BioConductor 2.1) >>> Biostrings 2.8.18 (BioConductor 2.2) >>> Biostrings 2.9.66 (BioConductor 2.3) >>> >>> Can someone please briefly explain what happened to DNASuffixArray? >>> Beside being interested in using it, I would also like to properly >>> cite this implementation in a forthcoming book and thus, need to be >>> sure about any details. Thanks, >>> >>> Gabriel >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT

Login before adding your answer.

Traffic: 838 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6