Getting the length of every element from a large CompressedIRangesList is slow
@delhommeemblde-3232
Hej!

I have a rather large CompressedIRangesList:

> print(object.size(aln.ranges), unit="Mb")
390.4 Mb

It has 2518 elements, some of which contain up to 6M ranges, for a total of 51M ranges. The vast majority of elements are small: the median length is 2, while the mean is ~20,000 (the 3rd quartile is 47).

Retrieving the length of each element is slow:

> system.time(sizes <- sapply(aln.ranges, length))
   user  system elapsed
265.777 169.222 443.498

This is slow compared with the performance of the IRanges package in general, which surprised me. Is there a faster way to get this information than the sapply() call I'm using? Note that the machine I'm using is not a limiting factor in terms of CPU/RAM/load.

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] IRanges_1.15.15    BiocGenerics_0.3.0

loaded via a namespace (and not attached):
[1] stats4_2.15.1

Nico

P.S. If you need it, I can send my aln.ranges object off-list.

---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
@delhommeemblde-3232
Hi,

Just to extend on my previous message, doing this instead is fast:

> system.time(sizes <- sapply(width(aln.ranges), length))
   user  system elapsed
  1.109   0.144   1.254

Cheers,

Nico

On Jul 2, 2012, at 7:02 PM, Nicolas Delhomme wrote:
> Retrieving the element length is slow:
>
> > system.time(sizes <- sapply(aln.ranges,length))
>    user  system elapsed
> 265.777 169.222 443.498

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi Nico,

Even faster:

> system.time(sizes <- elementLengths(exbytx))
   user  system elapsed
  0.000   0.000   0.001

Note that you can use elementLengths() on any list-like object ("list-like" = list or List class or subclass):

> x <- rep(list(a=1:4, b=letters), 500000)
> length(x)
[1] 1000000
> system.time(x_eltlens <- sapply(x, length))
   user  system elapsed
  3.132   0.008   3.142
> system.time(x_eltlens2 <- elementLengths(x))
   user  system elapsed
  0.024   0.000   0.023
> identical(x_eltlens, x_eltlens2)
[1] TRUE

HTH,
H.

On 07/02/2012 10:18 AM, Nicolas Delhomme wrote:
> Doing this instead is fast:
>
> > system.time(sizes <- sapply(width(aln.ranges),length))
>    user  system elapsed
>   1.109   0.144   1.254

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
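For readers on a current R installation, the plain-list half of the comparison above can be reproduced without Bioconductor. This is only a sketch under assumptions that are not part of the thread: it uses base R's lengths() (available since R 3.2.0) rather than elementLengths(), and a smaller list than the one timed above. Also note that, as far as I know, elementLengths() was later renamed elementNROWS() in the S4Vectors package, so on recent Bioconductor releases that is the name to look for.

```r
# Base R only; no Bioconductor packages are needed for this sketch.
# Assumption: R >= 3.2.0, where lengths() exists in base.
x <- rep(list(a = 1:4, b = letters), 5000)

# Element-by-element approach, as in the original question
len_sapply <- sapply(x, length)

# Vectorised alternative: one call computes every element length
len_vec <- lengths(x)

# Both approaches agree on the values
stopifnot(all(len_sapply == len_vec))
head(len_vec, 2)
```

The point is the same as in the answer above: asking for all element lengths in one vectorised call avoids extracting each element in turn.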
That's great! Thanks Hervé.

I remember seeing that in a thread on the mailing list, but couldn't recall it. And I couldn't find it in the documentation. Could it be made more obvious by being added to the IRangesList Rd page, as part of the "see also" section, as well as in the IRangesList-utils Rd page? That would be great too :-)

Cheers,

Nico

On Jul 2, 2012, at 8:25 PM, Hervé Pagès wrote:
> Even faster:
>
> > system.time(sizes <- elementLengths(exbytx))
>    user  system elapsed
>   0.000   0.000   0.001
Nico,

On 07/03/2012 12:36 AM, Nicolas Delhomme wrote:
> That's great! Thanks Hervé.
>
> I remember seeing that in a thread on the mailing list, but couldn't recall it. And I couldn't find it in the documentation. Could it be made more obvious by being added to the IRangesList Rd page, as part of the "see also" section, as well as in the IRangesList-utils Rd page? That would be great too :-)

Good point. The elementLengths() generic is documented in the man page for List because, like [[, elementType(), lapply(), endoapply(), etc., it is basic functionality of any List object, i.e. of any object that belongs to a concrete subclass of List. Note that there are more than 90 List subclasses defined in the IRanges package. Each subclass inherits all the methods defined for its parent classes and defines its own specific generics/methods. IRangesList derives from List via RangesList:

  List <-- RangesList <-- IRangesList

What was missing was a "see also" section in the man page for the RangesList class that points to the man page for List. I just added it in IRanges 1.15.19. Hopefully that will make it easier for users to discover elementLengths() as well as the other basic List functionality.

Cheers,
H.
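The inheritance argument above can be illustrated with a small S4 sketch. Everything in it is hypothetical: the classes ToyList, ToyRangesList and ToyIRangesList and the generic toyElementLengths() are stand-ins invented for this example, not the actual IRanges definitions. It only shows why a method documented on the top-level class is also usable on a subclass two levels down.

```r
library(methods)

# Hypothetical toy hierarchy mirroring List <-- RangesList <-- IRangesList.
setClass("ToyList", representation(elements = "list"))
setClass("ToyRangesList", contains = "ToyList")
setClass("ToyIRangesList", contains = "ToyRangesList")

# The generic and its only method are defined for the top-level class.
setGeneric("toyElementLengths",
           function(x) standardGeneric("toyElementLengths"))
setMethod("toyElementLengths", "ToyList",
          function(x) lengths(x@elements))

# An object of the most derived class still dispatches to that method.
obj <- new("ToyIRangesList", elements = list(a = 1:3, b = integer(0)))
res <- toyElementLengths(obj)
res

# The object "is a" ToyList through the inheritance chain.
is(obj, "ToyList")
```

This is why the documentation for such a method naturally lives on the parent class's man page, and why a "see also" link from the subclass page helps users find it.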