GRanges performance issue
@arnemuellernovartiscom-2205
Switzerland
Hello,

I realized there's a massive performance difference between subsetting GRanges objects by name and using the GRanges subset method.

Example:

> length(mm9.tiled)
[1] 5309835
> n = names(mm9.tiled)
> rn = sample(n, 1000)
> system.time(tmp <- subset(mm9.tiled, names(mm9.tiled) %in% rn))
   user  system elapsed
  1.610   0.131   1.741
> system.time(tmp <- mm9.tiled[rn])
   user  system elapsed
 72.793   0.167  72.976

> sessionInfo()
R version 2.14.0 Under development (unstable) (2011-06-01 r56028)
Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] GenomicRanges_1.5.12 IRanges_1.11.10

loaded via a namespace (and not attached):
[1] tools_2.14.0

Is this a known (wanted?) behavior?

Regards,

Arne
@vincent-j-carey-jr-4
United States
I don't think this is a "wanted" behavior, but the two computations are fairly different. I believe you don't have to use subset(): as long as you know the indices you want, numerically or logically, the bracket should work just as fast.

On Thu, Jul 7, 2011 at 11:45 AM, Mueller, Arne <arne.mueller@novartis.com> wrote:
> I realized there's a massive performance difference to subset Granges
> objects by name compared to the Granges subset method.
> [...]
> Is this a known (wanted?) behavior?
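The point about the bracket can be illustrated on a plain named vector. This is only a minimal sketch: it assumes a named integer vector `x` is an acceptable stand-in for the GRanges object, the names `x`, `keep` and `idx` are made up for illustration, and it makes no claim about timings.

```r
# Stand-in for a named GRanges: a plain named integer vector (assumption).
set.seed(1)
x <- seq_len(1000)
names(x) <- paste0("tile", seq_along(x))
rn <- sample(names(x), 10)

# Logical index: TRUE for every element whose name is in rn.
keep <- names(x) %in% rn
by_subset  <- subset(x, keep)   # the form timed in the original post
by_logical <- x[keep]           # same selection through the bracket

# Numeric index: positions of the selected elements.
idx <- which(keep)
by_numeric <- x[idx]

identical(by_subset, by_logical)
identical(by_logical, by_numeric)
```

All three forms select the same elements in the order in which they occur in `x`, which is why subset() is not strictly needed once the logical or numeric index is known.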
@herve-pages-1542
Seattle, WA, United States
Hi Arne,

On 11-07-07 08:45 AM, Mueller, Arne wrote:
> I realized there's a massive performance difference to subset Granges objects by name compared to the Granges subset method.
>
> Example:
>
>> length(mm9.tiled)
> [1] 5309835
>> n = names(mm9.tiled)
>> rn = sample(n, 1000)
>> system.time(tmp <- subset(mm9.tiled, names(mm9.tiled) %in% rn))
>    user  system elapsed
>   1.610   0.131   1.741
>> system.time(tmp <- mm9.tiled[rn])
>    user  system elapsed
>  72.793   0.167  72.976

Note that subsetting with

  mm9.tiled[rn]  # A

is not the same as subsetting with

  mm9.tiled[names(mm9.tiled) %in% rn]  # B

because the latter does not reorder the elements. An equivalent to A would rather be

  mm9.tiled[match(rn, names(mm9.tiled))]  # C

and yes, C is also much faster than A (50x faster on my machine for a GRanges with 1 million elements).

I agree that this can hardly be justified: I don't see any reason why A couldn't be made as fast as C (or almost). I believe the culprit is the call to IRanges:::.bracket.Index() in the "[" method for "GRanges" objects. I'll try to come up with a fix.

Thanks for reporting this.

H.

> Is this a known (wanted?) behavior?
>
> Regards,
>
> Arne

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
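The difference between forms A, B and C can be seen on a small example. This is a hedged sketch that assumes a plain named vector `x` behaves like a named GRanges for these three kinds of subsetting; the data are invented for illustration and the sketch does not reproduce the reported timings.

```r
# Stand-in for a named GRanges: a plain named vector (assumption).
x <- c(a = 1L, b = 2L, c = 3L, d = 4L, e = 5L)
rn <- c("d", "a", "c")            # names requested in a non-sorted order

A <- x[rn]                        # by name: result follows the order of rn
B <- x[names(x) %in% rn]          # logical: result follows the order of x
C <- x[match(rn, names(x))]       # positions of rn in names(x): same as A

names(A)   # "d" "a" "c"
names(B)   # "a" "c" "d"
identical(A, C)
```

B returns the same elements as A but in the original order of `x`, so B and A are interchangeable only when the order of the result does not matter; C gives the same result as A through a numeric index.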