GRanges performance issue


arne.mueller@novartis.com ▴ 200

@arnemuellernovartiscom-2205

Last seen 8.5 years ago

Switzerland

Hello,

I realized there's a massive performance difference between subsetting a GRanges object by name and subsetting it with the GRanges subset method.

Example:

    > length(mm9.tiled)
    [1] 5309835
    > n = names(mm9.tiled)
    > rn = sample(n, 1000)
    > system.time(tmp <- subset(mm9.tiled, names(mm9.tiled) %in% rn))
       user  system elapsed
      1.610   0.131   1.741
    > system.time(tmp <- mm9.tiled[rn])
       user  system elapsed
     72.793   0.167  72.976

    > sessionInfo()
    R version 2.14.0 Under development (unstable) (2011-06-01 r56028)
    Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices datasets  utils     methods   base

    other attached packages:
    [1] GenomicRanges_1.5.12 IRanges_1.11.10

    loaded via a namespace (and not attached):
    [1] tools_2.14.0

Is this a known (wanted?) behavior?

Regards,

Arne
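The object mm9.tiled is not defined in the post, so the comparison cannot be rerun as written. The following is a minimal sketch for reproducing it, assuming a hypothetical stand-in object (`tiles`, 100,000 named tiles on a single made-up chromosome) and using the logical-mask form of the bracket in place of subset(). Timings will depend heavily on the installed GenomicRanges/IRanges versions, so no expected numbers are given.

```r
# Sketch only: `tiles` is a hypothetical stand-in for mm9.tiled.
library(GenomicRanges)

set.seed(1)
n_tiles <- 100000L
tiles <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = seq(1L, by = 1000L, length.out = n_tiles),
                   width = 1000L)
)
names(tiles) <- paste0("tile", seq_len(n_tiles))

# 1000 names drawn at random, as in the original post
rn <- sample(names(tiles), 1000)

# Subset with a logical mask (keeps the object's own order)
t_mask <- system.time(by_mask <- tiles[names(tiles) %in% rn])

# Subset by name (returns elements in the order of rn)
t_name <- system.time(by_name <- tiles[rn])

cat("logical mask:", t_mask[["elapsed"]], "s\n")
cat("by name:     ", t_name[["elapsed"]], "s\n")
```

Both calls select the same 1000 ranges; only the order of the result and the time taken may differ.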


updated 12.8 years ago by Hervé Pagès 16k • written 12.8 years ago by arne.mueller@novartis.com ▴ 200


Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 5 weeks ago

United States

I don't think this is a "wanted" behavior, but the two computations are fairly different. I believe you don't have to use subset(): as long as you know the indices you want, numerically or logically, the bracket should work just as fast.

On Thu, Jul 7, 2011 at 11:45 AM, Mueller, Arne <arne.mueller@novartis.com> wrote:

> I realized there's a massive performance difference to subset Granges
> objects by name compared to the Granges subset method.
>
> [...]
>
> Is this a known (wanted?) behavior?
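To illustrate the point about indexing with the bracket, here is a minimal sketch on a small, made-up GRanges object (`gr`, `wanted`, and the tile names are all hypothetical). It shows that a logical index and the corresponding integer positions select the same elements, so names can be converted to either form before subsetting.

```r
# Sketch only: a tiny hypothetical GRanges with ten named tiles.
library(GenomicRanges)

gr <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = seq(1L, by = 100L, length.out = 10L),
                   width = 100L)
)
names(gr) <- paste0("tile", 1:10)

wanted <- c("tile7", "tile2", "tile9")

# Logical index: TRUE for every element whose name is wanted
keep <- names(gr) %in% wanted
by_logical <- gr[keep]

# Integer index: the positions of those same elements
pos <- which(keep)
by_integer <- gr[pos]

# Both forms return the same ranges, in the object's own order
print(names(by_logical))
print(names(by_integer))
```

Neither form requires a call to subset(); the index is computed once and passed straight to "[".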

written 12.8 years ago by Vincent J. Carey, Jr. 6.7k


Hervé Pagès 16k

@herve-pages-1542

Last seen 9 minutes ago

Seattle, WA, United States

Hi Arne,

On 11-07-07 08:45 AM, Mueller, Arne wrote:
> I realized there's a massive performance difference to subset Granges objects by name compared to the Granges subset method.
>
> Example:
>
>> system.time(tmp<- subset(mm9.tiled, names(mm9.tiled) %in% rn))
>    user  system elapsed
>   1.610   0.131   1.741
>> system.time(tmp<- mm9.tiled[rn])
>    user  system elapsed
>  72.793   0.167  72.976

Note that subsetting with

    mm9.tiled[rn]                            # A

is not the same as subsetting with

    mm9.tiled[names(mm9.tiled) %in% rn]      # B

because the latter keeps the elements in their original order instead of returning them in the order given by rn. An equivalent to A would rather be

    mm9.tiled[match(rn, names(mm9.tiled))]   # C

and yes, C is also much faster than A (50x faster on my machine for a GRanges with 1 million elements).

I agree that this can hardly be justified: I don't see any reason why A couldn't be made as fast as C (or almost). I believe the culprit is the call to IRanges:::.bracket.Index() in the "[" method for "GRanges" objects. I'll try to come up with a fix.

Thanks for reporting this.

H.

> Is this a known (wanted?) behavior?
>
> Regards,
>
> Arne

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
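The difference between forms A, B, and C described in the reply above can be checked on a small example. This is a minimal sketch, assuming a made-up ten-element GRanges (`gr`) and a hypothetical name vector `rn` given in a deliberately scrambled order; it only demonstrates the ordering of the results, not the timings.

```r
# Sketch only: `gr` and `rn` are hypothetical illustration data.
library(GenomicRanges)

gr <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = seq(1L, by = 100L, length.out = 10L),
                   width = 100L)
)
names(gr) <- paste0("tile", 1:10)

# Names requested in an order that differs from the object's order
rn <- c("tile7", "tile2", "tile9")

res_A <- gr[rn]                        # A: by name, follows the order of rn
res_B <- gr[names(gr) %in% rn]         # B: logical mask, follows the order of gr
res_C <- gr[match(rn, names(gr))]      # C: integer positions, follows the order of rn

print(names(res_A))
print(names(res_B))
print(names(res_C))
```

A and C return the ranges in the order of rn, while B returns them in the object's own order. When the order of the result matters, C is the drop-in replacement for A; B is only equivalent when order is irrelevant and the names are unique.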

written 12.8 years ago by Hervé Pagès 16k
