Creating DataFrame uses too much memory
Ludo Pagie
Dear all,

I know I'm on a slippery slope here in claiming that some R feature uses too much memory. Still, I find something odd that hampers me in getting my job done. I make a DataFrame from a large integer matrix (rows x cols = 7e6 x 300), and the process of creating the DataFrame consumes more memory than I have. I'm working on an Ubuntu machine.

My question is: am I overlooking something? Can I change my code so that the memory overhead is more reasonable? Or is there a problem with the implementation of DataFrame, or is there another issue?

Thanks for helping me out.
Ludo

(smaller) USE CASE:
##################

# load IRanges
library(IRanges)

# create a largish matrix
mm <- matrix(as.integer(NA),1e7,20)
# consumes about 800 Mb
print(object.size(mm), unit="Mb")
# 762.9 Mb

# at this point program top suggests that my R job consumes 900Mb (VIRT) /
# 833Mb (RES) memory. That's reasonable I think

# coerce the matrix into a regular data.frame
df <- as.data.frame(mm)
# also consumes about 800Mb
print(object.size(df), unit="Mb")
# 762.9 Mb

# at this point program top suggests that my R job consumes 1800Mb (VIRT) /
# 1750Mb (RES) memory. That's reasonable I think; about double the size as
# before.

# coerce the same matrix into a DataFrame
DF <- as(mm, "DataFrame")
# also consumes about 800Mb
print(object.size(DF), unit="Mb")
# 762.9 Mb

# but now top says that my R job takes 5500Mb (VIRT) / 5400Mb (RES)
# memory!!! That's about 3700Mb for coercing an 800Mb object

# sessioninfo
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] IRanges_1.20.6     BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] stats4_3.0.2
########################################################

---
Ludo Pagie
Netherlands Cancer Institute
Gene Regulation (B4)
van Steensel Group
Plesmanlaan 121
1066 CX Amsterdam
The Netherlands

Tel.: ++ 20 512 7986
Fax: ++ 20 669 1383
email: l.pagie@nki.nl
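For reference, the sizes quoted in the question follow directly from the matrix dimensions, since R stores each integer in 4 bytes. The sketch below is only a back-of-the-envelope check; `expected_mb` is a hypothetical helper name, not something from the thread, and it ignores the small per-object header.

```r
# Hypothetical helper: approximate size of an integer matrix in Mb
# (4 bytes per element, 2^20 bytes per Mb as used by object.size()).
expected_mb <- function(nrow, ncol, bytes_per_element = 4) {
  nrow * ncol * bytes_per_element / 2^20
}

small_case <- expected_mb(1e7, 20)   # the scaled-down use case
full_case  <- expected_mb(7e6, 300)  # the real 7e6 x 300 matrix

round(small_case, 1)  # 762.9, matching the object.size() output
round(full_case, 1)   # 8010.9, i.e. roughly 7.8 Gb per copy
```

By this estimate a single copy of the full matrix already needs close to 8 Gb, so every additional temporary copy made during coercion matters.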
@michael-lawrence-3846
This made my day. You're looking for R 3.1:

> mm <- matrix(as.integer(NA),1e7,20)
> print(object.size(mm), unit="Mb")
762.9 Mb
> gc()
            used  (Mb) gc trigger  (Mb)  max used  (Mb)
Ncells    799374  42.7    1265230  67.6    984024  52.6
Vcells 100903855 769.9  111583218 851.4 100907462 769.9
> df <- as.data.frame(mm)
> print(object.size(df), unit="Mb")
762.9 Mb
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    799762   42.8    1265230   67.6    984024   52.6
Vcells 200904509 1532.8  231527856 1766.5 215905175 1647.3
> DF <- as(mm, "DataFrame")
> print(object.size(DF), unit="Mb")
762.9 Mb
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    800938   42.8    1265230   67.6    984024   52.6
Vcells 300906552 2295.8  342980844 2616.8 310905457 2372.1

So at the end we are (rightfully) consuming 2295.8 MB for the three objects, having peaked at only 2372.1 MB during the operation.

Thanks,
Michael

On Tue, Jun 17, 2014 at 4:42 AM, Ludo Pagie <ludo.pagie@gmail.com> wrote:
> [...]
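The gc() comparison in the reply can be wrapped in a small helper to check peak memory for any coercion on your own R version. This is a minimal sketch using base R only; `peak_vcells_mb` is a hypothetical name, and the matrix is scaled down so it runs quickly. To test the DataFrame case, `as(mm, "DataFrame")` could be substituted for `as.data.frame(mm)` once the package providing DataFrame is loaded (IRanges in the session shown above).

```r
# Hypothetical helper (not from the thread): evaluate an expression and
# report the peak Vcells usage, in Mb, that gc() recorded while it ran.
peak_vcells_mb <- function(expr) {
  gc(reset = TRUE)             # reset the "max used" statistics
  result <- expr               # forces the lazily evaluated argument
  stats <- gc()
  # the Mb figure sits in the column right after "max used"
  mb_col <- match("max used", colnames(stats)) + 1
  list(result = result, peak_mb = stats["Vcells", mb_col])
}

mm <- matrix(NA_integer_, 1e5, 20)   # scaled down from 1e7 rows
out <- peak_vcells_mb(as.data.frame(mm))

print(object.size(out$result), units = "Mb")  # size of the result
out$peak_mb                                    # peak Vcells use, in Mb
```

The reported peak includes everything already held in the session, so comparing it against the "used" figure before the call gives the overhead of the coercion itself.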