Question: How to handle big data sets (>10GB) in Bioconductor packages?
12 months ago by
Houston, TX
perdedorium0 wrote:

I am running an R script that uses the package MethylMix to download and preprocess all the available methylation data sets from TCGA. However, when I try to process the 450K breast cancer methylation data set (size ~13GB), I get a "Cannot allocate vector of size 12.8 GB" error.

I am running R 3.4.0 on 64-bit x86_64-pc-linux-gnu using my school's computing cluster, and each node has the following properties:

• Dual Socket
• Xeon E5-2690 v3 (Haswell) : 12 cores per socket (24 cores/node), 2.6 GHz
• 64 GB DDR4-2133 (8 x 8GB dual rank x8 DIMMS)
• No local disk

so it seems as though there should be enough memory for this operation. Since the operating system is Linux, I assumed R would simply use all available memory, unlike on Windows. Checking the process limits with ulimit returns "unlimited", so I am not sure where the problem lies. My script is a loop that iterates over all the cancers available on TCGA, if that makes any difference.
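To rule out R-level limits before blaming the cluster, it can help to check object sizes and the session's memory use directly. A minimal sketch (the matrix below is just a stand-in for the real methylation data):

```r
# Inspect how large individual objects are and how much memory the
# R session is using. On 64-bit Linux R imposes no hard cap of its
# own, so the practical limit is physical RAM plus swap.
x <- matrix(0, nrow = 1000, ncol = 1000)   # ~8 MB of doubles
print(object.size(x), units = "MB")

# gc() reports current and peak ("max used") memory of the session.
gc()
```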

sessionInfo():
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Matrix products: default
BLAS/LAPACK: /opt/apps/intel/16.0.1.150/compilers_and_libraries_2016.1.150/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so
attached base packages: stats, graphics, grDevices, utils, datasets, methods, base
loaded via a namespace (and not attached): compiler_3.4.0

Code (inside for() loop):

# Download methylation data.
METdirectories <- tryCatch(
  {
    # Download_DNAmethylation() is MethylMix's downloader for cancer site i.
    dirs <- Download_DNAmethylation(i, targetDirectory)
    # If everything went all right, make a note of that in the output file.
    # (Logged here rather than in a finally clause, which would also run on error.)
    cat(paste("Successfully downloaded methylation data for cancer", i),
        file = logfile, append = TRUE, sep = "\n")
    dirs
  }, warning = function(w) {
    # For warnings, write them to the output file.
    cat(paste("Warning in cancer", i, "when downloading methylation data:",
              conditionMessage(w)),
        file = logfile, append = TRUE, sep = "\n")
    NULL
  }, error = function(e) {
    # For errors, write them to the output file and then skip to the next cancer.
    cat(paste("Error in cancer", i, "when downloading methylation data:",
              conditionMessage(e)),
        file = logfile, append = TRUE, sep = "\n")
    NULL
  }
)
if (is.null(METdirectories)) next

# Process methylation data.
METProcessedData <- tryCatch(
  {
    res <- Preprocess_DNAmethylation(i, METdirectories)
    # If everything went all right, make a note of that in the output file.
    cat(paste("Successfully processed methylation data for cancer", i),
        file = logfile, append = TRUE, sep = "\n")
    res
  }, warning = function(w) {
    # For warnings, write them to the output file.
    cat(paste("Warning in cancer", i, "when processing methylation data:",
              conditionMessage(w)),
        file = logfile, append = TRUE, sep = "\n")
    NULL
  }, error = function(e) {
    # For errors, write them to the output file and then skip to the next cancer.
    cat(paste("Error in cancer", i, "when processing methylation data:",
              conditionMessage(e)),
        file = logfile, append = TRUE, sep = "\n")
    NULL
  }
)
if (is.null(METProcessedData)) next
# Save methylation processed data.
saveRDS(METProcessedData,
        file = paste0(targetDirectory, "/Methylation/MET_", i, "_Processed.rds"))
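Since the loop visits many large data sets in one session, one common mitigation (not in the script above, just a sketch) is to drop references to the big intermediates and trigger a garbage collection at the end of each iteration, so memory from one cancer type can be reclaimed before the next:

```r
# Hypothetical end-of-iteration cleanup: remove the large objects
# bound in this iteration and ask R to collect the freed memory.
rm(METProcessedData, METdirectories)
gc()   # returns freed pages to the allocator / OS where possible
```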
modified 12 months ago by Martin Morgan ♦♦ 23k • written 12 months ago by perdedorium0
Answer: How to handle big data sets (>10GB) in Bioconductor packages?
12 months ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

Because R uses 'copy-on-modify' semantics, rather than reference-based semantics, it will often make several copies of data during, e.g., function calls. A rule of thumb, admittedly without a firm basis, is that R will typically use 4x the memory of its largest object, so it is in some ways not surprising that you run out of memory here.
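The copying behaviour can be observed directly with tracemem(), which prints a message each time R duplicates the traced object. A minimal illustration:

```r
# Demonstrate copy-on-modify: assignment alone does not copy, but
# modifying either binding forces a duplication of the shared vector.
x <- numeric(1e6)
tracemem(x)      # start reporting copies of x
y <- x           # no copy yet: x and y point at the same memory
y[1] <- 1        # tracemem reports a duplication here; x is untouched
untracemem(x)
```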

It could well be that the function you're calling is written so that it uses more memory than required; hopefully the maintainer will respond here, profile their code, and arrive at a more efficient implementation.


Out of curiosity I thought I'd run this on a machine with 1TB of RAM to see how much is actually required. My first comment is that I can see why you'd want to do it on a cluster: after 24 hours I've only managed to process 10% of the breast cancer dataset. More pertinently, here's the last line of the processing output, followed by a call to gc() after I stopped the process:

Starting batch 1 of 33
Starting batch 2 of 33
Starting batch 3 of 33
> gc()
             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells    1545225   82.6    4636290   247.7  351943442 18795.9
Vcells 1112069371 8484.5 3511821818 26793.1 8096548467 61771.8


You can see that the maximum amount of memory used is ~80GB (~62GB of Vcells plus ~19GB of Ncells), which is why you're running out of room on your 64GB cluster nodes.
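One caveat when reading the "max used" column: it accumulates over the whole session. To measure the peak for a single step, the counters can be reset first. A small sketch (the allocation below is just a placeholder for the real processing call):

```r
invisible(gc(reset = TRUE))             # reset the "max used" counters
tmp <- matrix(rnorm(1e7), ncol = 100)   # stand-in for the expensive step
rm(tmp)
gc()                                    # "max used" now reflects just this step
```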