How to handle big data sets (>10GB) in Bioconductor packages?
@perdedorium-15654
Last seen 5.6 years ago
Houston, TX

I am running an R script that uses the package MethylMix to download and preprocess all the available methylation data sets from TCGA. However, when I try to process the 450K breast cancer methylation data set (size ~13GB), I get a "Cannot allocate vector of size 12.8 GB" error.

I am running R 3.4.0 on 64-bit x86_64-pc-linux-gnu using my school's computing cluster, and each node has the following properties:

  • Dual Socket
  • Xeon E5-2690 v3 (Haswell) : 12 cores per socket (24 cores/node), 2.6 GHz
  • 64 GB DDR4-2133 (8 x 8GB dual rank x8 DIMMS)
  • No local disk
  • Hyperthreading Enabled - 48 threads (logical CPUs) per node

so it seems as though there should be enough memory for this operation. Since the operating system is Linux, I assumed R would simply use all available memory, unlike on Windows. Checking the process limits with ulimit returns "unlimited", so I am not sure where the problem lies. My script is a loop that iterates over all the cancers available on TCGA, if that makes any difference.
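
For reference, a rough sketch of how the limits can be checked from within R (assuming standard Linux tools like ulimit and free are available on the node; the exact output will of course differ per cluster):

  system("ulimit -v")   # per-process virtual memory limit seen by the shell R launches
  system("free -g")     # total / used / free RAM on the node, in GB
  cat(R.version.string, "-", 8 * .Machine$sizeof.pointer, "bit\n")   # confirm a 64-bit build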

sessionInfo():
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Matrix products: default
BLAS/LAPACK: /opt/apps/intel/16.0.1.150/compilers_and_libraries_2016.1.150/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so
attached base packages: stats, graphics, grDevices, utils, datasets, methods, base
loaded via a namespace (and not attached): compiler_3.4.0

Code (inside for() loop):

  # Download methylation data.
  METdirectories <- tryCatch(
    {
      Download_DNAmethylation(i, paste0(targetDirectory, "/Methylation/"))
    }, warning = function(w) {
      # Log warnings to the output file; returning NULL skips to the next cancer.
      cat(paste("Warning in cancer", i, "when downloading methylation data:", conditionMessage(w)), file = logfile, append = TRUE, sep = "\n")
      NULL
    }, error = function(e) {
      # Log errors to the output file; returning NULL skips to the next cancer.
      cat(paste("Error in cancer", i, "when downloading methylation data:", conditionMessage(e)), file = logfile, append = TRUE, sep = "\n")
      NULL
    }
  )
  if (is.null(METdirectories)) next
  # Only reached when the download succeeded: note that in the output file.
  cat(paste("Successfully downloaded methylation data for cancer", i), file = logfile, append = TRUE, sep = "\n")

  # Process methylation data.
  METProcessedData <- tryCatch(
    {
      Preprocess_DNAmethylation(i, METdirectories)
    }, warning = function(w) {
      # Log warnings to the output file; returning NULL skips to the next cancer.
      cat(paste("Warning in cancer", i, "when processing methylation data:", conditionMessage(w)), file = logfile, append = TRUE, sep = "\n")
      NULL
    }, error = function(e) {
      # Log errors to the output file; returning NULL skips to the next cancer.
      cat(paste("Error in cancer", i, "when processing methylation data:", conditionMessage(e)), file = logfile, append = TRUE, sep = "\n")
      NULL
    }
  )
  if (is.null(METProcessedData)) next
  # Only reached when processing succeeded: note that in the output file.
  cat(paste("Successfully processed methylation data for cancer", i), file = logfile, append = TRUE, sep = "\n")
  # Save the processed methylation data.
  saveRDS(METProcessedData, file = paste0(targetDirectory, "/Methylation/MET_", i, "_Processed.rds"))
Tags: big data, methylmix, memory problem
@martin-morgan-1513
Last seen 16 days ago
United States

Because R uses 'copy-on-change' semantics, rather than reference-based semantics, it will often make several copies of data during, e.g., function calls. A rule of thumb (for which I have no hard basis) is that R typically needs about 4x the memory of the largest object, so it is in some ways not surprising that you run out of memory here.
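
For example, a minimal sketch of that copying, using tracemem() (which needs R compiled with memory profiling, the default for standard builds):

  x <- matrix(runif(1e6), nrow = 1000)    # ~8 MB of doubles
  tracemem(x)                             # start reporting copies of x
  f <- function(m) { m[1, 1] <- 0; m }    # modifying the argument forces a full duplication
  y <- f(x)                               # tracemem() reports the copy made here
  untracemem(x)

Both x and y then occupy memory until one of them is removed and garbage collected.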

It could well be that the function you're calling is written so that it uses more memory than required; hopefully the maintainer will respond here, profile their code, and arrive at a more efficient implementation.
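
If someone wants to see where the large allocations actually happen, a rough sketch with the built-in memory profiler (Rprofmem() likewise needs a build with memory profiling enabled; the file name and 100 MB threshold below are arbitrary choices):

  Rprofmem("methylmix_allocations.out", threshold = 1e8)   # log every allocation larger than ~100 MB
  METProcessedData <- Preprocess_DNAmethylation(i, METdirectories)
  Rprofmem(NULL)                                            # stop logging
  noquote(readLines("methylmix_allocations.out", n = 5))    # each line: bytes allocated plus the call stack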


Out of curiosity I thought I'd run this on a machine with 1 TB of RAM to see how much memory is actually required. My first comment is that I can see why you'd want to do this on a cluster: after 24 hours I've only managed to process about 10% of the breast cancer dataset. More pertinently, here's the last line of the processing output, followed by a call to gc() after I stopped the process:

Starting batch 1 of 33Starting batch 2 of 33Starting batch 3 of 33
> gc()
             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells    1545225   82.6    4636290   247.7  351943442 18795.9
Vcells 1112069371 8484.5 3511821818 26793.1 8096548467 61771.8

You can see that the maximum amount of memory used was ~80 GB (the "max used" columns for Ncells plus Vcells), which is why you're running out of room on your 64 GB cluster nodes.
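
If you want to measure the peak for a single step yourself, one rough approach is to reset the counters just before the call, so that the "max used" column reflects only that step:

  gc(reset = TRUE)                                           # zero the "max used" statistics
  METProcessedData <- Preprocess_DNAmethylation(i, METdirectories)
  gc()                                                       # "max used" now shows the peak for this call alone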
