Memory errors using read.bismark?
2
0
Entering edit mode
@675ece96
Last seen 23 hours ago
Spain

Dear all,

I am trying to build a bsse object with 15 genome wide cytosine report generated by bismark using read.bismark. I have use it before in similar scenarios with similar files and worked just fine. The only difference now is that my reports are bigger (around 5gb compressed). I have some of the following errors

'''System errno 22 unmapping file: Invalid argument System errno 22 unmapping file: Invalid argument Stop worker failed with the error: wrong args for environment subassignment Error: BiocParallel errors 0 remote errors, element index: 14 unevaluated and other errors first remote error:'''

or

'''Error in !bpok : invalid argument type'''

I have seen previous reports about lack of memory triggering !bpok errors, nevertheless my server has 504 GB and never in the process is close to use all the memory.

Trying to solve the problem, I have used the option to write to disk #BACKEND = "HDF5Array" and even with one file, the process hangs.

Thanks in advance,

Juan

library(foreach)
library(dmrseq)
library(BiocParallel)
library(HDF5Array)
register(MulticoreParam(10))

# Read metadata
metadata <- as.data.frame.array(data.table::fread("tissue_listSamples_dmrseq.csv", nThread = 10))

# Same order for metadata file and CpG_report file in file.list
#file.list <- metadata$file_name
#test_data<- as.data.frame(metadata$SampleID)

rownames(metadata) <- metadata$SampleID

cpg_txt.dir <- "/home/rstudio/AA/samples"
dir <- "/home/rstudio/AA/results"

file.list <- file.path(cpg_txt.dir, metadata$file_name)

bsseq.obj <- read.bismark(files = file.list[1], 
                           rmZeroCov = FALSE, 
                           strandCollapse = TRUE,
                           nThread = 2,
                           #BACKEND = "HDF5Array",
                           #dir = "dir",
                           #replace = FALSE,
                           verbose = TRUE)

# include your problematic code here with any corresponding output 
# please also include the results of running the following in an R session 

sessionInfo( )
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] HDF5Array_1.30.1            rhdf5_2.46.1                DelayedArray_0.28.0        
 [4] SparseArray_1.2.4           S4Arrays_1.2.1              abind_1.4-5                
 [7] Matrix_1.6-1.1              BiocParallel_1.36.0         dmrseq_1.22.1              
[10] bsseq_1.38.0                SummarizedExperiment_1.32.0 Biobase_2.62.0             
[13] MatrixGenerics_1.14.0       matrixStats_1.2.0           GenomicRanges_1.54.1       
[16] GenomeInfoDb_1.38.8         IRanges_2.36.0              S4Vectors_0.40.2           
[19] BiocGenerics_0.48.1         foreach_1.5.2              

loaded via a namespace (and not attached):
  [1] DBI_1.2.2                     bitops_1.0-7                 
  [3] biomaRt_2.58.2                permute_0.9-7                
  [5] rlang_1.1.3                   magrittr_2.0.3               
  [7] compiler_4.3.2                RSQLite_2.3.5                
  [9] GenomicFeatures_1.54.4        DelayedMatrixStats_1.24.0    
 [11] reshape2_1.4.4                png_0.1-8                    
 [13] vctrs_0.6.5                   stringr_1.5.1                
 [15] pkgconfig_2.0.3               crayon_1.5.2                 
 [17] fastmap_1.1.1                 dbplyr_2.4.0                 
 [19] XVector_0.42.0                ellipsis_0.3.2               
 [21] utf8_1.2.4                    Rsamtools_2.18.0             
 [23] promises_1.2.1                tzdb_0.4.0                   
 [25] bit_4.0.5                     zlibbioc_1.48.2              
 [27] cachem_1.0.8                  progress_1.2.3               
 [29] blob_1.2.4                    later_1.3.2                  
 [31] rhdf5filters_1.14.1           Rhdf5lib_1.24.2              
 [33] interactiveDisplayBase_1.40.0 prettyunits_1.2.0            
 [35] parallel_4.3.2                R6_2.5.1                     
 [37] RColorBrewer_1.1-3            stringi_1.8.3                
 [39] limma_3.58.1                  rtracklayer_1.62.0           
 [41] Rcpp_1.0.12                   iterators_1.0.14             
 [43] R.utils_2.12.3                readr_2.1.5                  
 [45] splines_4.3.2                 httpuv_1.6.14                
 [47] tidyselect_1.2.0              rstudioapi_0.15.0            
 [49] yaml_2.3.8                    codetools_0.2-19             
 [51] curl_5.2.0                    doRNG_1.8.6                  
 [53] plyr_1.8.9                    regioneR_1.34.0              
 [55] lattice_0.21-9                tibble_3.2.1                 
 [57] shiny_1.8.0                   KEGGREST_1.42.0              
 [59] BiocFileCache_2.10.2          xml2_1.3.6                   
 [61] Biostrings_2.70.3             pillar_1.9.0                 
 [63] BiocManager_1.30.22           filelock_1.0.3               
 [65] rngtools_1.5.2                generics_0.1.3               
 [67] RCurl_1.98-1.14               ggplot2_3.5.0                
 [69] hms_1.1.3                     BiocVersion_3.18.1           
 [71] sparseMatrixStats_1.14.0      munsell_0.5.0                
 [73] scales_1.3.0                  bumphunter_1.44.0            
 [75] gtools_3.9.5                  xtable_1.8-4                 
 [77] glue_1.7.0                    tools_4.3.2                  
 [79] AnnotationHub_3.10.0          BiocIO_1.12.0                
 [81] data.table_1.15.0             BSgenome_1.70.2              
 [83] locfit_1.5-9.8                GenomicAlignments_1.38.2     
 [85] XML_3.99-0.16.1               grid_4.3.2                   
 [87] AnnotationDbi_1.64.1          colorspace_2.1-0             
 [89] nlme_3.1-163                  GenomeInfoDbData_1.2.11      
 [91] restfulr_0.0.15               annotatr_1.28.0              
 [93] cli_3.6.2                     rappdirs_0.3.3               
 [95] fansi_1.0.6                   dplyr_1.1.4                  
 [97] gtable_0.3.4                  outliers_0.15                
 [99] R.methodsS3_1.8.2             digest_0.6.34                
[101] rjson_0.2.21                  memoise_2.0.1                
[103] htmltools_0.5.7               R.oo_1.26.0                  
[105] lifecycle_1.0.4               httr_1.4.7                   
[107] statmod_1.5.0                 mime_0.12                    
[109] bit64_4.0.5
dmrseq bsseq • 376 views
ADD COMMENT
1
Entering edit mode
Peter Hickey ▴ 740
@petehaitch
Last seen 11 days ago
WEHI, Melbourne, Australia

I'm one of the bsseq developers. Are you able to share one of your files with me (e.g., via Dropbox, Google Drive, OneDrive, etc.). That'll make debugging much easier.

ADD COMMENT
0
Entering edit mode

Hi Peter, can you please send me your email address to share it privately?

Thanks for your quick answer!

Greetings,

Juan

ADD REPLY
1
Entering edit mode

Sure, it's actually available in the DESCRIPTION file that is included in bsseq: https://github.com/hansenlab/bsseq/blob/d32c2c6709fac68b59578fac79d0eda385104585/DESCRIPTION#L10

ADD REPLY
0
Entering edit mode
Peter Hickey ▴ 740
@petehaitch
Last seen 11 days ago
WEHI, Melbourne, Australia

Thanks for sharing an example file. This is a very large non-CpG methylation dataset, which are tricky to work with for a number of reasons (not just the sheer size of the data). That said, I was able to load the example data successfully on a machine with 500 GB RAM. I didn't profile the memory closely but I think it uses < 50% of the available RAM and took ~45 minutes.

Here's the code I ran (using an up-to-date BioC 3.18):

library(bsseq)
library(here)
library(BiocParallel)
register(SerialParam())

f <- here("data/004_0193_EM_009_pe_picard_filtered.CX_report.txt.gz")

bsseq <- read.bismark(
  files = f,
  # NOTE: This should be FALSE for non-CpG methylation; see `?read.bismark`
  strandCollapse = FALSE,
  # NOTE: Maximum verbosity.
  verbose = Inf)
# [read.bismark] Parsing files and constructing valid loci ...
# [.contructFWGRangesFromBismarkFiles] Extracting loci from '/vast/scratch/users/hickey/bsseq_9157621/data/004_0193_EM_009_pe_picard_filtered.CX_report.txt.gz'
# Done in 1150.4 secs
# [read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...
# Done in 1347.5 secs
# [read.bismark] Constructing BSseq object ...

dim(bsseq)
# [1] 1203763667          1

print(object.size(bsseq), units = "a")
# 17.9 Gb

Even when using the (IMO) incorrect strandCollapse = TRUE to load the non-CpG data, it still loaded successfully:

bsseq <- read.bismark(
  files = f,
  # NOTE: Should be FALSE for non-CpG methylation.
  strandCollapse = TRUE,
  verbose = Inf)
# [read.bismark] Parsing files and constructing valid loci ...
# [.contructFWGRangesFromBismarkFiles] Extracting loci from '/vast/scratch/users/hickey/bsseq_9157621/data/004_0193_EM_009_pe_picard_filtered.CX_report.txt.gz'
# Done in 1224.6 secs
# [read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...
# Done in 1415.2 secs
# [read.bismark] Constructing BSseq object ...

dim(bsseq)
# [1] 1174276573          1

print(object.size(bsseq), units = "a")
# 17.5 Gb

Note that all of this was done using the default nThreads = 1L and I don't recommend changing this unless you know what it does (see the documentation). Finally, if you need to create a HDF5-backed _BSseq_ object then loading the data will necessarily take longer because it has to be written to disk in an HDF5 file. The performance of this is very system-dependent.

ADD COMMENT
0
Entering edit mode

Thanks for the answer. Actually, processing one file and keeping it in memory wasn't the problem, is more processing all the 15 files to build the bss object. For your answer, my guess that the main issue is the memory, if only one file is using < 50% of the available RAM, probably will be not reachable to perform this analysis.

Thanks a lot for your answer,

Cheers,

Juan

ADD REPLY
1
Entering edit mode

If you process serially (i.e. with BPPARAM = SerialParam()) then you might be okay. The final object will be on the order of 250 GB in-memory (17.9 GB * 15 = 268.5 GB minus a few GB for the rowRanges which are only stored once in the final object) and, if I had to guess, the peak memory usage will be on the order of 300 - 350 GB.

If you combine BPPARAM = SerialParam() with BACKEND = "HDF5Array" then the memory footprint of the final object will be dramatically smaller (on the order of a few GB) and the peak memory usage will be lower but at the cost of longer processing time. You will likely need to experiment with options for chunking and compression of the HDF5 file to get the best performance. And be sure that you're writing to a fast disk (i.e. ideally a local scratch disk rather than a networked disk) to improve the performance.

You might also consider splitting up your data processing by chromosome and then cbind()-ing together the resulting _BSseq_ objects to create a genome-wide _BSseq_ object (if this is something you require).

Overall, our experience working with non-CpG methylation data is its inherently much harder than CpG data. And a lot seems to depend on the system it's being run on. However, we have used bsseq to analyse non-CpG methylation data in n=45 (https://pubmed.ncbi.nlm.nih.gov/30643296/) and n=184 (https://pubmed.ncbi.nlm.nih.gov/33888138/) samples. This work was done more 5+ years ago and, from memory, on a machine with 512 GB of RAM. I think for some of these analyses we also separated out non-CpG methylation into CpA and CpT (there was negligible CpC methylation, I think) and also separated by DNA strand which further reduce the memory requirements of the individual objects.

These choices were motivated by biological considerations but also eased some of the computational burdens, and you might consider what similar choices are appropriate for your dataset.

ADD REPLY
0
Entering edit mode

Thanks again for your answer, I will try to play a little bit with the parameters. Only one thing, my data are actually CpG methylation :)

Cheers

Juan

ADD REPLY
1
Entering edit mode

Only one thing, my data are actually CpG methylation :)

No, the file you sent me is CX methylation (i.e. CpA, CpC, CpG, CpT), as is clear from the filename (004_0193_EM_009_pe_picard_filtered.CX_report.txt.gz) and from looking at the contents of the file

gunzip -c data/004_0193_EM_009_pe_picard_filtered.CX_report.txt.gz | grep "CA[A|C|G|T]" | head -n 1
1   10108   +   0   4   CHH CAA
gunzip -c data/004_0193_EM_009_pe_picard_filtered.CX_report.txt.gz | grep "CC[A|C|G|T]" | head -n 1
1   10004   +   0   1   CHH CCC
gunzip -c data/004_0193_EM_009_pe_picard_filtered.CX_report.txt.gz | grep "CG[A|C|G|T]" | head -n 1
1   10469   +   2   0   CG  CGC
gunzip -c data/004_0193_EM_009_pe_picard_filtered.CX_report.txt.gz | grep "CT[A|C|G|T]" | head -n 1
1   10006   +   0   1   CHH CTA

If you really want to only analyse CpG methylation then it'll be much simpler if you use the appropriate file produced by Bismark rather than the file you have (which was likely produced by Bismark with the --CX_context flag).

ADD REPLY

Login before adding your answer.

Traffic: 716 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6