ATACseqQC memory overflow issue
2
0
Entering edit mode
@jose-luis-lavin-5531
Last seen 3.6 years ago
Spain

Dear All,

I'm trying to analizesome ATAC-seq samples using the ATACseqQC package and I am unable to make it work correctly.

The proccess always crashes when I reach the same part of the workflow shown in the vignette this function

fragSize <- fragSizeDist(bamfile, bamfile.labels) 

After a while running a memory overflow happens and the proccess gets killed.

I'm trying to analyze 100bp Illumina  Paired-End samples aligned with Bowtie2. The de-duplicated Bamfiles size is around 3,5 GB and the memory overflow occurs at 84 GB.

Is there any minimmum RAM requirement specifiedfor ATACseqQC that I overlooked? Or is it just that this particular package uses huge amounts of memory anyway?

Does anybody know af an alternative way to carry out this kind of analysis (ATACseq) that render some quality Heatmap and coverage curve for nucleosome positions?

Here is my code (up to the step it works):

library(ATACseqQC)
library(ChIPpeakAnno)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(phastCons100way.UCSC.hg19)
library(MotifDb)

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

setwd("/path_to_filees/BAM_dedup/")

file.list <- as.list(list.files(".", pattern = "*.bam$"))
file.list

bamfile <-file.list[[1]]
bamfile.labels <- gsub(".bam", "", basename(bamfile))

bamQC(bamfile, outPath=NULL)

## generate fragement size distribution
fragSize <- fragSizeDist(bamfile, bamfile.labels)

And my Session Info:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS: /opt/R/R-3.4.2/lib/libRblas.so
LAPACK: /opt/R/R-3.4.2/lib/libRlapack.so

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] MotifDb_1.20.0
 [2] phastCons100way.UCSC.hg19_3.6.0
 [3] GenomicScores_1.2.1
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [5] GenomicFeatures_1.30.0
 [6] AnnotationDbi_1.40.0
 [7] Biobase_2.38.0
 [8] BSgenome.Hsapiens.UCSC.hg19_1.4.0
 [9] BSgenome_1.46.0
[10] rtracklayer_1.38.2
[11] ChIPpeakAnno_3.12.4
[12] VennDiagram_1.6.18
[13] futile.logger_1.4.3
[14] GenomicRanges_1.30.1
[15] GenomeInfoDb_1.14.0
[16] Biostrings_2.46.0
[17] XVector_0.18.0
[18] IRanges_2.12.0
[19] ATACseqQC_1.2.7
[20] S4Vectors_0.16.0
[21] BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.10.0           bitops_1.0-6
 [3] matrixStats_0.52.2            bit64_0.9-7
 [5] progress_1.1.2                httr_1.3.1
 [7] tools_3.4.2                   R6_2.2.2
 [9] rGADEM_2.26.0                 splitstackshape_1.4.2
[11] seqLogo_1.44.0                DBI_0.7
[13] lazyeval_0.2.1                colorspace_1.3-2
[15] ade4_1.7-10                   motifStack_1.22.0
[17] prettyunits_1.0.2             RMySQL_0.10.13
[19] bit_1.1-12                    curl_3.1
[21] compiler_3.4.2                graph_1.56.0
[23] DelayedArray_0.4.1            grImport_0.9-0
[25] scales_0.5.0                  randomForest_4.6-12
[27] RBGL_1.54.0                   stringr_1.2.0
[29] digest_0.6.12                 Rsamtools_1.30.0
[31] pkgconfig_2.0.1               htmltools_0.3.6
[33] ensembldb_2.2.0               limma_3.34.0
[35] regioneR_1.10.0               htmlwidgets_0.9
[37] rlang_0.1.4                   RSQLite_2.0
[39] BiocInstaller_1.28.0          shiny_1.0.5
[41] BiocParallel_1.12.0           RCurl_1.95-4.8
[43] magrittr_1.5                  GO.db_3.5.0
[45] GenomeInfoDbData_1.0.0        Matrix_1.2-11
[47] Rcpp_0.12.13                  munsell_0.4.3
[49] stringi_1.1.6                 yaml_2.1.16
[51] MASS_7.3-47                   SummarizedExperiment_1.8.1
[53] zlibbioc_1.24.0               plyr_1.8.4
[55] AnnotationHub_2.10.1          blob_1.1.0
[57] lattice_0.20-35               splines_3.4.2
[59] multtest_2.34.0               MotIV_1.34.0
[61] seqinr_3.4-5                  biomaRt_2.34.1
[63] futile.options_1.0.0          XML_3.98-1.9
[65] data.table_1.10.4-3           lambda.r_1.2
[67] idr_1.2                       httpuv_1.3.5
[69] assertthat_0.2.0              mime_0.5
[71] xtable_1.8-2                  AnnotationFilter_1.2.0
[73] survival_2.41-3               tibble_1.3.4
[75] GenomicAlignments_1.14.1      memoise_1.1.0
[77] interactiveDisplayBase_1.16.0

 

Thanks in advance!
 

JL

 

 

atac-seq software error atacseqqc • 1.0k views
ADD COMMENT
0
Entering edit mode

Hi JL,

Thank you for trying ATACseqQC. Please have a try to remove bamQC from you current script. And have a try. I'm working on bamQC to improve this issue.

Jianhong.

ADD REPLY
0
Entering edit mode

Dear Jianhong,

I have tried the following script (the bamQC command actually worked). I tried your suggestion using only a couple of chromosomes (e.g. 1 and 2) and it crashes at the a point a little bit further...

library(ATACseqQC)
library(ChIPpeakAnno)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(phastCons100way.UCSC.hg19)
library(MotifDb)

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

file.list <- as.list(list.files(".", pattern = "*.bam$"))

bamfile <-file.list[[1]]
bamfile.labels <- gsub(".bam", "", basename(bamfile))

outPath <- "splited"
dir.create(outPath)

bamQC(bamfile, outPath=NULL)

## estimate Library Complexity---------------------------------------------
estimateLibComplexity(readsDupFreq(bamfile))
## generate fragement size distribution
fragSize <- fragSizeDist(bamfile, bamfile.labels)

## ------------------------------------------------------------------------
## bamfile tags
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
## files will be output into outPath

## shift the bam file by the 5'ends
library(BSgenome.Hsapiens.UCSC.hg19)

seqlev = paste0("chr", c(1:2, "X", "Y"))
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")
gal <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE)
gal1 <- shiftGAlignmentsList(gal)#######<- Crashes here #################
shiftedBamfile <- file.path(outPath, "shifted.bam")
export(gal1, shiftedBamfile)

SessionInfo is the same than the previous but now I am using 96 GB of RAM

Any thoughts/Suggestions or update of the package ?

Thanks in advance for your kind help

Best

JL
 

ADD REPLY
0
Entering edit mode

Getting also process killed when reading huge bam file ~ 9G with readBamFile function.

ADD REPLY
0
Entering edit mode

One of the developers, Jianhong, will work on making it work with big bam files. For now, please try to generate the plots using a couple of chromosomes from your BAM files. Thanks!

Best regards,

Julie

ADD REPLY
0
Entering edit mode

Thanks to Jianhong, the memory issue has be fixed in the development version.

Here are the scripts to install the dev version of ATACseqQC.

library(devtools); install_github("jianhong/ATACseqQC")

Please set bigFile=TRUE when using readBamFile. Thanks!

Best regards,

Julie

ADD REPLY
0
Entering edit mode
Julie Zhu ★ 4.3k
@julie-zhu-3596
Last seen 11 weeks ago
United States
José, Thanks for the feedback! Jianhong, one of the developers will work with you on the coding side. For now, please try to generate the plots using a couple of chromosomes from your BAM files. FYI, we have tested the effects of sequencing depth on the patterns in various diagnostic plots using subsamples of the BAM files from a successful experiment. Our results show that the fragment size distribution and the aggregated signals around TSSs from the subsamples with as low as 2.6 million uniquely mapped reads exhibit similar patterns to that of the full dataset (more than 26 million uniquely mapped reads) and are easily distinguishable from that of the failed experiment. Best regards, Julie From: "José Luis Lavín [bioc]" <noreply@bioconductor.org> Reply-To: "reply+870ae661+code@bioconductor.org" <reply+870ae661+code@bioconductor.org> Date: Monday, February 12, 2018 at 11:10 AM To: "Zhu, Lihua (Julie)" <julie.zhu@umassmed.edu> Subject: [bioc] ATACseqQC memory overflow issue Activity on a post you are following on support.bioconductor.org<https: support.bioconductor.org=""> User José Luis Lavín<https: support.bioconductor.org="" u="" 5531=""/> wrote Question: ATACseqQC memory overflow issue<https: support.bioconductor.org="" p="" 105904=""/>: Dear All, I'm trying to analizesome ATAC-seq samples using the ATACseqQC package and I am unable to make it work correctly. The proccess always crashes when I reach the same part of the workflow shown in the vignette this function fragSize <- fragSizeDist(bamfile, bamfile.labels) After a while running a memory overflow happens and the proccess gets killed. I'm trying to analyze 100bp Illumina Paired-End samples aligned with Bowtie2. The de-duplicated Bamfiles size is around 3,5 GB and the memory overflow occurs at 84 GB. Is there any minimmum RAM requirement specifiedfor ATACseqQC that I overlooked? Or is it just that this particular package uses huge amounts of memory anyway? Does anybody know af an alternative way to carry out this kind of analysis (ATACseq) that render some quality Heatmap and coverage curve for nucleosome positions? Here is my code (up to the step it works): library(ATACseqQC) library(ChIPpeakAnno) library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(phastCons100way.UCSC.hg19) library(MotifDb) knitr::opts_chunk$set(warning=FALSE, message=FALSE) setwd("/path_to_filees/BAM_dedup/") file.list <- as.list(list.files(".", pattern = "*.bam$")) file.list bamfile <-file.list[[1]] bamfile.labels <- gsub(".bam", "", basename(bamfile)) bamQC(bamfile, outPath=NULL) ## generate fragement size distribution fragSize <- fragSizeDist(bamfile, bamfile.labels) And my Session Info: > sessionInfo() R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Scientific Linux 7.4 (Nitrogen) Matrix products: default BLAS: /opt/R/R-3.4.2/lib/libRblas.so LAPACK: /opt/R/R-3.4.2/lib/libRlapack.so locale: [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8 [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8 [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats4 parallel stats graphics grDevices utils [8] datasets methods base other attached packages: [1] MotifDb_1.20.0 [2] phastCons100way.UCSC.hg19_3.6.0 [3] GenomicScores_1.2.1 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 [5] GenomicFeatures_1.30.0 [6] AnnotationDbi_1.40.0 [7] Biobase_2.38.0 [8] BSgenome.Hsapiens.UCSC.hg19_1.4.0 [9] BSgenome_1.46.0 [10] rtracklayer_1.38.2 [11] ChIPpeakAnno_3.12.4 [12] VennDiagram_1.6.18 [13] futile.logger_1.4.3 [14] GenomicRanges_1.30.1 [15] GenomeInfoDb_1.14.0 [16] Biostrings_2.46.0 [17] XVector_0.18.0 [18] IRanges_2.12.0 [19] ATACseqQC_1.2.7 [20] S4Vectors_0.16.0 [21] BiocGenerics_0.24.0 loaded via a namespace (and not attached): [1] ProtGenerics_1.10.0 bitops_1.0-6 [3] matrixStats_0.52.2 bit64_0.9-7 [5] progress_1.1.2 httr_1.3.1 [7] tools_3.4.2 R6_2.2.2 [9] rGADEM_2.26.0 splitstackshape_1.4.2 [11] seqLogo_1.44.0 DBI_0.7 [13] lazyeval_0.2.1 colorspace_1.3-2 [15] ade4_1.7-10 motifStack_1.22.0 [17] prettyunits_1.0.2 RMySQL_0.10.13 [19] bit_1.1-12 curl_3.1 [21] compiler_3.4.2 graph_1.56.0 [23] DelayedArray_0.4.1 grImport_0.9-0 [25] scales_0.5.0 randomForest_4.6-12 [27] RBGL_1.54.0 stringr_1.2.0 [29] digest_0.6.12 Rsamtools_1.30.0 [31] pkgconfig_2.0.1 htmltools_0.3.6 [33] ensembldb_2.2.0 limma_3.34.0 [35] regioneR_1.10.0 htmlwidgets_0.9 [37] rlang_0.1.4 RSQLite_2.0 [39] BiocInstaller_1.28.0 shiny_1.0.5 [41] BiocParallel_1.12.0 RCurl_1.95-4.8 [43] magrittr_1.5 GO.db_3.5.0 [45] GenomeInfoDbData_1.0.0 Matrix_1.2-11 [47] Rcpp_0.12.13 munsell_0.4.3 [49] stringi_1.1.6 yaml_2.1.16 [51] MASS_7.3-47 SummarizedExperiment_1.8.1 [53] zlibbioc_1.24.0 plyr_1.8.4 [55] AnnotationHub_2.10.1 blob_1.1.0 [57] lattice_0.20-35 splines_3.4.2 [59] multtest_2.34.0 MotIV_1.34.0 [61] seqinr_3.4-5 biomaRt_2.34.1 [63] futile.options_1.0.0 XML_3.98-1.9 [65] data.table_1.10.4-3 lambda.r_1.2 [67] idr_1.2 httpuv_1.3.5 [69] assertthat_0.2.0 mime_0.5 [71] xtable_1.8-2 AnnotationFilter_1.2.0 [73] survival_2.41-3 tibble_1.3.4 [75] GenomicAlignments_1.14.1 memoise_1.1.0 [77] interactiveDisplayBase_1.16.0 Thanks in advance! JL Post tags: atac-seq, software error You may reply via email or visit ATACseqQC memory overflow issue
ADD COMMENT
0
Entering edit mode
Haibo.Liu • 0
@haiboliu-13750
Last seen 3.6 years ago

Please refer to my git hub scripts: https://github.com/seaYali/ATACseqQC/blob/master/ScriptsForATACseqQCCaseStudy.txt for your analysis. Currently, the bottle neck of this package is the BamQC.R function. I did this step using samtools instead for efficinecy. Importantly, for the purpose of QC, You don't need to use the full bam except for BAM QC. Using data from a few chromosomes will be good enough. Using Phastcons information to bin reads into mono-, di- or tri-nucleosome is not necessary, and increase the computing time significantly.

ADD COMMENT

Login before adding your answer.

Traffic: 423 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6