Search
Question: ATACseqQC memory overflow issue
0
gravatar for José Luis Lavín
6 months ago by
Spain
José Luis Lavín10 wrote:

Dear All,

I'm trying to analizesome ATAC-seq samples using the ATACseqQC package and I am unable to make it work correctly.

The proccess always crashes when I reach the same part of the workflow shown in the vignette this function

fragSize <- fragSizeDist(bamfile, bamfile.labels) 

After a while running a memory overflow happens and the proccess gets killed.

I'm trying to analyze 100bp Illumina  Paired-End samples aligned with Bowtie2. The de-duplicated Bamfiles size is around 3,5 GB and the memory overflow occurs at 84 GB.

Is there any minimmum RAM requirement specifiedfor ATACseqQC that I overlooked? Or is it just that this particular package uses huge amounts of memory anyway?

Does anybody know af an alternative way to carry out this kind of analysis (ATACseq) that render some quality Heatmap and coverage curve for nucleosome positions?

Here is my code (up to the step it works):

library(ATACseqQC)
library(ChIPpeakAnno)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(phastCons100way.UCSC.hg19)
library(MotifDb)

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

setwd("/path_to_filees/BAM_dedup/")

file.list <- as.list(list.files(".", pattern = "*.bam$"))
file.list

bamfile <-file.list[[1]]
bamfile.labels <- gsub(".bam", "", basename(bamfile))

bamQC(bamfile, outPath=NULL)

## generate fragement size distribution
fragSize <- fragSizeDist(bamfile, bamfile.labels)

And my Session Info:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS: /opt/R/R-3.4.2/lib/libRblas.so
LAPACK: /opt/R/R-3.4.2/lib/libRlapack.so

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] MotifDb_1.20.0
 [2] phastCons100way.UCSC.hg19_3.6.0
 [3] GenomicScores_1.2.1
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [5] GenomicFeatures_1.30.0
 [6] AnnotationDbi_1.40.0
 [7] Biobase_2.38.0
 [8] BSgenome.Hsapiens.UCSC.hg19_1.4.0
 [9] BSgenome_1.46.0
[10] rtracklayer_1.38.2
[11] ChIPpeakAnno_3.12.4
[12] VennDiagram_1.6.18
[13] futile.logger_1.4.3
[14] GenomicRanges_1.30.1
[15] GenomeInfoDb_1.14.0
[16] Biostrings_2.46.0
[17] XVector_0.18.0
[18] IRanges_2.12.0
[19] ATACseqQC_1.2.7
[20] S4Vectors_0.16.0
[21] BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.10.0           bitops_1.0-6
 [3] matrixStats_0.52.2            bit64_0.9-7
 [5] progress_1.1.2                httr_1.3.1
 [7] tools_3.4.2                   R6_2.2.2
 [9] rGADEM_2.26.0                 splitstackshape_1.4.2
[11] seqLogo_1.44.0                DBI_0.7
[13] lazyeval_0.2.1                colorspace_1.3-2
[15] ade4_1.7-10                   motifStack_1.22.0
[17] prettyunits_1.0.2             RMySQL_0.10.13
[19] bit_1.1-12                    curl_3.1
[21] compiler_3.4.2                graph_1.56.0
[23] DelayedArray_0.4.1            grImport_0.9-0
[25] scales_0.5.0                  randomForest_4.6-12
[27] RBGL_1.54.0                   stringr_1.2.0
[29] digest_0.6.12                 Rsamtools_1.30.0
[31] pkgconfig_2.0.1               htmltools_0.3.6
[33] ensembldb_2.2.0               limma_3.34.0
[35] regioneR_1.10.0               htmlwidgets_0.9
[37] rlang_0.1.4                   RSQLite_2.0
[39] BiocInstaller_1.28.0          shiny_1.0.5
[41] BiocParallel_1.12.0           RCurl_1.95-4.8
[43] magrittr_1.5                  GO.db_3.5.0
[45] GenomeInfoDbData_1.0.0        Matrix_1.2-11
[47] Rcpp_0.12.13                  munsell_0.4.3
[49] stringi_1.1.6                 yaml_2.1.16
[51] MASS_7.3-47                   SummarizedExperiment_1.8.1
[53] zlibbioc_1.24.0               plyr_1.8.4
[55] AnnotationHub_2.10.1          blob_1.1.0
[57] lattice_0.20-35               splines_3.4.2
[59] multtest_2.34.0               MotIV_1.34.0
[61] seqinr_3.4-5                  biomaRt_2.34.1
[63] futile.options_1.0.0          XML_3.98-1.9
[65] data.table_1.10.4-3           lambda.r_1.2
[67] idr_1.2                       httpuv_1.3.5
[69] assertthat_0.2.0              mime_0.5
[71] xtable_1.8-2                  AnnotationFilter_1.2.0
[73] survival_2.41-3               tibble_1.3.4
[75] GenomicAlignments_1.14.1      memoise_1.1.0
[77] interactiveDisplayBase_1.16.0

 

Thanks in advance!
 

JL

 

 

ADD COMMENTlink modified 5 months ago by Haibo.Liu0 • written 6 months ago by José Luis Lavín10

Hi JL,

Thank you for trying ATACseqQC. Please have a try to remove bamQC from you current script. And have a try. I'm working on bamQC to improve this issue.

Jianhong.

ADD REPLYlink written 6 months ago by Ou, Jianhong1.1k

Dear Jianhong,

I have tried the following script (the bamQC command actually worked). I tried your suggestion using only a couple of chromosomes (e.g. 1 and 2) and it crashes at the a point a little bit further...

library(ATACseqQC)
library(ChIPpeakAnno)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(phastCons100way.UCSC.hg19)
library(MotifDb)

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

file.list <- as.list(list.files(".", pattern = "*.bam$"))

bamfile <-file.list[[1]]
bamfile.labels <- gsub(".bam", "", basename(bamfile))

outPath <- "splited"
dir.create(outPath)

bamQC(bamfile, outPath=NULL)

## estimate Library Complexity---------------------------------------------
estimateLibComplexity(readsDupFreq(bamfile))
## generate fragement size distribution
fragSize <- fragSizeDist(bamfile, bamfile.labels)

## ------------------------------------------------------------------------
## bamfile tags
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
## files will be output into outPath

## shift the bam file by the 5'ends
library(BSgenome.Hsapiens.UCSC.hg19)

seqlev = paste0("chr", c(1:2, "X", "Y"))
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")
gal <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE)
gal1 <- shiftGAlignmentsList(gal)#######<- Crashes here #################
shiftedBamfile <- file.path(outPath, "shifted.bam")
export(gal1, shiftedBamfile)

SessionInfo is the same than the previous but now I am using 96 GB of RAM

Any thoughts/Suggestions or update of the package ?

Thanks in advance for your kind help

Best

JL
 

ADD REPLYlink modified 5 months ago • written 5 months ago by José Luis Lavín10
0
gravatar for Julie Zhu
6 months ago by
Julie Zhu3.8k
United States
Julie Zhu3.8k wrote:
José, Thanks for the feedback! Jianhong, one of the developers will work with you on the coding side. For now, please try to generate the plots using a couple of chromosomes from your BAM files. FYI, we have tested the effects of sequencing depth on the patterns in various diagnostic plots using subsamples of the BAM files from a successful experiment. Our results show that the fragment size distribution and the aggregated signals around TSSs from the subsamples with as low as 2.6 million uniquely mapped reads exhibit similar patterns to that of the full dataset (more than 26 million uniquely mapped reads) and are easily distinguishable from that of the failed experiment. Best regards, Julie From: "José Luis Lavín [bioc]" <noreply@bioconductor.org> Reply-To: "reply+870ae661+code@bioconductor.org" <reply+870ae661+code@bioconductor.org> Date: Monday, February 12, 2018 at 11:10 AM To: "Zhu, Lihua (Julie)" <julie.zhu@umassmed.edu> Subject: [bioc] ATACseqQC memory overflow issue Activity on a post you are following on support.bioconductor.org<https: support.bioconductor.org=""> User José Luis Lavín<https: support.bioconductor.org="" u="" 5531=""/> wrote Question: ATACseqQC memory overflow issue<https: support.bioconductor.org="" p="" 105904=""/>: Dear All, I'm trying to analizesome ATAC-seq samples using the ATACseqQC package and I am unable to make it work correctly. The proccess always crashes when I reach the same part of the workflow shown in the vignette this function fragSize <- fragSizeDist(bamfile, bamfile.labels) After a while running a memory overflow happens and the proccess gets killed. I'm trying to analyze 100bp Illumina Paired-End samples aligned with Bowtie2. The de-duplicated Bamfiles size is around 3,5 GB and the memory overflow occurs at 84 GB. Is there any minimmum RAM requirement specifiedfor ATACseqQC that I overlooked? Or is it just that this particular package uses huge amounts of memory anyway? Does anybody know af an alternative way to carry out this kind of analysis (ATACseq) that render some quality Heatmap and coverage curve for nucleosome positions? Here is my code (up to the step it works): library(ATACseqQC) library(ChIPpeakAnno) library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(phastCons100way.UCSC.hg19) library(MotifDb) knitr::opts_chunk$set(warning=FALSE, message=FALSE) setwd("/path_to_filees/BAM_dedup/") file.list <- as.list(list.files(".", pattern = "*.bam$")) file.list bamfile <-file.list[[1]] bamfile.labels <- gsub(".bam", "", basename(bamfile)) bamQC(bamfile, outPath=NULL) ## generate fragement size distribution fragSize <- fragSizeDist(bamfile, bamfile.labels) And my Session Info: > sessionInfo() R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Scientific Linux 7.4 (Nitrogen) Matrix products: default BLAS: /opt/R/R-3.4.2/lib/libRblas.so LAPACK: /opt/R/R-3.4.2/lib/libRlapack.so locale: [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8 [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8 [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats4 parallel stats graphics grDevices utils [8] datasets methods base other attached packages: [1] MotifDb_1.20.0 [2] phastCons100way.UCSC.hg19_3.6.0 [3] GenomicScores_1.2.1 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 [5] GenomicFeatures_1.30.0 [6] AnnotationDbi_1.40.0 [7] Biobase_2.38.0 [8] BSgenome.Hsapiens.UCSC.hg19_1.4.0 [9] BSgenome_1.46.0 [10] rtracklayer_1.38.2 [11] ChIPpeakAnno_3.12.4 [12] VennDiagram_1.6.18 [13] futile.logger_1.4.3 [14] GenomicRanges_1.30.1 [15] GenomeInfoDb_1.14.0 [16] Biostrings_2.46.0 [17] XVector_0.18.0 [18] IRanges_2.12.0 [19] ATACseqQC_1.2.7 [20] S4Vectors_0.16.0 [21] BiocGenerics_0.24.0 loaded via a namespace (and not attached): [1] ProtGenerics_1.10.0 bitops_1.0-6 [3] matrixStats_0.52.2 bit64_0.9-7 [5] progress_1.1.2 httr_1.3.1 [7] tools_3.4.2 R6_2.2.2 [9] rGADEM_2.26.0 splitstackshape_1.4.2 [11] seqLogo_1.44.0 DBI_0.7 [13] lazyeval_0.2.1 colorspace_1.3-2 [15] ade4_1.7-10 motifStack_1.22.0 [17] prettyunits_1.0.2 RMySQL_0.10.13 [19] bit_1.1-12 curl_3.1 [21] compiler_3.4.2 graph_1.56.0 [23] DelayedArray_0.4.1 grImport_0.9-0 [25] scales_0.5.0 randomForest_4.6-12 [27] RBGL_1.54.0 stringr_1.2.0 [29] digest_0.6.12 Rsamtools_1.30.0 [31] pkgconfig_2.0.1 htmltools_0.3.6 [33] ensembldb_2.2.0 limma_3.34.0 [35] regioneR_1.10.0 htmlwidgets_0.9 [37] rlang_0.1.4 RSQLite_2.0 [39] BiocInstaller_1.28.0 shiny_1.0.5 [41] BiocParallel_1.12.0 RCurl_1.95-4.8 [43] magrittr_1.5 GO.db_3.5.0 [45] GenomeInfoDbData_1.0.0 Matrix_1.2-11 [47] Rcpp_0.12.13 munsell_0.4.3 [49] stringi_1.1.6 yaml_2.1.16 [51] MASS_7.3-47 SummarizedExperiment_1.8.1 [53] zlibbioc_1.24.0 plyr_1.8.4 [55] AnnotationHub_2.10.1 blob_1.1.0 [57] lattice_0.20-35 splines_3.4.2 [59] multtest_2.34.0 MotIV_1.34.0 [61] seqinr_3.4-5 biomaRt_2.34.1 [63] futile.options_1.0.0 XML_3.98-1.9 [65] data.table_1.10.4-3 lambda.r_1.2 [67] idr_1.2 httpuv_1.3.5 [69] assertthat_0.2.0 mime_0.5 [71] xtable_1.8-2 AnnotationFilter_1.2.0 [73] survival_2.41-3 tibble_1.3.4 [75] GenomicAlignments_1.14.1 memoise_1.1.0 [77] interactiveDisplayBase_1.16.0 Thanks in advance! JL Post tags: atac-seq, software error You may reply via email or visit ATACseqQC memory overflow issue
ADD COMMENTlink written 6 months ago by Julie Zhu3.8k
0
gravatar for Haibo.Liu
5 months ago by
Haibo.Liu0
Haibo.Liu0 wrote:

Please refer to my git hub scripts: https://github.com/seaYali/ATACseqQC/blob/master/ScriptsForATACseqQCCaseStudy.txt for your analysis. Currently, the bottle neck of this package is the BamQC.R function. I did this step using samtools instead for efficinecy. Importantly, for the purpose of QC, You don't need to use the full bam except for BAM QC. Using data from a few chromosomes will be good enough. Using Phastcons information to bin reads into mono-, di- or tri-nucleosome is not necessary, and increase the computing time significantly.

ADD COMMENTlink written 5 months ago by Haibo.Liu0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 287 users visited in the last hour