Question

ATACseqQC shiftGAlignmentsList() problem

0

Entering edit mode

jfmay • 0

@518122bc

Last seen 4 months ago

Canada

I am trying to follow the tutorial found here . The tutorial uses a subset of the sample data (chromosome 1) to make the demo run faster, but I am trying to shift my bam file for the whole mouse genome (so I changed "seqlev" from the tutorial code to "mmstdchr" when I applied it to my own). Here is my code:

library(ATACseqQC)
library(BSgenome.Mmusculus.UCSC.mm10)
mouse <- BSgenome.Mmusculus.UCSC.mm10 # doing this subsets the mouse genome assembly to only include "standard" sequences and not alternates
mmstdchr <- standardChromosomes(mouse)

tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
outPath <- "splited"
dir.create(outPath)

bamfile <- "D:/All open chromatin bam files for counting/Sorted_3-AL1_filtered.bam"
bamfile.labels <- gsub(".bam", "", basename(bamfile))
which <- as(seqinfo(Mmusculus)
            [mmstdchr]
            , "GRanges")
gal <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE)
gal1 <- shiftGAlignmentsList(gal)

The issue is that when the shiftGAlignmentsList(gal) completes, gal1 is generated into a large GAlignments object 54 GB in size, whereas the bam file I'm shifting (Sorted_3-AL1_filtered.bam) is only 4.6 GB. This seems very wrong. And I can't even export gal1 because it's too large. Can anyone tell me what I'm doing wrong, and how to properly generate a shift bam file?

ATACseqQC ATACSeq • 641 views

ADD COMMENT • link updated 8 weeks ago by danalvoronov ▴ 10 • written 5 months ago by jfmay • 0

0

Entering edit mode

Hi,

Thank you for trying ATACseqQC. Could you let me know the version of ATACseqQC you are using? And did you try to set bigFile=TRUE when you call readBamFile?

Jianhong.

ADD REPLY • link 4 months ago Ou, Jianhong ★ 1.3k

0

Entering edit mode

Thanks for your response. The version of ATACseqQC I'm using is 1.26.0. I tried to set bigFile=TRUE. The resulting object gal1 still ends up the same huge size. Here is what happens when I try to export this file:

export(gal1, "3-AL1_Tn5_shift.bam")

Error: cannot allocate vector of size 987.8 Mb

I'm not sure I fully understand, but I think R is running out of memory at this stage?

ADD REPLY • link 4 months ago jfmay • 0

0

Entering edit mode

try: gal1 <- shiftGAlignmentsList(gal, outbam='3-AL1_Tn5_shift.bam'). I will try to figure out why the memory is so high.

ADD REPLY • link 4 months ago Ou, Jianhong ★ 1.3k

score 0 · Answer 1 · 2023-12-12

0

Entering edit mode

Ou, Jianhong ★ 1.3k

@ou-jianhong-4539

Last seen 9 days ago

United States

Hi jfmay,

I updated the development version to fix the issue that caused by readGAlignmentsList does not support read by chunk. This updates will decrease the memory cost to about 10Gb for a 5Gb bam file (not fully tested) for single session. If you want to process the file with multiple process, the memory cost will be about 8Gb*(#process+1). You can set a smaller slidingWindowSize to decrease the memory cost. The current process still have lot of space to improve for how to handle the seq and qual in the meta data of GAlignments. But it will take time. I will continues work on that. To install the development version of ATACseqQC, you may want to try: BiocManager::install("jianhong/ATACseqQC")

Let me know if you still have any question.

Jianhong.

ADD COMMENT • link 4 months ago Ou, Jianhong ★ 1.3k

0

Entering edit mode

Hi Jianhong,

I have a similar issue with shiftGAlignmentsList(), I am loading a downsampled bam file, which is 9.5 Mb itself like this:

seqinformation <- seqinfo(TxDb.genome3.own)
bamf <- "downsampled_nodup_paired_test.bam"
seqlev <- seqinformation@seqnames
which <- as(seqinformation[seqlev], "GRanges")
bamal <- readBamFile(bamf, which = which, asMates=TRUE, bigFile=TRUE)
bamsal <- shiftGAlignmentsList(bamal)

The weird part is that this 9.5 Mb file takes ages to shift and it the end uses multiple Gbs of RAM (went up to 7 Gb then dropped to 4.4 Gb then to 9Gb over more than half an hour), then gives an error cannot allocate vector of size 8.0 Gb. I know that my machine does not have much RAM (16 Gb), but it should be more than enough for a 9.5 Mb file. This RAM usage for such a small file size seems wrong. I use ATACseqQC version 1.27.1.

Changing the readBamFile() bigFile= to bigFile=FALSE, does not help. It runs out of memory right at this step.

What could be the reason? How to load such a file even if it is rather small? Could it be due to a high number of scaffolds in the bam file and TxDb object? I am wondering because when using only one scaffold (e.g. seqlev <- "Scaffold2") the code works and does not overload R.

Thanks.

Best,

Dani

ADD REPLY • link 8 weeks ago danalvoronov ▴ 10

0

Entering edit mode

May I know the package Version of the the ATACseqQC? Do you mind to share the 9.5Mb file to me for debug?

Jianhong.

ADD REPLY • link 8 weeks ago Ou, Jianhong ★ 1.3k

0

Entering edit mode

Hi Jianhong,

Thank you. I have sent you an email (I found your email on https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html) with the 9.5 Mb file I mentioned. I am using ATACseqQC version 1.27.1. Also unfortunately, the part I said about using a single scaffold also does not work at the moment with the versioin 1.27.1 as stated in the other support forum post I posted.

Thanks.

Best, Dani

ADD REPLY • link 8 weeks ago danalvoronov ▴ 10