Rsamtools memory leak
mhills ▴ 10
@mhills-9033
Last seen 8.5 years ago
Canada

Hi there,

I've just updated Rsamtools to the latest version ("1.22.0") and I've noticed a weird memory leak that is crashing my program.

I have a folder of ~150 BAM files that I'm trying to process sequentially, feeding a data.frame called filter into the which= argument of scanBam(); it splits the genome into 7744 genomic regions.

> head(filter)
        V1       V2       V3                         V4
1 GL896898        0 30180132        GL896898:0-30180132
2 GL896898 30180133 52375790 GL896898:30180133-52375790
3 GL896899        0 24984016        GL896899:0-24984016
4 GL896899 24984017 40953715 GL896899:24984017-40953715
5 GL896905        0 18438707        GL896905:0-18438707
6 GL896905 18438708 27910907 GL896905:18438708-27910907
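
For context, here's a minimal sketch of how a table like this becomes the which= regions handed to scanBam() (the full loop further down does the same thing inline):

library(Rsamtools)   # attaches GenomicRanges/IRanges, so GRanges()/IRanges() are available

## turn the BED-like table into a GRanges and wrap it in a ScanBamParam
gr <- GRanges(seqnames = as.character(filter[, 1]),
              ranges   = IRanges(filter[, 2], filter[, 3]))
param <- ScanBamParam(which = gr, mapqFilter = 10,
                      what  = c("rname", "pos", "strand"))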

For some reason, Rsamtools is not releasing RAM after each iteration. I assumed it was my code, but I just ran the following simple test, and unless I'm being completely stupid, Rsamtools is doing something funny. To test this I looped over a single BAM file multiple times with the following script and monitored RAM usage:

#R implementation of top for system RAM
library(NCmisc)
#Finds total RAM for all objects stored in Rsession
library(pryr)
library(Rsamtools)

filter=read.table('ferret_merged_sce_events.bed')
objectRam=vector()
totalRam=vector()

for(seq in 1:10)
{
     message(paste("iteration", seq))

     #Run scanBam to read in bam file
     bam.dataframe <- scanBam('bamFile.bam', param=ScanBamParam(which=GRanges(seqnames = c(as.character(filter[,1])), ranges = IRanges(c(filter[,2]), c(filter[,3]) )), mapqFilter=10, what=c("rname","pos","strand")))

     #measure RAM across all R objects
     objectRam <- c(objectRam, mem_used()[1]/10^6)
     #measure RAM across system
     totalRam <- c(totalRam,  suppressWarnings(top(CPU=F)$RAM$used*10^6))

     #remove scanBam object to reset for next iteration
     rm(bam.dataframe)

     #run garbage collection
     gc()

}
plot(objectRam, type='l', xlab='iteration', ylab='Ram usage (Mb)')
plot(totalRam, type='l', xlab='iteration', ylab='Ram usage (Mb)')

This generates the two plots below. You'll see that while the RAM used by the R objects remains stable, the overall RAM usage of my computer increases, and only drops when I end my R session. Is there any way of purging the RAM? Is this a bug in Rsamtools, or have I done something wrong?

The amount of memory used for all the objects in my R session is constant:

[plot of objectRam: flat across iterations]

...but the total system usage of RAM increases with each iteration:

[plot of totalRam: increasing with each iteration]

I don't think this happened in earlier versions of Rsamtools, as I'd previously analyzed these data without a problem. I added the rm() and gc() lines in the hope they would fix the problem, but they don't seem to change anything. I tried running this without the which= option in ScanBamParam() and the memory retention issue doesn't appear, so it seems specific to scanBam() when which= is specified (a sketch of that comparison is below). Any help would be greatly appreciated!
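
For reference, the no-which= variant I used for that comparison looks roughly like this (same placeholder file name as above):

param_no_which <- ScanBamParam(mapqFilter=10, what=c("rname","pos","strand"))
bam.dataframe  <- scanBam('bamFile.bam', param=param_no_which)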

 

 

rsamtools scanbam bam granges iranges

Just to be clear: this doesn't happen with BioC 3.1, right? I'm asking because I'm also trying to diagnose a BioC 3.2-specific problem in some code that relies heavily on scanBam (and the which= option). It's proving quite difficult to pin down the cause, which would be consistent with memory problems.


Hi Aaron,

I'm not sure; I updated Bioconductor and Rsamtools at the same time. I'm on Bioconductor v3.2 with R v3.2.2 and Rsamtools v1.22.0. My code was working fine on Bioconductor v3.1 with Rsamtools v1.18, but after updating it poops out with a memory usage error after looping through ~40 samples. I haven't reverted to earlier versions yet because I really wanted the mapqFilter option, which wasn't available in Rsamtools v1.18 but massively speeds up my code, since I no longer need to convert the result of scanBam() to a data.frame to filter on mapping quality (sketched below). I wrote the simple loop script to convince myself that the problem wasn't my code, but something strange happening inside Rsamtools.
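
Roughly, the old post-hoc filtering I mean looked like this (a sketch; 'bamFile.bam' is a placeholder):

## request mapq as well, then filter after the fact via a data.frame
param_old <- ScanBamParam(what=c("rname", "pos", "strand", "mapq"))
bam.df <- as.data.frame(scanBam('bamFile.bam', param=param_old)[[1]])
bam.df <- bam.df[!is.na(bam.df$mapq) & bam.df$mapq >= 10, ]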


Hmm... your problem might be related to mine. For anyone who's interested, my issue is that the ChIP-seq DB workflow runs extremely slowly in BioC 3.2, while being fine in BioC 3.1. Specifically, the second set of align calls from Rsubread takes a very long time (~18 hours per file, which is close to a 5- to 10-fold slowdown compared to BioC 3.1). Running those align calls separately doesn't result in any problems; it's only when they're in the workflow (with lots of preceding calls to scanBam by various csaw functions) that this drop in speed occurs. My uneducated guess is that, if memory is not being released by Rsamtools, then Rsubread is slowing down to operate in what it perceives to be a low-memory environment. Oh - this probably doesn't help you too much, though; apologies for going off-topic.


Could well be related. Have you monitored RAM? Our lab is pretty small fry, so I'm only on a computer with 16 GB of RAM to do all the analysis. After getting the initial error, I monitored it from a separate terminal running top -p `pgrep R | tr "\n" "," | sed 's/,$//'`, and just watched my memory slowly being sucked away. No problem about going off topic; it sounds like this could be related. I guess if you saved the product of Rsamtools, then restarted R, loaded the saved object and continued, you could see whether your downstream workflow functions are still slow.


RAM doesn't seem to be the problem; I'm running the workflow on a server with plenty of memory, and usage is around 10 GB, which is a small proportion of the total. Rather, the concern is more about whether the memory problem is interfering with allocations later on, i.e., in Rsubread. An additional distinction between our problems is that the Rsubread allocations are done with malloc/free rather than going through R's API.


Can you simplify your example a little by creating the ScanBamParam once, outside the loop, and not saving the result to bam.dataframe? Is it possible to reproduce the phenomenon on publicly available BAM files, e.g., in RNAseqData.HNRNPC.bam.chr14? (An emerging pet peeve -- message("foo ", "bar") is the same as message(paste("foo", "bar")) ;)

mhills ▴ 10
@mhills-9033
Last seen 8.5 years ago
Canada

EDIT:

I'm adding this to the main post because it would probably look choppy in the comments. I changed things around as recommended by Martin Morgan, and for some reason I only get an increase in system memory if param is specified within the loop, not if it's specified outside of it.

What I was doing was the following, which is eating my system memory:

library(NCmisc)
library(pryr)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)
options(scipen=20)
#make equivalent table of 10,000 regions in data
filter=data.frame(chr='chr14', start=seq(0, 105000000, 10000), end=seq(10000, 105010000, 10000))
objectRam=vector()
totalRam=vector()
for(fileName in RNAseqData.HNRNPC.bam.chr14_BAMFILES)
{
    bam.dataframe <- scanBam(fileName, param=ScanBamParam(which=GRanges(seqnames=c(as.character(filter[,1])), ranges=IRanges(c(filter[,2]), c(filter[,3]) )), mapqFilter=10, what=c("rname","pos","strand")))
    objectRam <- c(objectRam, mem_used()[1]/10^6)
    totalRam <- c(totalRam,  suppressWarnings(top(CPU=F)$RAM$used*10^6))
    rm(bam.dataframe)
    gc()
}
plot(objectRam, type='l', xlab='iteration', ylab='Ram usage (Mb)')
plot(totalRam, type='l', xlab='iteration', ylab='Ram usage (Mb)')

but if I just take param out of the loop, my memory issues seem to go away:

library(NCmisc)
library(pryr)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)

options(scipen=20)
#make equivalent table of 10,000 regions in data
filter=data.frame(chr='chr14', start=seq(0, 105000000, 10000), end=seq(10000, 105010000, 10000))
objectRam=vector()
totalRam=vector()

parameters=ScanBamParam(which=GRanges(seqnames=c(as.character(filter[,1])), ranges=IRanges(c(filter[,2]), c(filter[,3]) )), mapqFilter=10, what=c("rname","pos","strand"))

for(fileName in RNAseqData.HNRNPC.bam.chr14_BAMFILES)
{
    bam.dataframe <- scanBam(fileName, param=parameters)
    objectRam <- c(objectRam, mem_used()[1]/10^6)
    totalRam <- c(totalRam,  suppressWarnings(top(CPU=F)$RAM$used*10^6))
    rm(bam.dataframe)
    gc()
}

plot(objectRam, type='l', xlab='iteration', ylab='Ram usage (Mb)')
plot(totalRam, type='l', xlab='iteration', ylab='Ram usage (Mb)')

I'm not sure why this is, but it seems to be working much better, so I'll try it on my actual data.  Thanks so much!

PS: I didn't know message() concatenated its arguments like paste(). Thanks! That'll clean things up!


I should have asked for sessionInfo(), and confirmed that you were running this in the console rather than, e.g., RStudio, to eliminate other possible problems. I don't actually understand why memory consumption would or would not be excessive depending on whether param was defined inside or outside the loop, and I was not able to see any problem (including under valgrind, which is used to detect memory corruption problems) using the HNRNPC data.

(Unrelated) If your regions span the genome (see also GenomicRanges::tileGenome()), then it's often more efficient to iterate through the file in chunks (using BamFile(..., yieldSize=1000000), for instance), doing operations on each chunk, provided you don't need to have all reads from a tile in memory at the same time.
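
A minimal sketch of the tileGenome() tiling just mentioned, taking chromosome lengths from the BAM header (an assumption; the full chunk-wise example follows below):

library(Rsamtools)
library(GenomicRanges)
library(RNAseqData.HNRNPC.bam.chr14)

## 10 kb tiles over every sequence declared in the BAM header
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1])
tiles <- tileGenome(seqlengths(seqinfo(bf)), tilewidth=10000,
                    cut.last.tile.in.chrom=TRUE)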

library(RNAseqData.HNRNPC.bam.chr14)
library(Rsamtools)
library(GenomicFiles)

YIELD = function(x, ...) {
    param <- ScanBamParam(mapqFilter=10, what=c("rname","pos","strand"))
    scanBam(x, param=param)[[1]]
}

MAP = function(x, ..., grfilter) {
    ## 'x' is the return value from YIELD
    grinput <- with(x, GRanges(rname, IRanges(pos, width=1), strand))
    countOverlaps(grfilter, grinput)
}

## REDUCE = `+` is the default, and appropriate -- add successive
## outputs of MAP

DONE = function(x, ...) {
    ## what return value from 'YIELD' indicates that we're done?
    length(x[["rname"]]) == 0L
}

filter=data.frame(chr='chr14',
  start=seq(0, 105000000, 10000),
  end=seq(10000, 105010000, 10000))
grfilter <- makeGRangesFromDataFrame(filter)

## yieldSize == 1000000 more realistic
bf = BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1], yieldSize=100000)
result <- reduceByYield(bf, YIELD, MAP, DONE=DONE, grfilter=grfilter)

There are tricky and subtle issues with both approaches. For instance, the above counts the left-most coordinate ("pos") of each read rather than, say, its start (which would be easy to calculate using GenomicAlignments::readGAlignments() and start(), for instance). On the other hand, when iterating through genomic ranges, a read that spans two ranges will overlap each of them, will be returned by the scanBam query on each range, and so will be counted twice.
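
For illustration, a sketch of the readGAlignments()/start() route mentioned above, interpreted here as counting the strand-aware 5' start of each alignment (that interpretation, and the helper names, are assumptions):

library(GenomicAlignments)

## yield GAlignments chunks instead of scanBam() lists
YIELD_GAL = function(x, ...) {
    readGAlignments(x, param=ScanBamParam(mapqFilter=10))
}

## count strand-aware alignment starts per tile; fix="start" anchors at the
## 5' end on both strands
MAP_GAL = function(x, ..., grfilter) {
    countOverlaps(grfilter, resize(granges(x), width=1, fix="start"))
}

DONE_GAL = function(x, ...) length(x) == 0L

## result <- reduceByYield(bf, YIELD_GAL, MAP_GAL, DONE=DONE_GAL, grfilter=grfilter)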


Thanks Martin!

I should have mentioned I've been running this all through the terminal, not RStudio.

Thanks for the additional insight with the GRanges approach. I confess I don't come from a computational background, but have used GenomicRanges a bit, so will give this a go. 

Thanks again!


When I reran my validation this morning, I hadn't realized it was on a computer with an older version of Rsamtools installed. I just tried again on the computer with Rsamtools 1.22.0 and I'm still seeing the memory leak, including when I run the above loop with RNAseqData.HNRNPC.bam.chr14. Would someone be willing to test this, to see whether it's something about my computer's setup or something else?

I am running this through a terminal (not RStudio).

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] RNAseqData.HNRNPC.bam.chr14_0.8.0 Rsamtools_1.22.0                 
 [3] Biostrings_2.38.0                 XVector_0.10.0                   
 [5] GenomicRanges_1.22.0              GenomeInfoDb_1.6.0               
 [7] IRanges_2.4.0                     S4Vectors_0.8.0                  
 [9] BiocGenerics_0.16.0               pryr_0.1.2                       
[11] NCmisc_1.1.4                     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1          codetools_0.2-14     bitops_1.0-6        
 [4] futile.options_1.0.0 magrittr_1.5         stringi_0.5-5       
 [7] zlibbioc_1.16.0      proftools_0.1-0      futile.logger_1.4.1
[10] lambda.r_1.1.7       BiocParallel_1.4.0   tools_3.2.2         
[13] stringr_1.0.0       

 

 


I did rerun your code on a similar system (nothing is ever identical...) without problems. If we could make this reproducible then I could certainly address it. I see that there are a lot of packages loaded / attached in your session info, and it's very important when tracking down bugs to eliminate these variables; so no pryr (which I think is unused) and, if possible, no NCmisc. If others can reproduce the problem, that certainly helps to triangulate the issue.
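
For what it's worth, a minimal sketch of taking the same two measurements with base R only (Linux/macOS-specific, since it shells out to ps):

## R object memory, in Mb: sum of the Ncells and Vcells "(Mb)" column of gc()
objMb <- sum(gc()[, 2])
## resident set size of this R process, in kB, as reported by ps
rssKb <- as.numeric(system(paste("ps -o rss= -p", Sys.getpid()), intern=TRUE))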


Here's my output of the first code chunk that mhills posted (i.e., where param is constructed inside the loop across RNAseqData.HNRNPC.bam.chr14_BAMFILES).

> objectRam
[1] 134.6744 135.9326 135.7642 134.6899 134.3710 135.1653 135.1248 134.7406
> totalRam
[1] 14975584 14998160 14997404 15000580 15005976 15010472 15013668 15018276

So it seems like totalRam is generally increasing on my system, though not by much; these units appear to be kilobytes, since top() reports usage in gigabytes and the code multiplies by 10^6. For completeness, here's the output of the second code chunk:

> objectRam
[1] 134.2849 135.5431 135.3747 134.3004 133.9815 134.7759 134.7353 134.3511
> totalRam
[1] 14975788 14994824 14997948 14999552 15004868 15010988 15014192 15017532

So, another mild increase in total RAM usage. Here's the sessionInfo():

R version 3.2.2 Patched (2015-10-19 r69552)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.4 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] RNAseqData.HNRNPC.bam.chr14_0.8.0 Rsamtools_1.22.0                 
 [3] Biostrings_2.38.0                 XVector_0.10.0                   
 [5] GenomicRanges_1.22.0              GenomeInfoDb_1.6.0               
 [7] IRanges_2.4.1                     S4Vectors_0.8.0                  
 [9] BiocGenerics_0.16.0               pryr_0.1.2                       
[11] NCmisc_1.1.4                     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1          codetools_0.2-14     bitops_1.0-6        
 [4] futile.options_1.0.0 magrittr_1.5         stringi_1.0-1       
 [7] zlibbioc_1.16.0      proftools_0.1-0      futile.logger_1.4.1
[10] lambda.r_1.1.7       BiocParallel_1.4.0   tools_3.2.2         
[13] stringr_1.0.0      

I've also run this on an older R installation with Bioc 3.1, and I get increases of similar size in totalRam.


Thanks Aaron,

I think one difference is that I'm loading more and bigger BAM files than the RNAseqData.HNRNPC.bam.chr14 dataset. I edited the code above to add a second, outer loop, to better approximate the ~150 samples I'm dealing with:

for(loop in 1:20)
{
  message('loop ', loop)

  for(fileName in RNAseqData.HNRNPC.bam.chr14_BAMFILES)
  {
       bam.dataframe <- scanBam(fileName, param=parameters)
       objectRam <- c(objectRam, mem_used()[1]/10^6)
       totalRam <- c(totalRam,  suppressWarnings(top(CPU=F)$RAM$used*10^6))
       rm(bam.dataframe)
       gc()
  }

}

I definitely still see increasing system RAM:

> totalRam
  [1] 2823.980 2985.480 3233.188 3486.524 3507.880 3546.328 3595.396 3631.600
  [9] 3638.004 3653.920 3728.920 3755.236 3825.108 3913.516 3922.632 3919.504
 [17] 3928.452 3936.132 3934.684 3941.528 3957.036 3949.744 3967.220 3970.708
 [25] 3973.048 3987.000 3990.212 3965.288 3950.252 3955.216 3967.356 3969.732
 [33] 3975.144 3985.236 3969.548 3984.564 3990.212 3996.632 4007.584 4024.676
 [41] 4000.856 4017.112 4008.488 4018.584 4041.712 4045.828 4053.244 4063.240
 [49] 4066.296 4026.908 4019.560 4001.384 4009.300 4018.300 4035.736 4041.556
 [57] 4054.744 4056.908 4026.272 4048.200 4036.396 4025.576 4040.564 4048.736
 [65] 4051.424 4059.716 4064.600 4046.948 4055.552 4065.016 4076.840 4049.136
 [73] 4052.700 4059.380 4058.504 4067.372 4056.984 4053.052 4057.316 4066.280
 [81] 4073.544 4059.732 4060.984 4064.260 4071.856 4086.328 4092.296 4093.240
 [89] 4085.472 4092.020 4104.120 4121.356 4085.852 4075.100 4098.528 4103.700
 [97] 4106.352 4093.212 4109.444 4118.600 4144.288 4110.512 4095.084 4098.044
[105] 4108.876 4098.652 4108.200 4113.100 4121.928 4132.804 4139.252 4146.864
[113] 4144.632 4154.748 4160.892 4141.660 4161.636 4137.512 4135.004 4167.308
[121] 4325.988 4326.716 4340.532 4286.928 4328.560 4363.356 4379.056 4375.420
[129] 4378.736 4373.388 4385.628 4385.216 4375.616 4380.408 4392.852 4396.356
[137] 4424.744 4506.740 4479.452 4451.748 4438.584 4454.300 4466.628 4467.072
[145] 4469.100 4474.180 4478.104 4490.460 4487.156 4503.680 4497.076 4499.412
[153] 4513.456 4512.876 4513.160 4507.360 4507.664 4512.308 4519.280 4582.860

Martin: I did this in a fresh R session, so I assumed the other packages listed were dependencies. The only two additional packages I loaded were NCmisc, which I'm using to call "top" from within R and get an accurate measure of system RAM, and pryr, whose mem_used() function gives me the RAM used by R objects.

Perhaps this is just a problem when you have big BAM files, a lot of them, a large which= selection, and a limited amount of system memory.
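
If that's the case, one possible workaround (a sketch, assuming memory really is only returned to the OS when the R process exits) is to fork a short-lived worker per file with parallel::mclapply(), so each file's memory is freed when its worker exits. Unix-only; the folder name is a placeholder and 'parameters' is the ScanBamParam defined earlier:

library(parallel)
library(Rsamtools)

processOneBam = function(fileName, parameters) {
    bam <- scanBam(fileName, param=parameters)
    ## return only a small summary (reads per range) so the full result
    ## never accumulates in the parent process
    vapply(bam, function(rng) length(rng[["pos"]]), integer(1))
}

bamFiles <- list.files('bam_folder', pattern='\\.bam$', full.names=TRUE)
## mc.preschedule=FALSE forks a fresh worker for each file
results <- mclapply(bamFiles, processOneBam, parameters=parameters,
                    mc.cores=1, mc.preschedule=FALSE)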

 

 

