Question

memory leak when subsampling BAM

1

Entering edit mode

robmaz77 ▴ 20

@dc2bd3d5

Last seen 3.1 years ago

I am trying to sample records from unmapped BAM files (where first and second read of a pair come in subsequent lines) to get output like form FastqSampler. Here is the naive implementation that just slurps in the entire BAM and then takes a sample:

sampleBam <- function(fn,sampleSize,seed) {
  set.seed(seed)
  par <- ScanBamParam(what=c("seq","qual","qname"),
                      tag=c("BC","B2"),
                      reverseComplement=TRUE)
  yield_ <- function(b) {
    df <- as(scanBam(b,param=par),"DataFrame")
    if (nrow(df)==0L) return(DataFrame())
    rv <- 2L*(1L:(nrow(df)%/%2L))
    fw <- rv-1L
    df2 <- df[fw,]
    df2$seq2 <- as(df[rv,"seq"],"DNAStringSet")
    df2$qual2 <- as(df[rv,"qual"],"PhredQuality")
    return(df2)
  }
  b <- BamFile(fn)
  open(b,"rb")
  x <- yield_(b)
  close(b)
  x <- x[sample(nrow(x),sampleSize),]
  # return as list of two ShortReadQ object
  barcode <- (if (!is.null(x$tag.B2)) paste0(x$tag.BC,"+",x$tag.B2)
              else x$tag.BC)
  list(
    Fw=ShortReadQ(id=as(paste0(x$qname,"#",barcode,"/1"),"BStringSet"),
                  sread=x$seq,
                  quality=FastqQuality(x$qual)),
    Rv=ShortReadQ(id=as(paste0(x$qname,"#",barcode,"/2"),"BStringSet"),
                  sread=x$seq2,
                  quality=FastqQuality(x$qual2))
  )
}

This achieves the desired result, but with a BAM of ~530 mio. reads takes >300 Gb of memory and ~1 h to take the sample, clearly not viable for production use. The normal way to do this would be to read the BAM in chunks:

sampleBamInChunks <- function(fn,sampleSize,seed,yieldSize=1e6) {
  yieldSize <- 2L*as.integer(yieldSize/2)
  stopifnot(sampleSize<=yieldSize/2)
  set.seed(seed)
  par <- ScanBamParam(what=c("seq","qual","qname"),
                      tag=c("BC","B2"),
                      reverseComplement=TRUE)
  yield_ <- function(b) {
    df <- as(scanBam(b,param=par),"DataFrame")
    if (nrow(df)==0L) return(DataFrame())
    rv <- 2L*(1L:(nrow(df)%/%2L))
    fw <- rv-1L
    df2 <- df[fw,]
    df2$seq2 <- as(df[rv,"seq"],"DNAStringSet")
    df2$qual2 <- as(df[rv,"qual"],"PhredQuality")
    return(df2)
  }
# this is just copied from REDUCEsampler and adapted for DataFrame yields:
  mksampler_ <- function() {
    tot <- 0L
    function(x,y,...) {
      if (tot==0L) {
        tot <<- nrow(x)
        if (nrow(x)>sampleSize) {
          x <- x[sample(nrow(x),sampleSize),]
        }
      }
      if (nrow(y)>0L) {
        yld_n <- nrow(y)
        tot <<- tot+yld_n
        keep <- rbinom(1L,min(sampleSize,yld_n),yld_n/tot)
        i <- sample(sampleSize,keep)
        j <- sample(yld_n,keep)
        x[i,] <- y[j,]
      }
      return(x)
    }
  }
  reduce_ <- mksampler_()
  b <- BamFile(fn,yieldSize=yieldSize)
  open(b,"rb")
  x <- reduceByYield(b,YIELD=yield_,REDUCE=reduce_)
  close(b)
  # return as list of two ShortReadQ object
  barcode <- (if (!is.null(x$tag.B2)) paste0(x$tag.BC,"+",x$tag.B2)
              else x$tag.BC)
  list(
    Fw=ShortReadQ(id=as(paste0(x$qname,"#",barcode,"/1"),"BStringSet"),
                     sread=x$seq,
                     quality=FastqQuality(x$qual)),
    Rv=ShortReadQ(id=as(paste0(x$qname,"#",barcode,"/2"),"BStringSet"),
                     sread=x$seq2,
                     quality=FastqQuality(x$qual2))
  )
}

This takes only about 25 min. but still >100 Gb. Why is it using so much memory also in this case? Avoiding the high memory usage was the whole point, but there appears to be some sort of memory leak. Does anybody know what might be going awry here?

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=C            
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ShortRead_1.48.0            GenomicAlignments_1.26.0   
 [3] GenomicFiles_1.26.0         rtracklayer_1.50.0         
 [5] BiocParallel_1.24.1         SummarizedExperiment_1.20.0
 [7] Biobase_2.50.0              MatrixGenerics_1.2.1       
 [9] matrixStats_0.58.0          Rsamtools_2.6.0            
[11] Biostrings_2.58.0           XVector_0.30.0             
[13] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
[15] IRanges_2.24.1              S4Vectors_0.28.1           
[17] BiocGenerics_0.36.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6               lattice_0.20-41          png_0.1-7               
 [4] prettyunits_1.1.1        assertthat_0.2.1         BiocFileCache_1.14.0    
 [7] R6_2.5.0                 RSQLite_2.2.3            httr_1.4.2              
[10] pillar_1.4.7             zlibbioc_1.36.0          rlang_0.4.10            
[13] GenomicFeatures_1.42.1   progress_1.2.2           curl_4.3                
[16] blob_1.2.1               Matrix_1.2-18            stringr_1.4.0           
[19] RCurl_1.98-1.2           bit_4.0.4                biomaRt_2.46.3          
[22] DelayedArray_0.16.1      compiler_4.0.3           pkgconfig_2.0.3         
[25] askpass_1.1              openssl_1.4.3            tidyselect_1.1.0        
[28] tibble_3.0.6             GenomeInfoDbData_1.2.4   XML_3.99-0.5            
[31] crayon_1.4.1             dplyr_1.0.4              dbplyr_2.1.0            
[34] bitops_1.0-6             rappdirs_0.3.3           grid_4.0.3              
[37] lifecycle_1.0.0          DBI_1.1.1                magrittr_2.0.1          
[40] stringi_1.5.3            cachem_1.0.4             hwriter_1.3.2           
[43] latticeExtra_0.6-29      xml2_1.3.2               ellipsis_0.3.1          
[46] vctrs_0.3.6              generics_0.1.0           RColorBrewer_1.1-2      
[49] tools_4.0.3              bit64_4.0.5              BSgenome_1.58.0         
[52] glue_1.4.2               purrr_0.3.4              jpeg_0.1-8.1            
[55] hms_1.0.0                fastmap_1.1.0            AnnotationDbi_1.52.0    
[58] memoise_2.0.0            VariantAnnotation_1.36.0

GenomicFiles ShortRead Rsamtools • 1.2k views

ADD COMMENT • link 3.2 years ago robmaz77 ▴ 20

0

Entering edit mode

There is this old report, which sounds a bit similar: Rsamtools memory leak

ADD REPLY • link 3.2 years ago robmaz77 ▴ 20

0

Entering edit mode

Looks like the culprit may be these S4 object deferred evaluation shenanigans, the reduce procedure was maybe dragging along the entire BAM in some hidden environments? Anyway, using basic data.frame instead of DataFrame in the yield_ function, which also converts the seq(2) and qual(2) BStringSets into character vectors, removes the memory leak and brings memory usage down to a stable <3.5 Gb instead of rising to >100 Gb. Sadly, it also increases runtime from 25 min to 45 min, probably becauase it is now juggling character vectors instead of pointers, so I am not sure that counts as a final solution. Is there no way to strip the hidden stuff from the S4 objects and force them to a pristine state?

ADD REPLY • link 3.2 years ago robmaz77 ▴ 20

1

Entering edit mode

Does it help to call compact() on the sampled object? I think DNAStringSet is creating an index into the entire sequence, instead of creating a new sequence with just the sampled nucleotides.

ADD REPLY • link 3.2 years ago Martin Morgan 25k

0

Entering edit mode

Yes, that seems to do the trick. At least returning compact(x) in the reducer instead of x has plugged the memory leak, seems now to be hovering around 3 Gb. I'll check the runtime tomorrow and report. Many thanks!

ADD REPLY • link 3.2 years ago robmaz77 ▴ 20

0

Entering edit mode

Ok, so my final comment is that runtime went up by a couple of minutes, even when using check=FALSE with compact. But the memory leak is gone.

By the way, there is a little pitfall with REDUCEsampler when the file size is between sampleSize and yieldSize. If only one yield can be taken, the reducer is never executed and no sample is therefore taken. The solution could be to manually sample the first yield and use it as the init= argument in reduceByYield, but then REDUCEsampler would need a tot= argument so that one can pass in the size of that first yield.

ADD REPLY • link 3.2 years ago robmaz77 ▴ 20

0

Entering edit mode

FASTQ sampling with

reduceByYield(FastStreamer(fn,yieldSize),YIELD=yield,REDUCE=REDUCEsampler(sampleSize))

seems to have the same problem, btw.

ADD REPLY • link 3.2 years ago robmaz77 ▴ 20