I am trying to sample records from unmapped BAM files (where the first and second read of a pair come on consecutive lines) to get output like that from ShortRead's FastqSampler. Here is the naive implementation that just slurps in the entire BAM and then takes a sample:
library(Rsamtools)    # ScanBamParam, scanBam, BamFile
library(ShortRead)    # ShortReadQ, FastqQuality; attaches Biostrings (DNAStringSet, PhredQuality)

sampleBam <- function(fn, sampleSize, seed) {
    set.seed(seed)
    par <- ScanBamParam(what=c("seq","qual","qname"),
                        tag=c("BC","B2"),
                        reverseComplement=TRUE)
    ## read all records and pair up consecutive rows (forward/reverse mates)
    yield_ <- function(b) {
        df <- as(scanBam(b, param=par), "DataFrame")
        if (nrow(df) == 0L) return(DataFrame())
        rv <- 2L * seq_len(nrow(df) %/% 2L)   # second mate of each pair
        fw <- rv - 1L                         # first mate
        df2 <- df[fw,]
        df2$seq2 <- as(df[rv, "seq"], "DNAStringSet")
        df2$qual2 <- as(df[rv, "qual"], "PhredQuality")
        return(df2)
    }
    b <- BamFile(fn)
    open(b, "rb")
    x <- yield_(b)
    close(b)
    x <- x[sample(nrow(x), sampleSize),]
    ## return as a list of two ShortReadQ objects
    barcode <- (if (!is.null(x$tag.B2)) paste0(x$tag.BC, "+", x$tag.B2)
                else x$tag.BC)
    list(
        Fw=ShortReadQ(id=as(paste0(x$qname, "#", barcode, "/1"), "BStringSet"),
                      sread=x$seq,
                      quality=FastqQuality(x$qual)),
        Rv=ShortReadQ(id=as(paste0(x$qname, "#", barcode, "/2"), "BStringSet"),
                      sread=x$seq2,
                      quality=FastqQuality(x$qual2))
    )
}
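For reference, it is then used like this (file names made up) and the two mates can be written out with ShortRead::writeFastq():

smp <- sampleBam("unmapped_pairs.bam", sampleSize=100000L, seed=123)
writeFastq(smp$Fw, "sample_R1.fastq.gz", mode="w", compress=TRUE)
writeFastq(smp$Rv, "sample_R2.fastq.gz", mode="w", compress=TRUE)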
This achieves the desired result, but with a BAM of ~530 million reads it takes >300 GB of memory and ~1 h to draw the sample, which is clearly not viable for production use. The usual way to handle this would be to read the BAM in chunks:
library(GenomicFiles)   # reduceByYield

sampleBamInChunks <- function(fn, sampleSize, seed, yieldSize=1e6) {
    yieldSize <- 2L * as.integer(yieldSize/2)   # keep mate pairs together
    stopifnot(sampleSize <= yieldSize/2)
    set.seed(seed)
    par <- ScanBamParam(what=c("seq","qual","qname"),
                        tag=c("BC","B2"),
                        reverseComplement=TRUE)
    yield_ <- function(b) {
        df <- as(scanBam(b, param=par), "DataFrame")
        if (nrow(df) == 0L) return(DataFrame())
        rv <- 2L * seq_len(nrow(df) %/% 2L)   # second mate of each pair
        fw <- rv - 1L                         # first mate
        df2 <- df[fw,]
        df2$seq2 <- as(df[rv, "seq"], "DNAStringSet")
        df2$qual2 <- as(df[rv, "qual"], "PhredQuality")
        return(df2)
    }
    ## reservoir sampler, copied from GenomicFiles::REDUCEsampler and
    ## adapted for DataFrame yields
    mksampler_ <- function() {
        tot <- 0L
        function(x, y, ...) {
            if (tot == 0L) {
                tot <<- nrow(x)
                if (nrow(x) > sampleSize) {
                    x <- x[sample(nrow(x), sampleSize),]
                }
            }
            if (nrow(y) > 0L) {
                yld_n <- nrow(y)
                tot <<- tot + yld_n
                keep <- rbinom(1L, min(sampleSize, yld_n), yld_n/tot)
                i <- sample(sampleSize, keep)
                j <- sample(yld_n, keep)
                x[i,] <- y[j,]
            }
            return(x)
        }
    }
    reduce_ <- mksampler_()
    b <- BamFile(fn, yieldSize=yieldSize)
    open(b, "rb")
    x <- reduceByYield(b, YIELD=yield_, REDUCE=reduce_)
    close(b)
    ## return as a list of two ShortReadQ objects
    barcode <- (if (!is.null(x$tag.B2)) paste0(x$tag.BC, "+", x$tag.B2)
                else x$tag.BC)
    list(
        Fw=ShortReadQ(id=as(paste0(x$qname, "#", barcode, "/1"), "BStringSet"),
                      sread=x$seq,
                      quality=FastqQuality(x$qual)),
        Rv=ShortReadQ(id=as(paste0(x$qname, "#", barcode, "/2"), "BStringSet"),
                      sread=x$seq2,
                      quality=FastqQuality(x$qual2))
    )
}
This takes only about 25 min but still uses >100 GB. Why is the memory consumption so high in this case as well? Avoiding it was the whole point of chunking, yet there appears to be some sort of memory leak. Does anybody know what might be going awry here?
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=C
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ShortRead_1.48.0 GenomicAlignments_1.26.0
[3] GenomicFiles_1.26.0 rtracklayer_1.50.0
[5] BiocParallel_1.24.1 SummarizedExperiment_1.20.0
[7] Biobase_2.50.0 MatrixGenerics_1.2.1
[9] matrixStats_0.58.0 Rsamtools_2.6.0
[11] Biostrings_2.58.0 XVector_0.30.0
[13] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
[15] IRanges_2.24.1 S4Vectors_0.28.1
[17] BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 lattice_0.20-41 png_0.1-7
[4] prettyunits_1.1.1 assertthat_0.2.1 BiocFileCache_1.14.0
[7] R6_2.5.0 RSQLite_2.2.3 httr_1.4.2
[10] pillar_1.4.7 zlibbioc_1.36.0 rlang_0.4.10
[13] GenomicFeatures_1.42.1 progress_1.2.2 curl_4.3
[16] blob_1.2.1 Matrix_1.2-18 stringr_1.4.0
[19] RCurl_1.98-1.2 bit_4.0.4 biomaRt_2.46.3
[22] DelayedArray_0.16.1 compiler_4.0.3 pkgconfig_2.0.3
[25] askpass_1.1 openssl_1.4.3 tidyselect_1.1.0
[28] tibble_3.0.6 GenomeInfoDbData_1.2.4 XML_3.99-0.5
[31] crayon_1.4.1 dplyr_1.0.4 dbplyr_2.1.0
[34] bitops_1.0-6 rappdirs_0.3.3 grid_4.0.3
[37] lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1
[40] stringi_1.5.3 cachem_1.0.4 hwriter_1.3.2
[43] latticeExtra_0.6-29 xml2_1.3.2 ellipsis_0.3.1
[46] vctrs_0.3.6 generics_0.1.0 RColorBrewer_1.1-2
[49] tools_4.0.3 bit64_4.0.5 BSgenome_1.58.0
[52] glue_1.4.2 purrr_0.3.4 jpeg_0.1-8.1
[55] hms_1.0.0 fastmap_1.1.0 AnnotationDbi_1.52.0
[58] memoise_2.0.0 VariantAnnotation_1.36.0
There is this old report, which sounds a bit similar: Rsamtools memory leak
It looks like the culprit may be the deferred-evaluation shenanigans of these S4 objects: the reduce procedure was apparently dragging the entire BAM along in hidden environments. In any case, using a basic data.frame instead of DataFrame in the yield_ function, which also converts the seq/seq2 and qual/qual2 XStringSet columns into character vectors, removes the memory leak and brings memory usage down to a stable <3.5 GB instead of rising to >100 GB. Sadly, it also increases the runtime from 25 min to 45 min, probably because it is now juggling character vectors instead of pointers, so I am not sure that counts as a final solution. Is there no way to strip the hidden stuff from the S4 objects and force them into a pristine state?
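Roughly, the data.frame-based yield then looks like this (a sketch rather than the exact code I ran; the tag handling in particular is simplified, and the final ShortReadQ construction has to coerce the character columns back, e.g. sread=DNAStringSet(x$seq), quality=FastqQuality(x$qual)):

## sketch of a plain-data.frame yield_; 'par' is the ScanBamParam defined above
yield_ <- function(b) {
    res <- scanBam(b, param=par)[[1]]
    n <- length(res$qname)
    if (n == 0L) return(data.frame())
    rv <- 2L * seq_len(n %/% 2L)   # second mate of each pair
    fw <- rv - 1L                  # first mate
    data.frame(qname  = res$qname[fw],
               seq    = as.character(res$seq[fw]),    # DNAStringSet -> character
               qual   = as.character(res$qual[fw]),   # PhredQuality -> character
               seq2   = as.character(res$seq[rv]),
               qual2  = as.character(res$qual[rv]),
               tag.BC = res$tag$BC[fw],
               tag.B2 = if (is.null(res$tag$B2)) NA_character_ else res$tag$B2[fw],
               stringsAsFactors = FALSE)
}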
Does it help to call compact() on the sampled object? I think DNAStringSet is creating an index into the entire sequence, instead of creating a new sequence with just the sampled nucleotides.

Yes, that seems to do the trick. At least returning compact(x) in the reducer instead of x has plugged the memory leak; usage now seems to hover around 3 GB. I'll check the runtime tomorrow and report. Many thanks!
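In terms of the code above, the change is confined to the reducer's return value; a minimal sketch of the modified closure (everything else as in sampleBamInChunks):

mksampler_ <- function() {
    tot <- 0L
    function(x, y, ...) {
        ## ... same reservoir-sampling logic as above ...
        ## compact() copies the sampled rows' sequence data out of the big
        ## shared XVector storage, so the rest of the yield can be freed;
        ## compact(x, check=FALSE) skips the validity check
        compact(x)
    }
}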
Ok, so my final comment is that the runtime went up by a couple of minutes, even when using check=FALSE with compact(). But the memory leak is gone.
By the way, there is a little pitfall with REDUCEsampler when the file size is between sampleSize and yieldSize: if only one yield can be taken, the reducer is never executed, so no sample is drawn at all. A solution could be to manually sample the first yield and pass it as the init= argument to reduceByYield, but then REDUCEsampler would need a tot= argument so that the size of that first yield can be passed in.
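A rough sketch of that workaround, assuming mksampler_ were extended with the tot= argument mentioned above (hypothetical; the closure would start with tot pre-set instead of 0L):

b <- BamFile(fn, yieldSize=yieldSize)
open(b, "rb")
first <- yield_(b)                                      # consume the first yield manually
init  <- first[sample(nrow(first), min(sampleSize, nrow(first))),]
reduce_ <- mksampler_(tot=nrow(first))                  # so later yields are weighted correctly
x <- reduceByYield(b, YIELD=yield_, REDUCE=reduce_, init=init)
close(b)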
FASTQ sampling with REDUCEsampler seems to have the same problem, btw.