Hello,
I have several large BAM files and would like to use readGAlignmentPairs when reading in the Bam files by chunks to reduce memory usage. However, I've noticed that readGAlignmentPairs does not produce the same output when reading in a Bam file by chunks as it does when it reads in the whole Bam file. Is there a work around for this? I have produced sample code below to show that the outputs are not equal.
library(GenomicAlignments)
library(Rsamtools)
library(GenomicFiles)
library(pasillaBamSubset)
param <- ScanBamParam(flag=scanBamFlag(isDuplicate = F,
isSecondaryAlignment = F))
un3 <- untreated3_chr4()
bf1 <- BamFile(un3, yieldSize=100, asMates=T)
bf2 <- BamFile(un3, asMates=T)
yield <- function(x) readGAlignmentPairs(x, param=param)
map <- identity
reduce <- c
galp1 <- reduceByYield(bf1, yield, map, reduce)
galp2 <- readGAlignmentPairs(bf2, param=param)
identical(galp1, galp2)
galp1[100:110]
galp2[100:110]
SessionInfo:
R version 3.3.1 (2016-06-21) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.11.5 (El Capitan) locale: [1] C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] pasillaBamSubset_0.10.0 GenomicFiles_1.8.0 [3] rtracklayer_1.32.1 BiocParallel_1.6.2 [5] GenomicAlignments_1.8.3 Rsamtools_1.24.0 [7] Biostrings_2.40.2 XVector_0.12.0 [9] SummarizedExperiment_1.2.3 Biobase_2.32.0 [11] GenomicRanges_1.24.2 GenomeInfoDb_1.8.3 [13] IRanges_2.6.1 S4Vectors_0.10.2 [15] BiocGenerics_0.18.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.34.3 zlibbioc_1.18.0 BSgenome_1.40.1 [4] tools_3.3.1 DBI_0.4-1 bitops_1.0-6 [7] RCurl_1.95-4.8 biomaRt_2.28.0 RSQLite_1.0.0 [10] compiler_3.3.1 GenomicFeatures_1.24.3 XML_3.98-1.4 [13] VariantAnnotation_1.18.5
Thanks,
Jacob
Perfect, thanks for the help!