Question

Split a BAM file with read groups

0

Entering edit mode

sbcn ▴ 80

@sbcn-4752

Last seen 20 months ago

Spain

Hi everyone,

Is there a function to split BAM files by read groups in R? There is a way with samtools (outside of R: samtools split) but not with Rsamtools, as far as I can see.

I could read in the file and then parse it, but I'm wondering if there is an already built / more efficient function.

I search for it but can't find anything really dedicated to this...

Thank you!

readgroup bam Rsamtools • 2.7k views

ADD COMMENT • link 3.1 years ago • updated 3.0 years ago sbcn ▴ 80

score 1 · Answer 1 · 2021-03-26

1

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 2 days ago

United States

There is no built-in support that I know of for this functionality.You would have to read in chunks of the file and spit them based on read group. See the genomic files vignette section 5.4 of this vignette: https://bioconductor.org/packages/release/bioc/vignettes/GenomicFiles/inst/doc/GenomicFiles.pdf for one approach. The MAP function would read a chunk of your file. The REDUCE function would write out separate read groups in each chunk.

ADD COMMENT • link 3.1 years ago • updated 3.0 years ago Martin Morgan 25k

1

Entering edit mode

Dear Martin,

I am giving this a try. Here is my modified MAP function:

MAP <- function(range, file, myRG) {
        requireNamespace("BiocParallel")  ## for bplapply()
        nranges = 6
        # split: divide a vector-like object into groups
        idx = split(x=seq_along(range), f=ceiling(seq_along(range)/nranges))

        BiocParallel::bplapply(idx,function(i, range, file) {
             requireNamespace("GenomicAlignments")  ## ScanBamParam(), coverage()
             chunk = range[i]
             param = Rsamtools::ScanBamParam(what=scanBamWhat(), tagFilter=list(RG=myRG), which=chunk[[1]])

             tt <- Rsamtools::scanBam(file, param=param)
             Reduce("+", tt)            ## collapse coverage within chunks
        }, range, file)
}

This works (I get the entries that correspond to the given range and my read group):

NC03_seqlen <- seqlengths(seqinfo(BamFileList("test.bam"))["NC_039898.1"])
tiles <- tileGenome(NC03_seqlen, ntile=5)
 tt <- MAP(tiles, "test.bam", myRG="run1_4894-JE-1_14")

Now when I try the REDUCE function as it is:

REDUCE <- function(mapped, ...) {
        sapply(mapped, Reduce, f = "+")
}

And run the whole thing as:

rgtest <- reduceFiles(unlist(tiles), "test.bam", myRG="run1_4894-JE-1_16", MAP, REDUCE)

I get the following error:

Error in getListElement(x, i, ...) : 
  GRanges objects don't support [[, as.list(), lapply(), or unlist() at
  the moment

I browsed around a bit and it looks like it could be a version incompatibility issue. Do you know a workaround? I copy below my sessionInfo() Thank you!

sessionInfo()

R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.2 (Nitrogen)

Matrix products: default
BLAS/LAPACK: /nfs/software/as/el7.2/EasyBuild/CRG/software/OpenBLAS/0.3.9-GCC-9.3.0/lib/libopenblas_sandybridgep-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GenomicFiles_1.26.0         rtracklayer_1.50.0         
 [3] BiocParallel_1.24.1         GenomicAlignments_1.26.0   
 [5] Rsamtools_2.6.0             Biostrings_2.58.0          
 [7] XVector_0.30.0              SummarizedExperiment_1.20.0
 [9] Biobase_2.50.0              MatrixGenerics_1.2.1       
[11] matrixStats_0.58.0          GenomicRanges_1.42.0       
[13] GenomeInfoDb_1.26.4         IRanges_2.24.1             
[15] S4Vectors_0.28.1            BiocGenerics_0.36.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6             lattice_0.20-41          prettyunits_1.1.1       
 [4] assertthat_0.2.1         digest_0.6.25            BiocFileCache_1.14.0    
 [7] R6_2.4.1                 RSQLite_2.2.0            httr_1.4.1              
[10] pillar_1.4.3             zlibbioc_1.36.0          rlang_0.4.10            
[13] GenomicFeatures_1.42.2   progress_1.2.2           curl_4.3                
[16] blob_1.2.1               Matrix_1.2-18            stringr_1.4.0           
[19] RCurl_1.98-1.2           bit_1.1-15.2             biomaRt_2.46.3          
[22] DelayedArray_0.16.3      compiler_4.0.0           pkgconfig_2.0.3         
[25] askpass_1.1              openssl_1.4.1            tidyselect_1.1.0        
[28] tibble_3.0.1             GenomeInfoDbData_1.2.4   XML_3.99-0.3            
[31] crayon_1.3.4             dplyr_1.0.5              dbplyr_1.4.3            
[34] bitops_1.0-6             rappdirs_0.3.1           grid_4.0.0              
[37] lifecycle_1.0.0          DBI_1.1.0                magrittr_1.5            
[40] stringi_1.4.6            xml2_1.3.2               ellipsis_0.3.0          
[43] vctrs_0.3.7              generics_0.0.2           tools_4.0.0             
[46] bit64_0.9-7              BSgenome_1.58.0          glue_1.4.0              
[49] purrr_0.3.4              hms_0.5.3                AnnotationDbi_1.52.0    
[52] BiocManager_1.30.12      memoise_1.1.0            VariantAnnotation_1.36.0

ADD REPLY • link 3.0 years ago sbcn ▴ 80

1

Entering edit mode

Sorry to be slow in responding.

Split a BAM fiile

Returning to the first question 'how to split a bam file by read group', I think the best way to do this with Rsamtools is to repeatedly use filterBam() to create a BAM file for each tag of interest. The example file

fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)

doesn't have read groups, but we can use the MF tag for illustration. The full file has several values of MF

> param0 <- ScanBamParam(tag = "MF")
> MF <- scanBam(fl,  param = param0)[[1]]$tag$MF
> table(MF)
MF
  18   32   64  130  192 
3121   47   39   64   36

And we can easily create a new file with only a subset of these

> param1 <- ScanBamParam(tagFilter = list(MF = c(32, 64)))
> destination <- tempfile()
> filterBam(fl, destination, param = param1)
[1] "/var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T//RtmppRldQ0/file563e5ac178a8"
> MF1 <- scanBam(destination, param = param0)[[1]]$tag$MF
> table(MF1)
MF1
32 64 
47 39

coverage?

From your response it looks like you want to do something else with the data, not split into separate files. In the MAP function you mention coverage. You could calculate coverage by read group (again using MF as a proxy) in a way that doesn't require a large amount of memory by with

library(GenomicFiles)

bf <- BamFile(fl, yieldSize = 100) # choose a larger value, e.g, 1000000
result1 <- coverage(bf, param = param1)

It might be helpful to see how one might implement this in the GenomicFiles context using reduceByYield():

library(GenomicFiles)

bf <- BamFile(fl, yieldSize = 500) # choose a larger value, e.g, 1000000
result1 <- coverage(bf, param = param1)

YIELD <- function(X) {
    message(".")
    readGAlignments(X, param = param0)
}

MAP <- coverage

REDUCE <- `+`

and then

> bf <- BamFile(fl, yieldSize = 500) # choose a larger value, e.g, 1000000
> open(bf)
> reduceByYield(bf, YIELD, MAP, REDUCE)
.
.
.
.
.
.
.
.
RleList of length 2
$seq1
integer-Rle of length 1575 with 1056 runs
  Lengths:  2  2  1  3  4  2  3  4  2  4  1 ...  2  1  1  1  1  1  1  1  1  6
  Values :  1  2  3  4  5  7  8  9 11 12 13 ... 12 10  9  7  6  5  3  2  1  0

$seq2
integer-Rle of length 1584 with 1092 runs
  Lengths:  1  3  1  1  1  3  1  4  1  1  6 ...  1  1  1  1  2  1  4  4  1 17
  Values :  3  4  5  8 12 14 15 16 17 18 19 ... 14 13 10  8  7  6  3  2  1  0

Substituting param1 for param0 in YIELD would provide read-group specific coverage.

What you're actually after

Actually, it's a little hard for me to follow your code without access to the file. Maybe you could revise your code to use the example file I did, and perhaps with any insights from my work?

ADD REPLY • link 3.0 years ago Martin Morgan 25k

0

Entering edit mode

Dear Martin,

Thank you so much!

I came up with the same conclusion, about using the filterBam instead, as I was struggling with the MAP and REDUCE functions.

I am trying to implement this into the TEQC package: make a report for the different read groups found in the input BAM file without the need of creating new BAM files. As I found it is not so easy, I will rather give the filterBam option, at least for the time being, as a step prior to using the TEQC functions. Eventually, I would like it to be automated, without the need of creating new BAM files, but let's see....

So this is what works for me at the moment, but I'll need to test speed and all (I am not really experienced with BiocParallel::bplapply but I will test it, instead of sapply):

# The bam file for which to extract read groups:
mybam <- "test.bam"

# read BAM header information
sinfo <- Rsamtools::scanBamHeader(mybam)[[1]]

# retrieve all possible read groups from the BAM file header, and unlist:
allRG <- sinfo$text[grep("@RG", names(sinfo$text))]
lsRG <- gsub("ID:", "", unlist(lapply(allRG, function(x)unlist(x)[[1]])))

# Function that creates a new bam file per read group (using the ScanBamParam from the Rsamtools package):

extractRGbam <- function(myBam, myRG){
  # set parameters
  filtparam <-  Rsamtools::ScanBamParam(what="seq", tagFilter=list(RG=myRG))
  # filterBam will then create a new bam file for each read group
  outBam <- paste0(gsub(".bam", "", basename(myBam)), "_", myRG, ".bam")
  Rsamtools::filterBam(myBam, outBam,  param=filtparam)
}

 # "apply" function to each read group
sapply(lsRG, function(x)extractRGbam("test2.bam", x))