Filtering a BAM file using RSamtools
Entering edit mode
Vivek.b ▴ 100
Last seen 2.2 years ago

Dear BioC team

I have a BAM file which I want to split into multiple BAM files, based on the "tags" that I have appended in the read names (so qname param) . I thought it might be possible using `Rsamtools::filterBam()`  function if I am able to provide the right FilterRules instance. But I am confused about : 1) whether it's even possible 2) how to do it?

For example if my tags are divided in 5 categories I can write 5 functions which return TRUE/FALSE given the string in qname field. Now, the Rsamtools manual says:

Functions in the FilterRules instance should expect a single DataFrame argument representing all information specified by param. Each function must return a logical vector, usually of length equal to the number of rows of the DataFrame. Return values are used to include (when TRUE) corresponding records in the filtered BAM file.

Now I don't know how to fit my functions in this kind of structure.. Any help will be appreciated.


rsamtools • 3.1k views
Entering edit mode
Last seen 1 hour ago
United States

By way of getting a reproducible example, I ran

qnames <- scanBam(fl, param=ScanBamParam(what="qname"))[[1]][[1]]
table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))

with the output

> table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))

    B7   EAS1 EAS112 EAS114 EAS139 EAS188 EAS192 EAS218 EAS219 EAS220 EAS221 
   467    505     61    534    209    107     68     82     91     51    130 
 EAS51  EAS54  EAS56 
   222    355    425 

Suppose I'm interested in the reads starting with 'B7'. I'd use ScanBamParam(what="qname") to extract minimal information from the BAM file for the filter. This would give my filter a DataFrame with a single column, qname, and my filter's task would be to return a logical vector indicating which rows pass the filter, so maybe grepl("^B7_", df$qname) or in R-devel the new-fangled startsWith(df$qname, "B7_"). So...

dest <- tempfile()
filter <- FilterRules(list(iWantB7=function(x) startsWith(x$qname, "B7")))
filterBam(fl, dest, filter=filter, param=ScanBamParam(what="qname"))


> qnames <- scanBam(dest, param=ScanBamParam(what="qname"))[[1]][[1]]
> table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))


I worked through this by consulting the help page (also for FilterRules) and getting off the ground with a basic filter

filter <- FilterRules(list(iWantB7=function(x) {
    print(head(x, 3))
    rep(FALSE, nrow(x))


> filterBam(fl, dest, filter=filter)
DataFrame with 3 rows and 15 columns
                   qname      flag    rname   strand       pos    qwidth
             <character> <integer> <factor> <factor> <integer> <integer>
1    B7_591:4:96:693:509        73     seq1        +         1        36
2 EAS54_65:7:152:368:113        73     seq1        +         3        35
3    EAS51_64:8:5:734:57       137     seq1        +         5        35
       mapq       cigar     mrnm      mpos     isize                     seq
  <integer> <character> <factor> <integer> <integer>          <DNAStringSet>
1        99         36M       NA        NA        NA CACTAGTGGC...GTTTAACTCG
2        99         35M       NA        NA        NA CTAGTGGCTC...TTTAACTCGT
3        99         35M       NA        NA        NA AGTGGCTCAT...TAACTCGTCC
                     qual   groupid mate_status
           <PhredQuality> <integer>    <factor>
1 <<<<<<<<<<...<<<<<;:<;7         0          NA
2 <<<<<<<<<<...9<<3/:<6):         0          NA
3 <<<<<<<<<<...<3;);3*8/5         0          NA
[1] "/tmp/RtmpOySFZo/file46ce6598a760"

Probably a better approach is to use formal tags (rather than munging qnames) to annotate your BAM file (you would do this 'upstream', when you're doing alignments or otherwise, and then use the tagFilter argument to ScanBamParam() to selectively input the data rather than making yet another set of BAM files.


Entering edit mode

Thanks a lot Dr. Morgan for a fast and helpful reply. 

Actually I add these tags in my raw fastq files for demultiplexing the samples post-alignment (my fastq files are multiplexed using a novel multiplexing index). I was exploring the idea that demultiplexing after alignment might be more convenient than demultiplexing raw fastq files, followed by alignment.

The solution you suggested worked nicely. I was wondering how I can make it more efficient to filter a large bam file into 3-4 bamFiles this way..

Thanks again


Entering edit mode

in release the only way to do this is with multiple passes through the file, perhaps using BiocParallel::bplapply as

destinations <- replicate(2, tempfile())
filters <- list(
    FilterRules(list(iWantB7 = function(x) startsWith(x$qname, "B7_"))),
    FilterRules(list(iWantEAS1=function(x) startsWith(x$qname, "EAS1_")))
bplapply(seq_along(destinations), function(i, file, destinations, filters) {
    filterBam(file, destinations[i], filter=filters[[i]])
}, fl, destinations, filters)

I added a more efficient implementation of this to the 'devel' version of Rsamtools, version 1.23.2; this requires use of the devel version of Bioconductor, and should be available via biocLite() after the builds complete tomorrow afternoon, January 15, with

filterBam(fl, destinations, filter=filters)

Login before adding your answer.

Traffic: 370 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6