Question

Filtering a BAM file using RSamtools

1

Entering edit mode

Vivek.b ▴ 100

@vivekb-7661

Last seen 3.9 years ago

Germany

Dear BioC team

I have a BAM file which I want to split into multiple BAM files, based on the "tags" that I have appended in the read names (so qname param) . I thought it might be possible using `Rsamtools::filterBam()` function if I am able to provide the right FilterRules instance. But I am confused about : 1) whether it's even possible 2) how to do it?

For example if my tags are divided in 5 categories I can write 5 functions which return TRUE/FALSE given the string in qname field. Now, the Rsamtools manual says:

Functions in the FilterRules instance should expect a single DataFrame argument representing all information specified by param. Each function must return a logical vector, usually of length equal to the number of rows of the DataFrame. Return values are used to include (when TRUE) corresponding records in the filtered BAM file.

Now I don't know how to fit my functions in this kind of structure.. Any help will be appreciated.

Thanks

rsamtools • 4.2k views

ADD COMMENT • link updated 8.3 years ago by Martin Morgan 25k • written 8.3 years ago by Vivek.b ▴ 100

score 5 · Accepted Answer · 2016-01-13

By way of getting a reproducible example, I ran

library(Rsamtools)
example(filterBam)
qnames <- scanBam(fl, param=ScanBamParam(what="qname"))[[1]][[1]]
table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))

with the output

> table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))

    B7   EAS1 EAS112 EAS114 EAS139 EAS188 EAS192 EAS218 EAS219 EAS220 EAS221 
   467    505     61    534    209    107     68     82     91     51    130 
 EAS51  EAS54  EAS56 
   222    355    425

Suppose I'm interested in the reads starting with 'B7'. I'd use ScanBamParam(what="qname") to extract minimal information from the BAM file for the filter. This would give my filter a DataFrame with a single column, qname, and my filter's task would be to return a logical vector indicating which rows pass the filter, so maybe grepl("^B7_", df$qname) or in R-devel the new-fangled startsWith(df$qname, "B7_"). So...

dest <- tempfile()
filter <- FilterRules(list(iWantB7=function(x) startsWith(x$qname, "B7")))
filterBam(fl, dest, filter=filter, param=ScanBamParam(what="qname"))

and...

> qnames <- scanBam(dest, param=ScanBamParam(what="qname"))[[1]][[1]]
> table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))

 B7 
467

I worked through this by consulting the help page (also for FilterRules) and getting off the ground with a basic filter

filter <- FilterRules(list(iWantB7=function(x) {
    print(head(x, 3))
    rep(FALSE, nrow(x))
}))

and

> filterBam(fl, dest, filter=filter)
DataFrame with 3 rows and 15 columns
                   qname      flag    rname   strand       pos    qwidth
             <character> <integer> <factor> <factor> <integer> <integer>
1    B7_591:4:96:693:509        73     seq1        +         1        36
2 EAS54_65:7:152:368:113        73     seq1        +         3        35
3    EAS51_64:8:5:734:57       137     seq1        +         5        35
       mapq       cigar     mrnm      mpos     isize                     seq
  <integer> <character> <factor> <integer> <integer>          <DNAStringSet>
1        99         36M       NA        NA        NA CACTAGTGGC...GTTTAACTCG
2        99         35M       NA        NA        NA CTAGTGGCTC...TTTAACTCGT
3        99         35M       NA        NA        NA AGTGGCTCAT...TAACTCGTCC
                     qual   groupid mate_status
           <PhredQuality> <integer>    <factor>
1 <<<<<<<<<<...<<<<<;:<;7         0          NA
2 <<<<<<<<<<...9<<3/:<6):         0          NA
3 <<<<<<<<<<...<3;);3*8/5         0          NA
[1] "/tmp/RtmpOySFZo/file46ce6598a760"

Probably a better approach is to use formal tags (rather than munging qnames) to annotate your BAM file (you would do this 'upstream', when you're doing alignments or otherwise, and then use the tagFilter argument to ScanBamParam() to selectively input the data rather than making yet another set of BAM files.