Search
Question: Filtering a BAM file using RSamtools
1
2.7 years ago by
Vivek.b50
Germany
Vivek.b50 wrote:

Dear BioC team

I have a BAM file which I want to split into multiple BAM files, based on the "tags" that I have appended in the read names (so qname param) . I thought it might be possible using Rsamtools::filterBam()  function if I am able to provide the right FilterRules instance. But I am confused about : 1) whether it's even possible 2) how to do it?

For example if my tags are divided in 5 categories I can write 5 functions which return TRUE/FALSE given the string in qname field. Now, the Rsamtools manual says:

Functions in the FilterRules instance should expect a single DataFrame argument representing all information specified by param. Each function must return a logical vector, usually of length equal to the number of rows of the DataFrame. Return values are used to include (when TRUE) corresponding records in the filtered BAM file.

Now I don't know how to fit my functions in this kind of structure.. Any help will be appreciated.

Thanks

modified 2.7 years ago by Martin Morgan ♦♦ 22k • written 2.7 years ago by Vivek.b50
5
2.7 years ago by
Martin Morgan ♦♦ 22k
United States
Martin Morgan ♦♦ 22k wrote:

By way of getting a reproducible example, I ran

library(Rsamtools)
example(filterBam)
qnames <- scanBam(fl, param=ScanBamParam(what="qname"))[[1]][[1]]
table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))

with the output

> table(vapply(strsplit(qnames, "_"), "[[", character(1), 1))

B7   EAS1 EAS112 EAS114 EAS139 EAS188 EAS192 EAS218 EAS219 EAS220 EAS221
467    505     61    534    209    107     68     82     91     51    130
EAS51  EAS54  EAS56
222    355    425 

Suppose I'm interested in the reads starting with 'B7'. I'd use ScanBamParam(what="qname") to extract minimal information from the BAM file for the filter. This would give my filter a DataFrame with a single column, qname, and my filter's task would be to return a logical vector indicating which rows pass the filter, so maybe grepl("^B7_", df$qname) or in R-devel the new-fangled startsWith(df$qname, "B7_"). So...

dest <- tempfile()
filter <- FilterRules(list(iWantB7=function(x) startsWith(xqname, "B7"))) filterBam(fl, dest, filter=filter, param=ScanBamParam(what="qname"))  and... > qnames <- scanBam(dest, param=ScanBamParam(what="qname"))[[1]][[1]] > table(vapply(strsplit(qnames, "_"), "[[", character(1), 1)) B7 467  I worked through this by consulting the help page (also for FilterRules) and getting off the ground with a basic filter filter <- FilterRules(list(iWantB7=function(x) { print(head(x, 3)) rep(FALSE, nrow(x)) })) and > filterBam(fl, dest, filter=filter) DataFrame with 3 rows and 15 columns qname flag rname strand pos qwidth <character> <integer> <factor> <factor> <integer> <integer> 1 B7_591:4:96:693:509 73 seq1 + 1 36 2 EAS54_65:7:152:368:113 73 seq1 + 3 35 3 EAS51_64:8:5:734:57 137 seq1 + 5 35 mapq cigar mrnm mpos isize seq <integer> <character> <factor> <integer> <integer> <DNAStringSet> 1 99 36M NA NA NA CACTAGTGGC...GTTTAACTCG 2 99 35M NA NA NA CTAGTGGCTC...TTTAACTCGT 3 99 35M NA NA NA AGTGGCTCAT...TAACTCGTCC qual groupid mate_status <PhredQuality> <integer> <factor> 1 <<<<<<<<<<...<<<<<;:<;7 0 NA 2 <<<<<<<<<<...9<<3/:<6): 0 NA 3 <<<<<<<<<<...<3;);3*8/5 0 NA [1] "/tmp/RtmpOySFZo/file46ce6598a760" Probably a better approach is to use formal tags (rather than munging qnames) to annotate your BAM file (you would do this 'upstream', when you're doing alignments or otherwise, and then use the tagFilter argument to ScanBamParam() to selectively input the data rather than making yet another set of BAM files. ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Martin Morgan ♦♦ 22k Thanks a lot Dr. Morgan for a fast and helpful reply. Actually I add these tags in my raw fastq files for demultiplexing the samples post-alignment (my fastq files are multiplexed using a novel multiplexing index). I was exploring the idea that demultiplexing after alignment might be more convenient than demultiplexing raw fastq files, followed by alignment. The solution you suggested worked nicely. I was wondering how I can make it more efficient to filter a large bam file into 3-4 bamFiles this way.. Thanks again Vivek ADD REPLYlink written 2.7 years ago by Vivek.b50 in release the only way to do this is with multiple passes through the file, perhaps using BiocParallel::bplapply as destinations <- replicate(2, tempfile()) filters <- list( FilterRules(list(iWantB7 = function(x) startsWith(xqname, "B7_"))),
FilterRules(list(iWantEAS1=function(x) startsWith(x\$qname, "EAS1_")))
)
bplapply(seq_along(destinations), function(i, file, destinations, filters) {
filterBam(file, destinations[i], filter=filters[[i]])
}, fl, destinations, filters)

I added a more efficient implementation of this to the 'devel' version of Rsamtools, version 1.23.2; this requires use of the devel version of Bioconductor, and should be available via biocLite() after the builds complete tomorrow afternoon, January 15, with

filterBam(fl, destinations, filter=filters)