Question

Rsamtools filterBam question

1

Entering edit mode

Vivek.b ▴ 100

@vivekb-7661

Last seen 5.7 years ago

Germany

Hi Everyone

I have two questions related to Rsamtools :

1. I am trying to filter my bam file by scanning the genome in 1kb bins, and removing reads that match certain criteria. But I will have to bin the genome chromosome by chromosome and run the filtering per chromosome..

I created the following example using the example BAM file , although it's a bit buggy due to NA's.

Suppose I want to only keep reads with highest mapQ per bin.

library(Rsamtools)

bamFile <- system.file("extdata", "ex1.bam", package="Rsamtools")

sparam <- ScanBamParam(what = c("qname", "flag", "rname", "pos", "mapq"),
                       flag = scanBamFlag(isFirstMateRead = TRUE, isUnmappedQuery = FALSE) )

bam <- scanBam(bamFile, param = sparam)[[1]]

bam <- as.data.frame(bam)

filterfunc <- function(bam) {
        # remove NA positions
        bam <- na.omit(bam)        
        # split df by chromosome
        chroms <- as.character(unique(bam$rname))
        bam2 <- split(bam, bam$rname)
        
     
        # for each chrom, filter highest mapq in bins
        filterHiQ <- function(bam){
                chromEnd <- round(max(bam$pos), -2)
                end <- ifelse(chromEnd < max(bam$pos), chromEnd + 100, chromEnd)
                bins <- .bincode(bam$pos, seq(0,end, 100))
                
                
                tokeep <- lapply(split(bam$mapq, bins), function(x) {
                        keep <- x == max(x)
                        return(keep)
                })
                
                return(unlist(tokeep))
        }
        
        # get filtered bam per chrom
        keep <- unlist(lapply(bam2, filterHiQ))
        return(keep)
        
}

This function works on bamFile that I read (returns logical vector), but when I try to create the FilterRule using this function, this returns errors due to NAs removed.

filterBam(bamFile, "/tmp/test.bam" , filter = FilterRules(list(filterfunc)))

Error in seq.default(0, end, 100) : 'to' cannot be NA, NaN or infinite

I got this problem while trying to create an example. However, my question was different.! I wanted to know if there is a way to apply a function (like the filterHiQ i wrote above) under filterRules in such a way that It does it chromosome by chromosome, so I don't have to split the file internally?

2. My second question was related to the ScanBamParam above. I wanted to actually include R1 reads + R2 reads in case R1 is unmapped. So something like isFirstMateRead = TRUE | hasUnmappedMate = TRUE, but can't find how to do that..

Thanks for your help.

Vivek

P.S. (29th Aug 2016) I just edited the example for clarity. this works if I read the bam file as data frame and run the command. But doesn't work as filterRules

rsamtools • 1.6k views

ADD COMMENT • link 9.4 years ago Vivek.b ▴ 100