Hi Everyone
I have two questions related to Rsamtools :
1. I am trying to filter my bam file by scanning the genome in 1kb bins, and removing reads that match certain criteria. But I will have to bin the genome chromosome by chromosome and run the filtering per chromosome..
I created the following example using the example BAM file , although it's a bit buggy due to NA's.
Suppose I want to only keep reads with highest mapQ per bin.
library(Rsamtools)
bamFile <- system.file("extdata", "ex1.bam", package="Rsamtools")
sparam <- ScanBamParam(what = c("qname", "flag", "rname", "pos", "mapq"),
flag = scanBamFlag(isFirstMateRead = TRUE, isUnmappedQuery = FALSE) )
bam <- scanBam(bamFile, param = sparam)[[1]]
bam <- as.data.frame(bam)
filterfunc <- function(bam) {
# remove NA positions
bam <- na.omit(bam)
# split df by chromosome
chroms <- as.character(unique(bam$rname))
bam2 <- split(bam, bam$rname)
# for each chrom, filter highest mapq in bins
filterHiQ <- function(bam){
chromEnd <- round(max(bam$pos), -2)
end <- ifelse(chromEnd < max(bam$pos), chromEnd + 100, chromEnd)
bins <- .bincode(bam$pos, seq(0,end, 100))
tokeep <- lapply(split(bam$mapq, bins), function(x) {
keep <- x == max(x)
return(keep)
})
return(unlist(tokeep))
}
# get filtered bam per chrom
keep <- unlist(lapply(bam2, filterHiQ))
return(keep)
}
This function works on bamFile that I read (returns logical vector), but when I try to create the FilterRule using this function, this returns errors due to NAs removed.
filterBam(bamFile, "/tmp/test.bam" , filter = FilterRules(list(filterfunc))) Error in seq.default(0, end, 100) : 'to' cannot be NA, NaN or infinite
I got this problem while trying to create an example. However, my question was different.! I wanted to know if there is a way to apply a function (like the filterHiQ i wrote above) under filterRules in such a way that It does it chromosome by chromosome, so I don't have to split the file internally?
2. My second question was related to the ScanBamParam above. I wanted to actually include R1 reads + R2 reads in case R1 is unmapped. So something like isFirstMateRead = TRUE | hasUnmappedMate = TRUE, but can't find how to do that..
Thanks for your help.
Vivek
P.S. (29th Aug 2016) I just edited the example for clarity. this works if I read the bam file as data frame and run the command. But doesn't work as filterRules
