Can featurecounts count number of mapped reads rapidly in arbitrary regions?
1
0
Entering edit mode
Last seen 5.3 years ago
Zurich, Switzerland

Dear All,

I am interested in counting the number of mapped reads in equal size intervals on genome (say 200 bp), my question is whether featurecounts any option for doing this. According to what I have found so far it sounds like it is more appropriate for counting reads mapping to exons, genes, and promoters. In order to use it for this purpose I define a set of features like following:

library("BSgenome.Hsapiens.UCSC.hg19")
lengths = seqlengths(Hsapiens)
chrlengths = lengths[1:24]
intervalnum <- ceiling(chrlengths/200)

start <- rapply(as.list(intervalnum),function(x) seq(1,x*200,200))
end <- rapply(as.list(chrlengths),function(x) c(as.list(seq(200,x,200)),x))
chr <- rep(names(intervalnum),intervalnum)
ann <- data.frame(
GeneID=as.character(1:sum(intervalnum)),
Chr=chr,
Start=start,
End=end,
Strand = rep("+",sum(intervalnum)),
stringsAsFactors=FALSE)

bams <- dir(".","_sorted.bam\$", full=TRUE)

Is there a better and faster way? (this way, it takes too long for each bam file)

0
Entering edit mode

How long did it take for each of your bam files?

0
Entering edit mode

Hi Wei,

I could run it more rapidly with more cores thanks to parallel feature in featurecounts.

0
Entering edit mode

We ran your code to generate 15 million bins and then ran featureCounts to assign a bam file including 11 million reads to the bins. FeatureCounts successfully completed the read counting in just 4 minutes.

Can you provide your session info and also featureCounts screen output?

0
Entering edit mode

Hi Wei,

Code above is the modified version of code I first used with featurecounts. Yes, it is fast enough with 5 cores. I have the latest version of Rsubread. I was not sure whether I was using featurecounts correctly, but with these responses I could modify the code and run it successfully. Thank you so much.

3
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 10 hours ago
The city by the bay

An alternative might be to use windowCounts in the csaw package. You can tinker around with the width, spacing and ext parameters to get it to do what you want; in your case, it seems that values of 200 bp would be appropriate for the first two parameters, and 150 bp for ext. I don't think it would be appreciably faster than featureCounts though, especially since it isn't parallelized; but is speed really a problem? Run it once, leave it for 10-20 minutes depending on your computer's performance/number of BAM files/number of reads, and save the count matrix (from featureCounts) or the SummarizedExperiment object (from windowCounts) to file for future use. It's not like you have to run it every time you open up your data set.

P.S. featureCounts will count whatever features you give it, so it's completely valid to give it fixed intervals. However, if you do so, note that reads spanning the interval boundaries will not be assigned to either interval by default; you'll have to set allowMultiOverlaps=TRUE for such reads to be counted, which is especially important for small intervals. This isn't a problem for gene-based features as genes are usually sparsely distributed (in mammalian genomes, at least).