Hi,
In single-end mode, does feature counts use the the template length (col9 in sam file) to check for an overlap with the feature, or the read length?
I have paired-end data with no unmated reads, I allow multimatchs, so I would have assume that feature count use the template length, my count in in single-end mode should be exactly twice than my count in paired-end, but it does not seems to be the case...
I would like to use the single end mode because my bam is quite big, and already sorted by coordinates, and I takes much more time to ask featureCounts to re-order the reads to use paire end mode (and I have a lot of bam to count, onto various features, so it is taking too long, and I cannot keep the bam sorted by names on my disk because of storage problems...).
Have you read the featureCounts webpage or paper? They both seem to answer your questions.
Hi James,
I did read it but was not sure to understand ; I re-read it after your message, does:
"A read is said to overlap a feature if at least one read base is found to overlap the feature. For paired-end data, a fragment (or template) is said to overlap a feature if any of the two reads from that fragment is found to overlap the feature" (webpage)
and
"featureCounts preforms precise read assignment by comparing mapping location of every base in the read or fragment with the genomic region spanned by each feature" (paper)
mean that it take into account the read himself, and not the fragment length? (now that you tell me the answer is there I guess it is so, but I was not sure before...sorry)
Hi Samuel,
It's just a logic problem, and I think you are over thinking it. Let's say you have paired end data, and it looks like this:
XXXXXXXXXXXXXX XXXXXXXXXXXX <- these are your pairs
******************* <- this is an exon
That read pair overlaps the exon of that gene, so if you are using them as paired ends, you get a single count. However, if you say they are single ends, you still get one count. Now consider
XXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXX <- paired reads
*********************** ***************** <- exons from same gene
In this case you get one count from paired-end mode, and two counts from single end mode.
XXXXXXXXXXXXXXXX XXXXXXXXXXXXXX <- paired reads
************* +++++++++++++ <- exons from two different genes
In this situation, counting in paired end mode gives you zero counts, but in single end mode gives you TWO counts, one for each gene.
XXXXXXXXXXXX XXXXXXXXXXXXXX <- paired reads
************* <- exon
What do we get here? Zero counts! Regardless of how you count, because the reads don't overlap the exon. You could argue that this should count as one count for this gene (and it would if you aligned against the entire extent of each gene, from 5' UTR to 3' UTR, rather than the exonic regions).
You would be surprised to see how many more counts you get if you summarize based on the overlap within the gene body, rather than just the exons - there is apparently lots of nascent mRNA floating about, that hasn't had the introns excised yet. The same is true for the Affy arrays that have intronic controls that are supposed to measure background - these controls have surprisingly high binding. But I'm getting off on a tangent here...
Hi James,
I do see your point and I do see why my question does not make sense for you :) I should have mention that I am not working with RNAseq data (ie counting reads on exons of genes) but with ChIPseq data, counting fragments of IP on enriched regions. That's why I am more interested in the fragment as one entity and not as two read mates paired together. Maybe I should rather transform my reads data as single fragments data and count those, that would solve the problem...
If you are doing ChIP-Seq, then I doubt you want to be using featureCounts anyway. Instead you should probably be using the csaw package.
Ho Thanks a lot, I did not know about this package!! I was using DESeq with counts on peaks (with an home made variant of the pipeline developed in the DiffBind package), but this look actually much better than my tinkering!
Your explanation makes it very clear how the counting is performed. I think though that the way the option is named and the way its description is worded, makes it not entirely clear what precisely is counted and what isn't. "count read pairs" could be reasonably inferred to mean mapped read pairs, and the user is left to guess what happens when only one partner of a pair maps to a feature. The description does not clear up this ambiguity. And in particular, when users whose goal is not differential expression, are trying to leverage the basic counting functionality, it's those use cases where the ambiguity of the option name/description is more problematic. My 2¢.