I am working with paired-end RNA-seq data and have aligned the reads to the reference genome (unstranded) using STAR. Now, I need to count the aligned reads using featureCounts
.
I am aware of the --countReadPairs
option in featureCounts
, but I'm unsure about its importance in the context of paired-end data. Specifically:
- What is the role of
--countReadPairs
when usingfeatureCounts
on paired-end data? - How does using or not using this option affect the downstream analysis, especially in terms of gene expression quantification?
- Which command should I use to ensure accurate read counting for paired-end data?
Here are the commands I'm considering:
featureCounts -T 14 -s 0 -p -t exon -g gene_id -a genome.gtf -o counts.txt *.bam
or
featureCounts -T 14 -s 0 -p --countReadPairs -t exon -g gene_id -a genome.gtf -o counts.txt *.bam
Any guidance on the proper usage and its implications would be greatly appreciated.
So it is necessary to use
--countReadPairs
when you used paired end data. Then what's the importance of-p
?I guess it's not clear then.
Do you understand the difference between reads and fragments? A fragment is a chunk of cDNA generated from a strand of mRNA, and if you are doing paired-end sequencing you will have (mostly) two reads per fragment.
If you use
-p
you are tellingfeatureCounts
that the input data are paired-end, and it will abort if the data are not paired. In addition, it will provide a count of the reads that overlap any exons of any genes, so if you have two reads from a pair and both overlap exons from the same gene, you get a count of two for that gene (from that pair of reads).I imagine it's more complicated than that though - if you have two reads that are properly paired and one overlaps an exon from gene X and the other overlaps an exon from gene Y, then you might get a single count for each gene, but to me that doesn't make sense, so maybe you get zero counts for that pair? Anyway, that's a bit off-topic, so let's carry on.
If you include
--countReadPairs
, then you are counting fragments instead of reads which is to say that any pair of reads that are properly paired and overlap the same exon (or two exons from the same gene) will be counted as a single fragment rather than as two reads. The idea being that you had a chunk of mRNA that you converted to cDNA and then fragmented, and now you have two reads, one from each end of the fragment, but that fragment only represents one mRNA transcript, so the pair of reads only represents one thing, so why would you count that as two reads?In other words, if you have properly paired reads for which only one read overlaps an exon you will count one read as aligning. But if you have properly paired reads where both reads overlap exons in a gene and you don't specify
--countReadPairs
, then you get a count of two for that gene, even though it was just one fragment.Do note that the default for
featureCounts
inRsubread
is to count fragments rather than reads.