Question

What is the importance of `--countReadPairs` in featureCounts on paired end data?

0

Entering edit mode

Bioinformagician • 0

@2c913ccf

Last seen 11 months ago

India

I am working with paired-end RNA-seq data and have aligned the reads to the reference genome (unstranded) using STAR. Now, I need to count the aligned reads using featureCounts.

I am aware of the --countReadPairs option in featureCounts, but I'm unsure about its importance in the context of paired-end data. Specifically:

What is the role of --countReadPairs when using featureCounts on paired-end data?
How does using or not using this option affect the downstream analysis, especially in terms of gene expression quantification?
Which command should I use to ensure accurate read counting for paired-end data?

Here are the commands I'm considering:

featureCounts -T 14 -s 0 -p -t exon -g gene_id -a genome.gtf -o counts.txt *.bam

or

featureCounts -T 14 -s 0 -p --countReadPairs -t exon -g gene_id -a genome.gtf -o counts.txt *.bam

Any guidance on the proper usage and its implications would be greatly appreciated.

featureCounts DESeq2 • 4.1k views

ADD COMMENT • link updated 10 months ago by rohitsatyam102 ▴ 20 • written 15 months ago by Bioinformagician • 0

0

Entering edit mode

Since I don't know which organism you are working on, I would recommend using the following:

-s 0 if library is unstranded

-C Discard reads where R1 maps to one chromosome and R2 maps to another chromosome or vice-versa`

featureCounts -F GTF --countReadPairs -p -s 0 -C -T 20 -t exon -g gene_id -a genome.gtf -o Counts.txt *.bam

If your genome has gene duplication events (like I work with Plasmodium and Toxoplasma and even though 95-96% of reads are uniquely aligned I wish to account for multimapping reads that emnate from small gene family comprising of 60 genes) or for some reason you expect more multimapping reads (like in rice genome) and wanna account for them you can use some extra flags such as --fraction -M which will add 1/x to the read counts for every read that maps to x locations in the genome. The you can use round() function in R to get integer that DESeq2 loves. Maybe @james-w-macdonald-5106 can attest to this usage can you?

ADD REPLY • link 10 months ago rohitsatyam102 ▴ 20

score 4 · Accepted Answer · 2024-08-19

4

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 14 hours ago

United States

Is this part of the manual not clear?

Summarize paired-end reads and count fragments (instead of reads):

featureCounts -p --countReadPairs -t exon -g gene_id -a annotation.gtf -o counts.txt mapping_results_PE.bam

ADD COMMENT • link 15 months ago James W. MacDonald 68k

0

Entering edit mode

So it is necessary to use --countReadPairs when you used paired end data. Then what's the importance of -p?

ADD REPLY • link 15 months ago Bioinformagician • 0

2

Entering edit mode

I guess it's not clear then.

Do you understand the difference between reads and fragments? A fragment is a chunk of cDNA generated from a strand of mRNA, and if you are doing paired-end sequencing you will have (mostly) two reads per fragment.

If you use -p you are telling featureCounts that the input data are paired-end, and it will abort if the data are not paired. In addition, it will provide a count of the reads that overlap any exons of any genes, so if you have two reads from a pair and both overlap exons from the same gene, you get a count of two for that gene (from that pair of reads).

I imagine it's more complicated than that though - if you have two reads that are properly paired and one overlaps an exon from gene X and the other overlaps an exon from gene Y, then you might get a single count for each gene, but to me that doesn't make sense, so maybe you get zero counts for that pair? Anyway, that's a bit off-topic, so let's carry on.

If you include --countReadPairs, then you are counting fragments instead of reads which is to say that any pair of reads that are properly paired and overlap the same exon (or two exons from the same gene) will be counted as a single fragment rather than as two reads. The idea being that you had a chunk of mRNA that you converted to cDNA and then fragmented, and now you have two reads, one from each end of the fragment, but that fragment only represents one mRNA transcript, so the pair of reads only represents one thing, so why would you count that as two reads?

In other words, if you have properly paired reads for which only one read overlaps an exon you will count one read as aligning. But if you have properly paired reads where both reads overlap exons in a gene and you don't specify --countReadPairs, then you get a count of two for that gene, even though it was just one fragment.

Do note that the default for featureCounts in Rsubread is to count fragments rather than reads.

ADD REPLY • link 15 months ago James W. MacDonald 68k

0

Entering edit mode

Yes use this option. It is recommended for PE and if you don't use it you will observe nearly twice the number of counts in your results. HTSeq also count read pairs or as James said fragments (R1 and R2 as 1 rather than 2).

ADD REPLY • link 10 months ago rohitsatyam102 ▴ 20