Question: Counting exonic and intronic reads for differential gene expression
3 months ago by
Mthabisi Moyo wrote:

I have always assumed that reads mapping to exons are used as the input for differential gene expression analysis in DESeq2 (and other DGE analysis packages) primarily because poly(A) capture protocols are favored over total RNA prep protocols, so the majority of sequenced reads are exonic. Given that a few papers now suggest that intronic reads are not DNA contaminants or other technical artifacts (Gaidatzis et al., Nature Biotechnology 2015; Ameur et al., Nat Struct Mol Biol 2011), is there a reason why counting only exonic reads is still the recommended approach for differential gene expression? This becomes more relevant if libraries are prepared using total RNA isolation protocols.

I have taken a total RNA data set and performed differential gene expression analysis with DESeq2 using two different feature counting approaches: 1) counting exons by gene and 2) counting all reads (intronic and exonic) within a gene. There is a high degree of correlation between the two results. Is there a reason to exclude intronic reads from the feature counting?
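To make the two counting approaches concrete, here is a minimal toy sketch in Python. It is not the pipeline used above: the annotation, the read coordinates, and the helper names (`EXONS`, `READS`, `count_reads`) are all hypothetical, and it ignores strand, multi-mapping, and overlapping genes, which dedicated read counters handle properly.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome

# Hypothetical toy annotation: gene -> list of exon intervals.
EXONS: Dict[str, List[Interval]] = {
    "geneA": [(100, 200), (400, 500)],
    "geneB": [(1000, 1100), (1300, 1450)],
}

# Hypothetical aligned reads, each reduced to one interval.
READS: List[Interval] = [
    (110, 160), (190, 240), (250, 300), (420, 470),
    (1050, 1100), (1150, 1200), (1200, 1250), (2000, 2050),
]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def gene_span(exons: List[Interval]) -> Interval:
    """Gene body = from the first exon start to the last exon end."""
    return (min(s for s, _ in exons), max(e for _, e in exons))


def count_reads(reads: List[Interval],
                exons_by_gene: Dict[str, List[Interval]]) -> Dict[str, Dict[str, int]]:
    """Count reads per gene under both approaches.

    exonic    : read overlaps at least one exon (approach 1)
    gene_body : read overlaps the gene span, exon or intron (approach 2)
    intronic  : gene_body minus exonic, i.e. reads lying entirely in introns
    """
    counts = {}
    for gene, exons in exons_by_gene.items():
        span = gene_span(exons)
        exonic = sum(any(overlaps(r, ex) for ex in exons) for r in reads)
        gene_body = sum(overlaps(r, span) for r in reads)
        counts[gene] = {
            "exonic": exonic,
            "gene_body": gene_body,
            "intronic": gene_body - exonic,
        }
    return counts


if __name__ == "__main__":
    for gene, c in count_reads(READS, EXONS).items():
        print(gene, c)
```

Note that a read spanning an exon/intron boundary is counted as exonic here; whether that is the right call is a design choice, and it matters more in total RNA libraries where such reads are common.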


3 months ago by
Malcolm Cook (United States) wrote:


In short, the biochemistry and bioinformatics that you perform should both be aligned with the scientific questions you seek to ask.

The degree to which RNA-Seq can be taken as an indirect or proxy measure of protein abundance has been the subject of lots of research.

In a particular study, if RNA-Seq is being used as a proxy measure of "gene expression", which itself is being understood as a proxy measure of "protein abundance", then enriching for "mature" mRNA (= spliced & ready for translation into protein) by any means (e.g. polyA-enrichment) is a means of boosting your indirect readout of protein abundance.

In such a study, intronic reads are perforce taken to result from either:

  • gDNA contamination
  • artifacts of mapping reads to the genome/transcriptome (incorrect or multiple alignments, etc.)
  • readout from (un-annotated) nested genes
  • pre-mRNA (e.g. escaping mature mRNA enrichment)
  • intron retention likely to introduce a premature stop codon leading to nonsense mediated decay
  • recursive splicing
  • or some other artifact that is not the aim of the study.

However, if studying regulatory aspects of RNA is the aim of the study, then interrogating intronic reads can be informative. In particular, the papers you cite make observations about the timing of splicing w.r.t. transcription, alternative splicing, and other aspects of transcriptional control:

  • "the pattern of intronic sequence read coverage is explained by nascent transcription in combination with co-transcriptional splicing"
  • "intronic levels are a proxy for nascent transcription"
  • "comparison of exonic and intronic expression changes can separate transcriptional and post-transcriptional effects"

The reason to exclude introns is (roughly) that you are using gene expression analysis of mRNA-Seq data as a proxy for protein abundance.

Finding a correlation between DESeq2 results with vs. without intronic reads does not address the research questions of your cited papers. If you instead identified genes where the relative abundance of intronic to exonic reads (consistently) changed between experimental conditions for selected (classes of) genes, you might have something to say about the relationship between those conditions and, say, co-transcriptional splicing or intron retention or transcriptional elongation rates or ...
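As a rough illustration of that last suggestion, the sketch below computes, per gene, the log2 intronic/exonic ratio in each sample and then the shift in its mean between two conditions. Everything here is assumed for illustration: the counts are made up, the names (`COUNTS`, `CONDITIONS`, `ratio_shift`) are hypothetical, and the pseudocount of 1 is an arbitrary choice. It is a descriptive summary only, not a statistical test; a real analysis would need replicates and a proper model of count variability.

```python
import math
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical per-sample counts: gene -> [(exonic, intronic), ...],
# in the same order as CONDITIONS below.
COUNTS: Dict[str, List[Tuple[int, int]]] = {
    "geneA": [(127, 15), (127, 15), (127, 63), (127, 63)],
    "geneB": [(255, 31), (255, 31), (255, 31), (255, 31)],
}
CONDITIONS: List[str] = ["control", "control", "treated", "treated"]


def log2_ratio(exonic: int, intronic: int, pseudocount: float = 1.0) -> float:
    """log2(intronic / exonic) with a pseudocount to avoid division by zero."""
    return math.log2((intronic + pseudocount) / (exonic + pseudocount))


def ratio_shift(counts: Dict[str, List[Tuple[int, int]]],
                conditions: List[str],
                reference: str = "control",
                test: str = "treated") -> Dict[str, float]:
    """Per gene: mean log2 intron/exon ratio in `test` minus that in `reference`.

    A positive value means relatively more intronic signal in `test`.
    """
    shifts = {}
    for gene, samples in counts.items():
        ratios = [log2_ratio(ex, intr) for ex, intr in samples]
        ref = [r for r, c in zip(ratios, conditions) if c == reference]
        tst = [r for r, c in zip(ratios, conditions) if c == test]
        shifts[gene] = mean(tst) - mean(ref)
    return shifts


if __name__ == "__main__":
    for gene, shift in ratio_shift(COUNTS, CONDITIONS).items():
        print(f"{gene}: shift in log2(intronic/exonic) = {shift:+.2f}")
```

Because the ratio is taken within each sample, sequencing depth largely cancels out (apart from the pseudocount), which is one reason a within-sample intron/exon comparison is attractive for this kind of question.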
