Question

Counting exonic and intronic reads for differential gene expression

Mthabisi Moyo ▴ 40

@mthabisi-moyo-9721

United States

I have always assumed that reads mapping to exons are used as the input for differential gene expression (DGE) analysis in DESeq2 and other DGE packages primarily because poly(A) capture protocols are favored over total RNA prep protocols, so that the majority of sequenced reads are exonic. Several papers now suggest that intronic reads are not DNA contamination or other technical artifacts (Gaidatzis et al., Nature Biotechnology 2015; Ameur et al., Nat Struct Mol Biol 2011). Given that, is there a reason why counting only exonic reads is still the recommended approach for differential gene expression? The question becomes more relevant when libraries are prepared using total RNA isolation protocols.

I have taken a total RNA data set and performed differential gene expression analysis with DESeq2 using two different feature counting approaches: 1) counting reads that overlap exons, summarized by gene, and 2) counting all reads (intronic and exonic) that fall within a gene. The two sets of results are highly correlated. Is there a reason to exclude intronic reads from the feature counting?
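To make the two counting approaches concrete, here is a minimal sketch in plain Python. It is not how DESeq2 or any counting tool works internally; the annotation, the read coordinates, and the names `GENES`, `READS`, `overlaps` and `count_reads` are all hypothetical, and strand, multi-mapping and overlapping genes are ignored.

```python
from collections import Counter

# Hypothetical toy annotation: gene -> (gene_start, gene_end, [exon intervals]).
# Coordinates are 0-based, half-open, on one chromosome and one strand.
GENES = {
    "geneA": (100, 1000, [(100, 200), (500, 600), (900, 1000)]),
    "geneB": (2000, 2600, [(2000, 2300), (2500, 2600)]),
}

# Hypothetical aligned reads as (start, end) intervals.
READS = [
    (120, 170),    # geneA, exon 1
    (300, 350),    # geneA, intron
    (550, 600),    # geneA, exon 2
    (950, 1000),   # geneA, exon 3
    (2100, 2150),  # geneB, exon 1
    (2350, 2400),  # geneB, intron
]


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def count_reads(reads, genes):
    """Return (exon_counts, gene_body_counts) per gene.

    exon_counts:      approach 1, reads overlapping any exon of the gene.
    gene_body_counts: approach 2, reads overlapping the gene anywhere,
                      exonic or intronic.
    """
    exon_counts = Counter()
    gene_body_counts = Counter()
    for r_start, r_end in reads:
        for gene, (g_start, g_end, exons) in genes.items():
            if not overlaps(r_start, r_end, g_start, g_end):
                continue
            gene_body_counts[gene] += 1
            if any(overlaps(r_start, r_end, e_start, e_end)
                   for e_start, e_end in exons):
                exon_counts[gene] += 1
    return exon_counts, gene_body_counts


exon_counts, gene_body_counts = count_reads(READS, GENES)
for gene in GENES:
    intronic = gene_body_counts[gene] - exon_counts[gene]
    print(gene, "exonic:", exon_counts[gene],
          "gene body:", gene_body_counts[gene], "intron-only:", intronic)
```

The difference between the two counts per gene is the number of reads that fall only in introns, which is the quantity the question is about.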

deseq2 intron

updated 7.4 years ago by Malcolm Cook ★ 1.6k • written 7.4 years ago by Mthabisi Moyo ▴ 40

score 0 · Answer 1 · 2018-09-16

In short, the biochemistry and bioinformatics that you perform should both be aligned with the scientific questions you seek to ask.

The degree to which RNA-Seq can be taken as an indirect or proxy measure of protein abundance has been the subject of much research.

In a particular study, if RNA-Seq is being used as a proxy measure of "gene expression", which is itself understood as a proxy measure of "protein abundance", then enriching for "mature" mRNA (= spliced & ready for translation into protein), e.g. by polyA-enrichment, is a way of sharpening your indirect readout of protein abundance.

In such a study, intronic reads are perforce taken to result from either:

gDNA contamination
artifacts of mapping reads to the genome/transcriptome (incorrect, multiple, etc.)
readout from (un-annotated) nested genes
pre-mRNA (e.g. escaping mature mRNA enrichment)
intron retention likely to introduce a premature stop codon leading to nonsense-mediated decay
recursive splicing
or some other artifact that is not the aim of the study.

However, if studying regulatory aspects of RNA is the aim of the study, then interrogating intronic reads can be informative. In particular, the papers you cite make observations about the timing of splicing with respect to transcription, alternative splicing, and other aspects of transcriptional control:

"the pattern of intronic sequence read coverage is explained by nascent transcription in combination with co-transcriptional splicing"
"intronic levels are a proxy for nascent transcription"
"comparison of exonic and intronic expression changes can separate transcriptional and post-transcriptional effects"

The reason to exclude introns is (roughly) that you are using gene expression analysis of mRNA-Seq data as a proxy for protein abundance.

Your finding of a correlation between DESeq2 results with versus without intronic reads does not address the research questions of your cited papers. If you instead identified genes for which the relative abundance of intronic to exonic reads changed consistently between experimental conditions, for selected genes or classes of genes, you might then have something to say about the relationship between those conditions and, say, co-transcriptional splicing or intron retention or transcriptional elongation rates or ...
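As a rough illustration of that last suggestion, the sketch below computes a per-gene log2 intron-to-exon ratio in each sample and reports how its mean shifts between two conditions. Everything in it is an assumption made for illustration: the count table, the sample names, the pseudocount, and the function names `log2_intron_exon_ratio` and `ratio_shift`. It is a descriptive summary only, not a statistical test, and not the method of the cited papers; a real analysis would need replicates and a proper model of count variability.

```python
import math

# Hypothetical counts: gene -> sample -> (exonic_reads, intronic_reads).
COUNTS = {
    "geneA": {"c1": (99, 9), "c2": (99, 9), "t1": (99, 39), "t2": (99, 39)},
    "geneB": {"c1": (199, 19), "c2": (199, 19), "t1": (199, 19), "t2": (199, 19)},
}
CONTROL = ["c1", "c2"]   # hypothetical control samples
TREATED = ["t1", "t2"]   # hypothetical treated samples


def log2_intron_exon_ratio(exon, intron, pseudocount=1.0):
    """log2 of intronic over exonic reads; the pseudocount avoids log(0)."""
    return math.log2((intron + pseudocount) / (exon + pseudocount))


def ratio_shift(counts, control, treated):
    """Mean log2(intron/exon) in treated minus the mean in control, per gene.

    Because the ratio is taken within each sample, library-size differences
    between samples largely cancel (exactly so only without the pseudocount).
    """
    shifts = {}
    for gene, per_sample in counts.items():
        def mean_ratio(samples):
            values = [log2_intron_exon_ratio(*per_sample[s]) for s in samples]
            return sum(values) / len(values)
        shifts[gene] = mean_ratio(treated) - mean_ratio(control)
    return shifts


shifts = ratio_shift(COUNTS, CONTROL, TREATED)
for gene, shift in shifts.items():
    print(f"{gene}: shift in log2(intron/exon) = {shift:.2f}")
```

In this toy table the exonic counts of geneA are identical in both conditions, so an exon-only analysis would report no change, while the intron-to-exon ratio shifts by two log2 units.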