Question

RNA-Seq differential analysis with ballgown, getting more significantly differentially expressed genes than significantly differentially expressed transcripts

ryanding2003 • 0

Hi all,

I used HISAT2, StringTie and Ballgown to perform differential analysis on my data. I followed the tutorial in the HISAT, StringTie, and Ballgown paper and everything went fine until I used Ballgown to extract genes and transcripts with a q-value less than 0.05. I got 293 significantly differentially expressed genes but only 112 significantly differentially expressed transcripts, which doesn't make sense to me. I believe there should be more significantly differentially expressed transcripts than genes, since one gene corresponds to at least one transcript.

My question is: is it possible to have more differentially expressed genes than transcripts? Or maybe I made some mistakes in my analysis. Any thoughts are extremely valuable to me.

My data has two conditions and each condition has three biological replicates.

I am totally new to RNA-Seq and differential analysis so I am sorry if my question is stupid.

Thank you in advance.

ballgown • 2.2k views

updated 7.7 years ago by James W. MacDonald 68k • written 7.7 years ago by ryanding2003 • 0

score 1 · Answer 1 · 2018-05-15

In general, you should expect fewer differentially expressed transcripts (DET) than differentially expressed genes (DEG), for a couple of reasons. The first and most obvious is that the reads available for a given gene are apportioned among all of the transcripts that can arise from that gene. As an example, say a gene has four transcripts and 100 reads align to it. If those reads are apportioned equally, each transcript gets only 25 reads. As the number of reads per feature goes down, the variance relative to the mean goes up, and as that relative variance goes up, your ability to reliably detect differences is reduced.
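To put rough numbers on the 100-versus-25 example, here is a minimal sketch that assumes read counts behave like Poisson counts (an assumption made for illustration only; it is not how Ballgown models expression, and real RNA-Seq counts are usually more variable than Poisson). Under that assumption the coefficient of variation is 1/sqrt(mean), so splitting the reads four ways doubles the relative noise. The function name `poisson_cv` is hypothetical.

```python
import math

def poisson_cv(mean_count):
    """Coefficient of variation (sd / mean) of a Poisson count with this mean."""
    # For a Poisson variable the variance equals the mean, so sd = sqrt(mean).
    return math.sqrt(mean_count) / mean_count

gene_reads = 100        # reads aligned to the gene (from the example above)
n_transcripts = 4       # transcripts sharing those reads
transcript_reads = gene_reads / n_transcripts  # 25 reads per transcript

gene_cv = poisson_cv(gene_reads)
transcript_cv = poisson_cv(transcript_reads)

print(f"gene-level CV:       {gene_cv:.2f}")
print(f"transcript-level CV: {transcript_cv:.2f}")
print(f"ratio:               {transcript_cv / gene_cv:.1f}x")
```

With the same true fold change, a feature whose relative noise is twice as large needs a correspondingly larger effect or more replicates to reach the same q-value cutoff.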

So, all else being equal, you should expect fewer DET than DEG. Another, less obvious issue is the inherent variability in estimating transcript counts. If reads were long enough to span a whole transcript, it would be easy to say which transcript each read came from. However, most reads are far shorter than the transcript, so you have to infer probabilistically which transcript a read came from. That inference carries its own variability (you are never quite sure you assigned a read to the right transcript, whereas you have much higher confidence at the gene level), so transcript-level counts have an additional source of variation that is probably much larger than anything comparable at the gene level. This increased variability also reduces your ability to detect DET as compared to DEG.
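The assignment-uncertainty point can be illustrated with a toy simulation. This is only a sketch under made-up assumptions: a gene with two transcripts A and B, a fixed 50 reads truly from each, and a hypothetical 20% chance that any read is assigned to the wrong transcript. It is not the estimation procedure used by StringTie or Ballgown. Sampling noise is deliberately held fixed so that the only variation left is the one introduced by assigning reads to transcripts.

```python
import random
import statistics

def simulate_counts(n_reps=1000, reads_a=50, reads_b=50,
                    misassign_prob=0.2, seed=0):
    """Return (gene_counts, estimated_A_counts) over n_reps simulated samples."""
    rng = random.Random(seed)
    gene_counts, est_a_counts = [], []
    for _ in range(n_reps):
        # Reads truly from A that are correctly kept on A.
        kept_a = sum(rng.random() >= misassign_prob for _ in range(reads_a))
        # Reads truly from B that are wrongly assigned to A.
        gained_a = sum(rng.random() < misassign_prob for _ in range(reads_b))
        est_a_counts.append(kept_a + gained_a)
        # The gene-level count does not care which transcript a read went to.
        gene_counts.append(reads_a + reads_b)
    return gene_counts, est_a_counts

gene_counts, est_a_counts = simulate_counts()
gene_var = statistics.pvariance(gene_counts)
est_a_var = statistics.pvariance(est_a_counts)

print("variance of gene-level count:      ", gene_var)
print("variance of estimated A-level count:", round(est_a_var, 2))
```

In this toy setup the gene-level count is identical in every simulated sample, while the estimated count for transcript A fluctuates purely because of misassignment. In real data that extra variance sits on top of the usual biological and sampling variation.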