Question

How to determine the cause of no difference between treatments in rnaseq?

0

Entering edit mode

jdmontenegroc • 0

@jdmontenegroc-21026

Last seen 4.8 years ago

Hello, I recently had a fairly straightforward dge analysis comparing two treatments with three biological replicates each. All samples were clones of the same plant and the tests were performed under the same growth conditions except for the treatment. DGE using the the glmQLFit and glmQLFTest commands resulted in 0 genes differentially expressed between the groups for a FDR threshold of 0.01.

I understand that the exact test will produce no genes with DE under two circumstances: 1) there is too much variation in the expression levels within groups and therefore the test estimates a high p.value which translates into no DGE and 2) both groups have actually very similar expression patterns even though the dispersion of the data is not too big.

How can I test which one of the above is the main reason driving the result of no differential genes? Is there any other explanation for that result that I have not considered?

Thank you for your help.

Regards,

Juan D. Montenegro

RNAseq DGE edger • 2.2k views

ADD COMMENT • link updated 4.9 years ago by Gordon Smyth 50k • written 4.9 years ago by jdmontenegroc • 0

1

Entering edit mode

Note that an FDR of .01 is quite conservative, going with an FDR <= 0.10 is not uncommon so take a look at that.

As Aaron points out though, there are many reasons for not detecting any DE genes. Maybe the effect of the treatment isnt vere big at all, in which case more replicates will (always) help ... Or maybe there's some variability / technical artifact that you aren't accounting for ...

ADD REPLY • link 4.9 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thank you Steve,

Indeed going through the experimental design I found out the "clones" were not actually clones, but plants from the same "variety" that show fairly similar behaviour and phenotypes, but are not consistent one generation after the next, so yes, there was some variability between the repeats that I was not taking into account, but then again, I cannot add that factor to the model matrix, unless I can do this:

ind<-factor(c(1, 2, 3, 4, 5, 6))
treat<-factor(c(1, 1, 1, 2, 2, 2))
design<-mode.matrix(~ind+treat)

would that make sense at all?

Cheers,

Juan Daniel

ADD REPLY • link 4.8 years ago jdmontenegroc • 0

0

Entering edit mode

No, unfortunately it wouldn't make sense to encode the design in this way as there's no replication within your ind factor.

ADD REPLY • link 4.8 years ago Steve Lianoglou ★ 13k

score 1 · Answer 1 · 2019-06-14

It's all relative. Large dispersions aren't a problem if your effect sizes are even larger. There's not really any absolute distinction between the two possibilities, other than some rules of thumb based on prior experience.

If you're talking about RNA-seq in (nominally) controlled lab conditions, I would consider dispersions above 0.1 to be quite high. I might then say, "Oh, that's pretty high. Better go have a closer look at the data to see if there's outliers/batch effects/etc. that might be inflating the estimated variance and masking any effect." If I have dispersions below 0.01, then I might say, "Well, that's pretty low, so if I'm not getting any DE genes, there's probably not a strong effect in the first place."

Of course, detection power is also influenced by other factors such as the number of samples, the magnitude of the counts and the number of genes (due to the multiple testing correction). If you're not getting any DE genes with 100 deeply sequenced samples that exhibit Poisson-level variation (i.e., dispersions of zero), then there's probably no differential expression of any biological relevance.

If you want to do some testing, just take your estimated dispersion and the average count for each gene, introduce a known log-fold change, sample counts from a negative binomial distribution and see how many genes you detect. I believe that there are a couple of Bioconductor packages to do this kind of empirical power calculation for you, though I have not used them personally.

score 1 · Answer 2 · 2019-06-15

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

The secret is to plot the data, as illustrated in the edgeR QL workflow article:

Use plotMDS to see if your samples separate between treatments (as in Figure 2 of the workflow). In my experience, when people fail to find any DE but believe they should have, the reason is most commonly because of quality control issues that are evident from the MDS plot.
Use plotBCV to plot the dispersions (Figure 5 of the workflow). This shows you exactly how large or small the dispersions are.
Use plotMD to plot the fold-changes (Figure 7 of the workflow). This shows you exactly how large the fold changes are.
Look at the top 10-30 genes in the topTags table. Are the reported fold changes (logFC) large or small?

ADD COMMENT • link 4.9 years ago Gordon Smyth 50k

0

Entering edit mode

Dear Gordon,

Thank you for your answer. I did perform quality checks on the data before doing the DGE test. After filtering genes with fewer than 10 alignments in fewer than 3 libraries I checked the BCV and MDS plots after normalization. The BCVplot shows quite a bit of variation (common dispersion = 0.23). The MDSplot shows two groups: the repeats in one are quite close to each other, while the other group is spread mostly across the X axis.

I am not sure if the high dispersion of the second group would be enough to skew the dispersion estimates of both groups. The logFC of the top 20 genes in the table is 3 (so, 8-fold change in read counts between groups), so not so much.

ADD REPLY • link 4.8 years ago jdmontenegroc • 0