High number of significant DE genes with low baseMean, is this abnormal?
1
0
Entering edit mode
dhibar ▴ 60
@dhibar-10079
Last seen 7.6 years ago

I have a relatively large RNAseq data set: 360 Human blood biological replicates, 100bp PE reads with 60M read depth on average, with Poly-A removal.

I'm analyzing these data with DESeq2 following these basic steps:

dds = DESeqDataSetFromMatrix(countData = dat0, colData = colDat, design = ~ Age + Sex + RIN + pH + Treatment)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <-  nbinomWaldTest(dds)
ddsRes <- results(dds, pAdjustMethod = "BH")

However, many of the most significant (including the top 6 results) are small ncRNA with relatively low baseMean counts (<10). Below I am attaching a plot of the p-values versus baseMean distribution, does this look abnormal? It certainly looks different than the ones presented in the DESeq2 tutorial found here: https://www.bioconductor.org/help/course-materials/2015/LearnBioconductorFeb2015/B02.1.1_RNASeqLab.html#diagnostic (under the Independent Filtering section).

deseq2 • 3.6k views
ADD COMMENT
1
Entering edit mode
@ryan-c-thompson-5618
Last seen 7 weeks ago
Icahn School of Medicine at Mount Sinai…

Because you have both many samples and very high coverage depth per sample, you should have much more statistical power at low abundances than a typical RNA-seq experiment. The high sequencing depth means that a given baseMean value, which is on the CPM scale, corresponds to a higher count, and the high replication gives you more statistical power at any given count level. So I don't think your graph is cause for concern. There is still a drop-off near the low-end, but the drop-off happens at a much lower baseMean value than in the tutorial, which is what you'd expect with a lower noise floor.

(By the way, is this dataset publicly available? It sounds like it would be a great dataset to play with.)

ADD COMMENT
0
Entering edit mode

I agree with Ryan. Power goes up substantially when you have 100s of replicates. That's the difference. A base mean of 10 could be 100 samples with a mean of ~0 and 100 samples with a mean of ~20. An example like this should reject the null. 

One small note: the base mean is on the log normalized counts scale, not CPM. "Normalized counts" in DESeq2 are normalized to the depth of a sample with median sequencing depth (what Simon often calls "common scale"). But this doesn't change your message.

ADD REPLY
0
Entering edit mode

Thanks Michael and Ryan. The normalized counts for the top DE gene looks like several of the subjects have a count=0 so I'm wondering if the NB model used in DESeq2 provides accurate estimates in cases like that. I wonder too about the biological significance, if it is significantly DE in the model but it is the case the counts for many subjects are 0 is it likely just an artifact? That second part isn't necessarily a question for you, but something I need to investigate further.

ADD REPLY
1
Entering edit mode

The negative binomial distribution is a discrete distribution that ranges over every possible count value, i.e. all non-negative integers, including zero. For low-abundance genes, a certain proportion of zero counts is expected and normal, and is accounted for in the model without any additional adjustment. If you suspect that there are an excess of zero counts relative to what the NB would predict, you could try a package like ShrinkBayes, which can fit a "zero-inflated negative binomial" model. (I haven't used the package myself, I just know it exists.)

ADD REPLY
0
Entering edit mode

Again, I agree with Ryan. There is absolutely no problem for the Negative Binomial to accommodate zeros when the other values of the samples in the group are not so high. For example:

plot(0:40, dnbinom(0:40, mu=10, size=2), type="b", ylim=c(0,.1))

or

plot(0:40, dnbinom(0:40, mu=20, size=2), type="b", ylim=c(0,.1))

Furthermore, DESeq2 automatically handles individual count outliers: if a few number of sample within a group have very large count, e.g. 10,000, while most of the other samples have very low values. See DESeq2 paper or vignette.

ADD REPLY

Login before adding your answer.

Traffic: 838 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6