Question

High number of significant DE genes with low baseMean, is this abnormal?

0

Entering edit mode

dhibar ▴ 60

@dhibar-10079

Last seen 7.0 years ago

I have a relatively large RNAseq data set: 360 Human blood biological replicates, 100bp PE reads with 60M read depth on average, with Poly-A removal.

I'm analyzing these data with DESeq2 following these basic steps:

dds = DESeqDataSetFromMatrix(countData = dat0, colData = colDat, design = ~ Age + Sex + RIN + pH + Treatment) dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) dds <- nbinomWaldTest(dds) ddsRes <- results(dds, pAdjustMethod = "BH")

However, many of the most significant (including the top 6 results) are small ncRNA with relatively low baseMean counts (<10). Below I am attaching a plot of the p-values versus baseMean distribution, does this look abnormal? It certainly looks different than the ones presented in the DESeq2 tutorial found here: https://www.bioconductor.org/help/course-materials/2015/LearnBioconductorFeb2015/B02.1.1_RNASeqLab.html#diagnostic (under the Independent Filtering section).

deseq2 • 3.3k views

ADD COMMENT • link updated 8.0 years ago by Ryan C. Thompson ★ 7.9k • written 8.0 years ago by dhibar ▴ 60

score 1 · Accepted Answer · 2016-04-21

1

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

Because you have both many samples and very high coverage depth per sample, you should have much more statistical power at low abundances than a typical RNA-seq experiment. The high sequencing depth means that a given baseMean value, which is on the CPM scale, corresponds to a higher count, and the high replication gives you more statistical power at any given count level. So I don't think your graph is cause for concern. There is still a drop-off near the low-end, but the drop-off happens at a much lower baseMean value than in the tutorial, which is what you'd expect with a lower noise floor.

(By the way, is this dataset publicly available? It sounds like it would be a great dataset to play with.)

ADD COMMENT • link 8.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

I agree with Ryan. Power goes up substantially when you have 100s of replicates. That's the difference. A base mean of 10 could be 100 samples with a mean of ~0 and 100 samples with a mean of ~20. An example like this should reject the null.

One small note: the base mean is on the log normalized counts scale, not CPM. "Normalized counts" in DESeq2 are normalized to the depth of a sample with median sequencing depth (what Simon often calls "common scale"). But this doesn't change your message.

ADD REPLY • link 8.0 years ago Michael Love 41k

0

Entering edit mode

Thanks Michael and Ryan. The normalized counts for the top DE gene looks like several of the subjects have a count=0 so I'm wondering if the NB model used in DESeq2 provides accurate estimates in cases like that. I wonder too about the biological significance, if it is significantly DE in the model but it is the case the counts for many subjects are 0 is it likely just an artifact? That second part isn't necessarily a question for you, but something I need to investigate further.

ADD REPLY • link 8.0 years ago dhibar ▴ 60

1

Entering edit mode

The negative binomial distribution is a discrete distribution that ranges over every possible count value, i.e. all non-negative integers, including zero. For low-abundance genes, a certain proportion of zero counts is expected and normal, and is accounted for in the model without any additional adjustment. If you suspect that there are an excess of zero counts relative to what the NB would predict, you could try a package like ShrinkBayes, which can fit a "zero-inflated negative binomial" model. (I haven't used the package myself, I just know it exists.)

ADD REPLY • link 8.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Again, I agree with Ryan. There is absolutely no problem for the Negative Binomial to accommodate zeros when the other values of the samples in the group are not so high. For example:

plot(0:40, dnbinom(0:40, mu=10, size=2), type="b", ylim=c(0,.1))

or

plot(0:40, dnbinom(0:40, mu=20, size=2), type="b", ylim=c(0,.1))

Furthermore, DESeq2 automatically handles individual count outliers: if a few number of sample within a group have very large count, e.g. 10,000, while most of the other samples have very low values. See DESeq2 paper or vignette.

ADD REPLY • link 8.0 years ago Michael Love 41k