DESeq2 with 2 factors: problem to fit the negative binomial and weird p-value distributions
1
0
Entering edit mode
AurelieMLB ▴ 50
@aureliemlb-6978
Last seen 7.2 years ago
United Kingdom

Hello,

I have RNA-seq data from 27 samples to analyse. I have 2 differents treatments (Compound1 and compound 2) + control (DMSO) and I have 3 times. I have 3 replicates for each combination.

I am trying to use DESeq2 to find the differentially expressed genes. First, may I double check with you that the formula looks OK: ~treatment*time ?? (I have fount ~cell + dex on http://www.bioconductor.org/help/workflows/rnaseqGene/ and I am confused. I am wondering why you would use this formula. Is there something special I should now about DESeq2 formulas?)

So, using a pooled analysis with all the samples at once and the design ~treatment*time:

- I tried to plot the dispersion versus the mean normalized counts (using plotDispEsts(dds)). I am not sure what is the red line in this plot because I thought the parameters were calculated per gene using the replicates... But for the low mean normalized counts (where the dispersion is bigger), this red line is well above the black and shrinked blue points. Is this the sign of a problem?

-I also looked at the p-value plots for different comparisons and some of them are not flat at all. The pvalues seem to slowly increase for the low p-values to the high pvalues with even some spike at 0.4 or 0.8. Is this classical for DESeq2 or is this the sign of a problem?

I thought that there weird observations were coming from a high variability between times so I tried to do a similar analysis but per time (formula: ~treatment ). My problem is that doing this I obtain very different lists of differentially expressed genes...

I am not sure which one is the best analysis really... any suggestion please?

Many thanks

deseq2 • 2.9k views
2
Entering edit mode
@mikelove
Last seen 1 day ago
United States

The design formula is very important, it species the terms which the model will include (these are written beta in the methods papers). A design with ~ a + b controls for differences due to the levels of the 'a' factor and the 'b' factor. Adding an interaction term (either with ~ a + b + a:b, or equivalently how you have written in ~ a*b) means that you want to fit a different treatment effect for every time. You do have replicates which would allow you to fit an interaction design. Notice that we have a time series example below in the workflow you link to, and this time series has an interaction term.

"I tried to plot the dispersion versus the mean normalized counts (using plotDispEsts(dds)). I am not sure what is the red line in this plot because I thought the parameters were calculated per gene using the replicates."

Check the description of our methods in the paper and read the help for ?plotDispEsts and ?estimateDispersions. While we calculate a final dispersion value for each gene, an intermediate step involves finding the mean-dispersion trend line.

It's not a problem if the red line is above the black and blue points at the far left side of the plot. This just means that the parametric curve climbs steeply there. These genes have insufficient power to detect difference, because these genes typically have e.g. a single count for one sample and zeros for the rest.

"I also looked at the p-value plots for different comparisons and some of them are not flat at all."

The p-value distribution should be flat if the null is true. If you have a mix of null and non-null hypothesis, and those have varying power the histogram can have different shapes. You might filter out low count genes as well when looking at the p-value distribution, because the discreteness of low counts produces spikes in the histogram: hist(res$pvalue[res$baseMean > 10])

"I tried to do a similar analysis but per time (formula: ~treatment ). My problem is that doing this I obtain very different lists of differentially expressed genes"

This is expected that you obtain very different list of DEG because this is an entirely different statistical test. This design includes no information about the treatment, so this is probably not the test that you want to perform.

I'd recommend you look at the time series part of the workflow (at the bottom of the link you have above) for an example.

0
Entering edit mode

Thank you very much for your help!!

0
Entering edit mode

The treatments are expected to induce big changes in the samples. So I am expecting to see quite a lot of genes differentially expressed at T24. My understanding is that DESeq2 assumes that the expression of most of genes do not change... Should I treat T1 and T4 in a pooled analysis with a ~treatment*time formula and deal with T24 separately? Is there a way to softer this assumption that the expression of most genes is not changed please?

0
Entering edit mode

It doesn't assume that most genes do not change. It assumes that the median of ratios of a sample to the pseudosample created over all samples accurately captures the sequencing depth.

To check the assumption, just look at a scatter plot of counts(dds, normalized=TRUE) for two samples, and see if the y=x line goes through the middle of the cloud.

You can also use 'controlGenes' argument of estimateSizeFactors if you can supply the genes which you know are likely not changing.

The place this breaks down is that, if you expect that *all* genes are upregulated by some amount for example, then you need to use external reference to guide the comparison, like spike-ins.

0
Entering edit mode

Thank you very much ! This is helpful!

0
Entering edit mode

Just wanted to point out that the RNA-seq tutorial you guys are talking about here is a great resource that you folks put together, thank you for doing that.

I see that it is referenced in the DESeq2 vignette included with the package, but I wonder if there's anyway to jimmy the site to get it linked to under the Documentation section of the DESeq2 package page, as well ...