Question: DESEQ2 - help with interpreting time series heatmaps and resultsName
0
7 months ago by
courtney.stairs0 wrote:

Hi there,

I'm struggling a little bit with understanding how to interpret the heatmap output from the fission test data here:

http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#time-course-experiments

First, I would like to just confirm that I understand the differences in the Wald and LRT padj values.

• This heatmap (shown below) was generated by gathering the top 20 genes with the lowest padj values in the results from the whole time series. That is, these are the genes that at one or more time points showed differences between the two strains tests.

• However, if we want to know the p-values/padj for a given comparison, say differences between the strains at a given time point, we have to do another test strictly comparing those:

#just look at one gene to save space :)
results(ddsTC, name="strainmut.minute30", test="Wald")["SPBC2F12.09c",]
log2 fold change (MLE): strainmut.minute30
Wald test p-value: strainmut.minute30
DataFrame with 1 row and 6 columns
baseMean    log2FoldChange             lfcSE
<numeric>         <numeric>         <numeric>
SPBC2F12.09c 174.671161802578 -2.60046902875453 0.634342916342924
<numeric>            <numeric>        <numeric>
SPBC2F12.09c -4.09946885471126 4.14099389594956e-05 0.27993118736619

• This compares the mutant strain to the wildtype strain at minute 30 (i think?). So the LFC represented here are the mutant expression values compared to the wildtype. This will generate a completely different p-value/padj than the previous LRT because its a completely different test comparing different things?

IF, that's right, then let's keep going down this rabbit hole:

• What do the LFC in the heat map represent?

• For example, if we look at SPBC2F12.09c, in the first column of the heatmap (minute15vs_0) what samples are being compared? Is this all time 15 samples (mutant and wildtype) compared to all time 0 samples (mutant and wildtype)? That is, this gene was upregulated at 15 minutes compared to time 0. This would be generated by running:

res15 <- results(ddsTC, name="minute_15_vs_0", test="Wald")
res15["SPBC2F12.09c",]
log2 fold change (MLE): minute 15 vs 0
Wald test p-value: minute 15 vs 0
DataFrame with 1 row and 6 columns
baseMean   log2FoldChange             lfcSE
<numeric>        <numeric>         <numeric>
SPBC2F12.09c 174.671161802578 6.51883346980906 0.540737251817487
<numeric>            <numeric>            <numeric>
SPBC2F12.09c 12.0554547479361 1.81527817129386e-33 2.56031468032063e-31

• Then, if we look at strainmut.minute15 for that same gene, we see that it is blue (down-regulated). Does this mean that it was down-regulated in the mutants at time 15 compared to the wildtype? If so, would the padj value for this comparison be calculated as so:
results(ddsTC, name="strainmut.minute15", test="Wald")["SPBC2F12.09c",]
log2 fold change (MLE): strainmut.minute15
Wald test p-value: strainmut.minute15
DataFrame with 1 row and 6 columns
baseMean    log2FoldChange            lfcSE
<numeric>         <numeric>        <numeric>
SPBC2F12.09c 174.671161802578 -2.57413709070726 0.64079390784322
<numeric>            <numeric>         <numeric>
SPBC2F12.09c -4.01710606046689 5.89172116301028e-05 0.060423288283765

• This padj value is greater than 0.05, so with a FDR cutoff value of 0.05, would you consider this LFC significant? What if the LFC presented in the heat map do not have significant padj values from their respective pair-wise Wald tests? Can we still 'believe' the LFC?

I apologize for such a trivial question, but I would like to really fully understand this :) Thanks for your time!

Courtney

Heatmap of log2 fold changes for genes with smallest adjusted p value. The bottom set of genes show strong induction of expression for the baseline samples in minutes 15-60 (red boxes in the bottom left corner), but then have slight differences for the mutant strain (shown in the boxes in the bottom right corner).

deseq2 • 276 views
modified 7 months ago by Michael Love25k • written 7 months ago by courtney.stairs0
Answer: DESEQ2 - help with interpreting time series heatmaps and resultsName
0
7 months ago by
Michael Love25k
United States
Michael Love25k wrote:

First I'll give some interpretation for the coefficients:

The minutexvs_0 coefficients are the changes over time for the reference level of strain (WT). The following five coefficients are interactions terms (see diagram in DESeq2 vignette). These give the difference between the change in this time point between the non-reference (mut) and reference strain (WT), controlling for any difference that were present at time 0. If this interaction coefficient is zero it implies both WT and mut experience the same change at this time point, compared to their own time point 0.

yes, those are the top genes by adjusted p-value, where the p-value is described by:

"Genes with small p values from this test are those which at one or more time points after time 0 showed a strain-specific effect."

Yes, for a given time point, you need to follow the code below "Wald tests for the log2 fold changes at individual time points..."

Q: "This compares the mutant strain to the wildtype strain at minute 30 (i think?)."

No. this is an interaction term, so it's not mut vs WT but a difference of differences, see above.

Are there follow up questions, given my answer above? Feel free to comment here with any additional questions.

Hi Michael,

The minutexvs_0 coefficients are the changes over time for the reference level of strain (WT).

I see, so there is no consideration of the mutant here. How is the reference strain determined? Is it alphabetical? Or is this when you 'relevel' the dds object? I think I missed this in the tutorial.

Yes, for a given time point, you need to follow the code below "Wald tests for the log2 fold changes at individual time points..."

Ah, ok, this is interesting. Could you help me understand what the LFC from the LRT represents? If one does a pairwise comparison like this:

results(ddsTC, name="strainmut.minute15", test="LRT")["SPBC2F12.09c",]


The resulting LFC represents the difference of differences observed between the mutant and wildtype strains at minute 15 relative to time 0? They DO NOT represent the LFC of that gene between mut and wt at minute15?

However, if one would like to say which genes are upregulated in the mutant at time 15, we would have to use the Wald test. These values might not be significant (depending on thresholds) but will give a more accurate measure of the counts differences between the two strains. This, however, does not take into account the inherit differences between mut and wt strains at time 0 - which seems bad depending on your question.

I'm sorry I'm struggling with this, I would just like to make sure I use the tool properly. I understand that it is also important to consider the scientific question when choosing the statisical test. Would it be possible to provide some quick advice on this? Perhaps I've been using the wrong design formula to being with:

I have a parasite growing with and without a cell line - 3 bioreps at 6 time points (0-24h). I was treating these as 'wildtype' (parasite cells only) and 'mutant' (parasite +cell line) and following the time series tutorial (same design formula etc.). We are interested in understanding how the expression of the parasite changes over time while interacting with the cell line, while ignoring the time-specific expression of the parasite (that is, without the cell line, over time). I think I should report the genes that have a significant padj value with the LRT (as these represent those that change over time).

But I would also like to examine time specific differences between parasite-only and parasite-host. I am unsure which test (pairwise LRT or pairwise Wald) to use when discussing the LFC of each gene at each time point.

Thanks for your reply (and seemingly endless patience with us statistically-ignorant biologists)

Yes, see "Note on factor levels" in the early part of the DESeq2 vignette. You need to specify to R what is the reference level of a factor or it will be alphabetical.

"The resulting LFC represents the difference of differences observed between the mutant and wildtype strains at minute 15 relative to time 0? They DO NOT represent the LFC of that gene between mut and wt at minute15?"

Yes. that's what interaction terms give you.

"if one would like to say which genes are upregulated in the mutant at time 15"

You have to be more specific. Up-regulated, relative to what?

Hi Michael,

Sorry: "if one would like to say which genes are upregulated in the mutant at time 15 compared to the wild type at time 15 you must do a pairwise Wald test" Would this be written as:

results(ddsTC, name="strainmut.minute15", test="Wald")

Would this produce the same result as making an additional metadata column (let's say STRAIN-TIME; e.g., mut.15, wt.15 etc) and then performing:

results(ddsTC, contrast=c("STRAIN-TIME", "mut.15", "wt.15"), test="Wald")

Likewise for my analysis, I would like to report at each time point which parasite genes are upregulated or downregulated when exposed to the host cells.

Thanks!

The first table above is not the difference between mut and WT at time 15. It's the difference, after removing the difference present at time 0. That's what makes it an interaction term. If you want to add that back in (the difference at time 0), you would do so via contrast=list(c("strain_mut_vs_wt","strainmut.minute15")). Many examples like this in the help page for results:

?results

This is then what you would get with the second table.

Thanks Michael! I had looked at the results section, but I struggled with thinking about the provided examples in terms of time as they seem to be more based on conditions + genotypes and not with comparing to time 0. But I'll keep trying!

(Hopefully) just a couple more things. When you say:

The first table above is not the difference between mut and WT at time 15. It's the difference, after removing the difference present at time 0.

Does this mean: "The first table above is not the difference between mut and WT at time 15. It's the difference of mut and WT after removing the difference of mut and WT present at time 0."

Secondly, which sort of comparison is more meaningful biologically?

(1) Those that remove differences at time zero:

results(ddsTC, name="strainmut.minute15", test="Wald")


(2) Those that include differences at time 0:

results(ddsTC, contrast=list(c("strain_mut_vs_wt","strainmut.minute15")), test="Wald")


I think that it is important to take into account the difference at time zero and therefore remove them (1). If this difference is negligible, that will likely be reflected in the model and you would expect to see the same results as if you included them (2)? I mean, do you think there is any harm in always considering the differences at time 0 (and removing them, so in the first table)?

Finally, why must you use the Wald test? Is there a time when you would want to use the pairwaise LRT?

Thanks!

I think you’d be better off meeting with a statistician to discuss how interactions work in a linear model. I try to provide software support here on the support site, but I don’t really have time to provide extended statistical and analysis support. For this you should really find a statistical or quantitative collaborator who has worked with linear models.