Hello,
I am running DESeq2 on an experiment with Treatment (5 of them) and Gender (F and M) as part of the design.
Everything runs ok but I am puzzled about some of the results. I was comparing the results with DESeq and in the process ran it once with Treatment only and separately with Treatment+Gender. Then I looked at the results for a specific gene and got the following, intriguing, result:
Say I run it twice:
ddsTreatmentAndGender <- DESeqDataSetFromMatrix(countData = countsMatrix,
colData = design,
design = ~Treatment+Gender)
> resultForPairDESeq2 <- results(ddsTreatmentAndGender,contrast=c("Treatment","Tri","Cont"))
> resultForPairDESeq2[rownames(resultForPairDESeq2)=="ENSMUSG00000023987",]
log2 fold change (MAP): Treatment Tri vs Cont
Wald test p-value: Treatment Tri vs Cont
DataFrame with 1 row and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000023987 20.83871 0.6267117 0.1429024 4.385592 1.156708e-05
padj
<numeric>
ENSMUSG00000023987 0.0009277118
ddsTreatmentOnly <- DESeqDataSetFromMatrix(countData = countsMatrix,
colData = design,
design = ~Treatment)
> resultForPairDESeq2 <- results(ddsTreatmentOnly,contrast=c("Treatment","Tri","Cont"))
> resultForPairDESeq2[rownames(resultForPairDESeq2)=="ENSMUSG00000023987",]
log2 fold change (MAP): Treatment Tri vs Cont
Wald test p-value: Treatment Tri vs Cont
DataFrame with 1 row and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000023987 16.58064 1.051834 0.1611896 6.525445 NA
padj
<numeric>
ENSMUSG00000023987 NA
>
In other words, with Treatment+Gender as the design this gene appears as signficant when comparing Tri and Control, whereas it does not even pass filter when the design consists of only Treatment. Why is that??????
The raw counts for that specific gene are:
ENSMUSG00000023987 Treatment Gender
1 1 Cont F
2 1 Cont F
3 1 Cont F
4 1 Cont F
5 2 Cont M
6 397 Tri F
7 8 Tri F
8 4 Tri M
9 3 Tri M
10 7 Tri M
Any insight would be greatly appreciated.
Thanks in advance,
Ramiro
oh and with respect to:
"This is what you'd expect -- basically when you don't include "Gender" in your model, there is a source of variability that the model can not account for, and therefore can not make a significance call. When "Gender" is included, you acknowledge (and control for) difference among expression that can be attributed to Gender."
It seems that the issue in question here is not in partitioning variability but rather in that with Treatment only, the outlier is detected yet NOT being removed, per the DESEq2 documentation:
"If a row contains a sample with an extreme count outlier then the p value and adjusted p value are set
to NA. These outlier counts are detected by Cook's distance"
Since there are so few replicates (5 of each) then the outlier and the count NOT replaced with the average of the rest, as it might do otherwise. Thus the p-value is set to NA. Does this sound right to you? Or is there something to the model?
The question is: what is happening when the model includes Gender? What is happening with this outlier? It still has a high Cook's distance value, yet I don't see the matrices replaceCounts and replaceCooks on the assay(dds) list.
Any insight appreciated. Thanks.
hi Ramiro,
When you add gender to the design, then you reduce the number of replicates. You have 2 replicates of F + Tri, so it is not filtering. From? results: "Note: this test excludes the Cook’s distance of samples belonging to experimental groups with only 2 samples." and from the vignette: "At least 3 replicates are required for flagging, as it is difficult to judge which sample might be an outlier with only 2 replicates."
Thanks so much. This was the answer!!! I have been doing the ph525.5x course by the way, it's been really helpful so thanks so much for that as well.
Hi Steve,
Thank you very much for your answer, it was very helpful. I did as you suggested, let me paste again the data for this gene:
expr Treatment Gender
1 1 Cont F
2 1 Cont F
3 1 Cont F
4 1 Cont F
5 2 Cont M
6 397 Tri F
7 8 Tri F
8 4 Tri M
9 3 Tri M
10 7 Tri M
It has 397 counts on sample 6, it is correctly being marked as an outlier when I do Treatment alone (very large Cook's distance, etc). However, when I do Treatment+Gender I am not sure what it is doing (I look at the assay(dds) and there are no replaceCounts replaceCooks matrices). In any case, it does have a high Cooks distance (way more than the qf(0.75,3,31-3)), so why is this gene appearing as significant?
Let me paste the image you suggested:
ggplot(example, aes(Treatment, expr, color=Gender, fill=Gender)) +
geom_boxplot(outlier.colour=NA) +
geom_point(position=position_jitterdodge(dodge.width=0.25))
Shouldn't the high count be marked as an outlier regardless of the design, and thus no significance attained??? With Treatment+Gender it has an adjusted p-value of 0.0009277118, thus it seems to be using the outlier as a count.
Again, thank you very much for your time. This is very helpful.
Ramiro