I have performed a diffential expression analysis on RNAseq data of 17 treated and 21 untreated samples using DESeq2_1.22.2. The results look good except for a gene showing a log2FoldChange < -20. I found a strong discrepancy between the DESeq2 log2FoldChange (-24.72) and the one calculated manually (-2.24).
Have I missed something or why is there such a big difference?
> dds = DESeq(dds, betaPrior=FALSE)
using pre-existing normalization factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 411 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
> res = results(dds, contrast = c("condition", "treated", "untreated"), alpha = 0.05)
> round(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "treated")])
s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s17
0 0 0 241 0 0 0 0 0 0 0 0 0 0 0 0 0
> mean(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "treated")])
[1] 14.16428
> round(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "untreated")])
s18 s19 s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39
0 232 253 0 0 0 197 0 0 0 336 0 0 0 0 126 0 267 0 0 0
> mean(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "untreated")])
[1] 67.17753
### Manually calculated log2FoldChange ###
> log2(14.17647/67.19048)
[1] -2.244758
### log2FoldChange by DESeq2 ###
> res["ENSG00000274229",]
log2 fold change (MLE): condition treated vs untreated
Wald test p-value: condition treated vs untreated
DataFrame with 1 row and 9 columns
baseMean log2FoldChange lfcSE stat pvalue padj symbol entrez
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <character> <character>
ENSG00000274229 37.124426506691 -24.7224768163257 2.65930667212744 -9.29658736822098 1.45025771568199e-20 2.67558045966171e-16 SOCS7 30837
ensgid
<character>
ENSG00000274229 ENSG00000274229
It looks much better now.
So do you recommend using
minRep=Inf
as default?I personally use
minRep=Inf
when I run DESeq() on large datasets, because I prefer to perform outlier checks myself, by looking at genes with large Cook's distance (see vignette) by eye.However, to be consistent with the paper, I haven't changed the software default, and instead we've worked on other approaches which are more robust.