Question

DESeq2 log2FoldChange is very large and differs strongly from manually calculated.

0

Entering edit mode

Matthias Munz ▴ 20

@matmu

Last seen 8 months ago

Germany

I have performed a diffential expression analysis on RNAseq data of 17 treated and 21 untreated samples using DESeq2_1.22.2. The results look good except for a gene showing a log2FoldChange < -20. I found a strong discrepancy between the DESeq2 log2FoldChange (-24.72) and the one calculated manually (-2.24).

Have I missed something or why is there such a big difference?

> dds = DESeq(dds, betaPrior=FALSE)

using pre-existing normalization factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 411 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

> res = results(dds, contrast = c("condition", "treated", "untreated"), alpha = 0.05)

> round(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "treated")])
s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s17 
0 0 0 241 0 0 0 0 0 0 0 0 0 0 0 0 0

> mean(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "treated")])
[1] 14.16428

> round(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "untreated")])
s18 s19 s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39
0 232 253 0 0 0 197 0 0 0 336 0 0 0 0 126 0 267 0 0 0

> mean(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "untreated")])
[1] 67.17753

### Manually calculated log2FoldChange ### 
> log2(14.17647/67.19048)
[1] -2.244758

### log2FoldChange by DESeq2 ### 
> res["ENSG00000274229",]
log2 fold change (MLE): condition treated vs untreated
Wald test p-value: condition treated vs untreated
DataFrame with 1 row and 9 columns
                       baseMean    log2FoldChange            lfcSE              stat               pvalue                 padj      symbol      entrez
                      <numeric>         <numeric>        <numeric>         <numeric>            <numeric>            <numeric> <character> <character>
ENSG00000274229 37.124426506691 -24.7224768163257 2.65930667212744 -9.29658736822098 1.45025771568199e-20 2.67558045966171e-16       SOCS7       30837
                         ensgid
                    <character>
ENSG00000274229 ENSG00000274229

deseq2 RNAseq • 2.6k views

ADD COMMENT • link updated 6.8 years ago by Michael Love 43k • written 6.8 years ago by Matthias Munz ▴ 20

score 1 · Answer 1 · 2019-04-09

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 5 days ago

United States

Can you run DESeq() again with minRep=Inf?

I wonder if the outlier replacement is doing something bad for this gene, by replacing the single large count for the treated group. I often don’t like what the outlier flagging or replacement procedure does, but we’ve kept it in with the option to turn it off, so as to remain consistent with the 2014 publication. Meanwhile, we have been developing more robust procedures, either Bayesian or nonparametric.

ADD COMMENT • link 6.8 years ago Michael Love 43k

0

Entering edit mode

It looks much better now.

> dds = DESeq(dds, betaPrior=FALSE, minRep=Inf)
using pre-existing normalization factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

> res = results(dds, contrast = c("condition", "treated", "untreated"), alpha = 0.05)
> res["ENSG00000274229",]
log2 fold change (MLE): condition treated vs untreated
Wald test p-value: condition treated vs untreated
DataFrame with 1 row and 6 columns
                        baseMean    log2FoldChange            lfcSE               stat    pvalue      padj
                       <numeric>         <numeric>        <numeric>          <numeric> <numeric> <numeric>
ENSG00000274229 43.4610768367333 -2.24473731039516 2.90330104518617 -0.773167258737103        NA        NA

So do you recommend using minRep=Inf as default?

ADD REPLY • link 6.8 years ago Matthias Munz ▴ 20

0

Entering edit mode

I personally use minRep=Inf when I run DESeq() on large datasets, because I prefer to perform outlier checks myself, by looking at genes with large Cook's distance (see vignette) by eye.

However, to be consistent with the paper, I haven't changed the software default, and instead we've worked on other approaches which are more robust.

ADD REPLY • link 6.8 years ago Michael Love 43k