DESeq2 log2FoldChange is very large and differs strongly from manually calculated.
1
0
Entering edit mode
@matmu
Last seen 5 weeks ago
Germany

I have performed a diffential expression analysis on RNAseq data of 17 treated and 21 untreated samples using DESeq2_1.22.2. The results look good except for a gene showing a log2FoldChange < -20. I found a strong discrepancy between the DESeq2 log2FoldChange (-24.72) and the one calculated manually (-2.24).

Have I missed something or why is there such a big difference?

> dds = DESeq(dds, betaPrior=FALSE)

using pre-existing normalization factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 411 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

> res = results(dds, contrast = c("condition", "treated", "untreated"), alpha = 0.05)

> round(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "treated")])
s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s17 
0 0 0 241 0 0 0 0 0 0 0 0 0 0 0 0 0

> mean(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "treated")])
[1] 14.16428

> round(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "untreated")])
s18 s19 s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39
0 232 253 0 0 0 197 0 0 0 336 0 0 0 0 126 0 267 0 0 0

> mean(counts(dds, normalized=TRUE)["ENSG00000274229",which(dds$condition == "untreated")])
[1] 67.17753

### Manually calculated log2FoldChange ### 
> log2(14.17647/67.19048)
[1] -2.244758

### log2FoldChange by DESeq2 ### 
> res["ENSG00000274229",]
log2 fold change (MLE): condition treated vs untreated
Wald test p-value: condition treated vs untreated
DataFrame with 1 row and 9 columns
                       baseMean    log2FoldChange            lfcSE              stat               pvalue                 padj      symbol      entrez
                      <numeric>         <numeric>        <numeric>         <numeric>            <numeric>            <numeric> <character> <character>
ENSG00000274229 37.124426506691 -24.7224768163257 2.65930667212744 -9.29658736822098 1.45025771568199e-20 2.67558045966171e-16       SOCS7       30837
                         ensgid
                    <character>
ENSG00000274229 ENSG00000274229
deseq2 RNAseq • 2.1k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 20 hours ago
United States

Can you run DESeq() again with minRep=Inf?

I wonder if the outlier replacement is doing something bad for this gene, by replacing the single large count for the treated group. I often don’t like what the outlier flagging or replacement procedure does, but we’ve kept it in with the option to turn it off, so as to remain consistent with the 2014 publication. Meanwhile, we have been developing more robust procedures, either Bayesian or nonparametric.

ADD COMMENT
0
Entering edit mode

It looks much better now.

> dds = DESeq(dds, betaPrior=FALSE, minRep=Inf)
using pre-existing normalization factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

> res = results(dds, contrast = c("condition", "treated", "untreated"), alpha = 0.05)
> res["ENSG00000274229",]
log2 fold change (MLE): condition treated vs untreated
Wald test p-value: condition treated vs untreated
DataFrame with 1 row and 6 columns
                        baseMean    log2FoldChange            lfcSE               stat    pvalue      padj
                       <numeric>         <numeric>        <numeric>          <numeric> <numeric> <numeric>
ENSG00000274229 43.4610768367333 -2.24473731039516 2.90330104518617 -0.773167258737103        NA        NA

So do you recommend using minRep=Inf as default?

ADD REPLY
0
Entering edit mode

I personally use minRep=Inf when I run DESeq() on large datasets, because I prefer to perform outlier checks myself, by looking at genes with large Cook's distance (see vignette) by eye.

However, to be consistent with the paper, I haven't changed the software default, and instead we've worked on other approaches which are more robust.

ADD REPLY

Login before adding your answer.

Traffic: 509 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6