Question: high lcfSE for genes with zero counts in the majority of conditions
0
gravatar for Fred
28 days ago by
Fred0
Fred0 wrote:

Hi DESeq2 developers,

I have run DEseq2 on a dataset composed by 90 samples (18 different groups x 5 replicates each) and I am interested in the geneA that is known by literature to be highly expressed in 1 group (S04) and has zero counts in almost all the others (attached at the end of the post)

dds <- DESeqDataSetFromMatrix(countData = cts,
                          colData = col_final,
                          design = ~stage)
dds2 <- DESeq(dds)
resultsNames(dds2)
out <- results(dds2, name="stage_S04_vs_S03")

This is the result I obtain for geneA:

gene    baseMean  log2FoldChange    lfcSE     stat      pvalue    padj
geneA   7.6376     9.5624            4.5052   2.1225    0.03379   0.1128

GeneA is highly over expressed however the lfcSE is high and consequently the padj is not significant (if considering a threshold of .05).

I know DESeq2 has been well designed in order to handle zero counts but this result sounds strange to me and so I tried to add 1 at all the raw count values of all the genes in my dataset and I performed again the analysis obtaining the following result:

gene    baseMean    log2FoldChange  lfcSE   stat      pvalue        padj
geneA   15.1641      7.2097          0.6964 10.3520   4.10E-25  1.49E-22

Here, the result is highly significant.

My question is, why this difference just adding 1 to all the gene raw counts? Should I perform the DE analysis in a different way in order to make it fit better on genes with zero counts in almost all samples?

Thanks in advance.

Fred

Raw counts for geneA:

gene    S01_R1  S01_R2  S01_R3  S01_R4  S01_R5  S02_R1  S02_R2  S02_R3  S02_R4  S02_R5  S03_R1  S03_R2  S03_R3  S03_R4  S03_R5  S04_R1  S04_R2  S04_R3  S04_R4  S04_R5  S05_R1  S05_R2  S05_R3  S05_R4  S05_R5  S06_R1  S06_R2  S06_R3  S06_R4  S06_R5  S07_R1  S07_R2  S07_R3  S07_R4  S07_R5  S08_R1  S08_R2  S08_R3  S08_R4  S08_R5  S09_R1  S09_R2  S09_R3  S09_R4  S09_R5  S10_R1  S10_R2  S10_R3  S10_R4  S10_R5  S11_R1  S11_R2  S11_R3  S11_R4  S11_R5  S12_R1  S12_R2  S12_R3  S12_R4  S12_R5  S13_R1  S13_R2  S13_R3  S13_R4  S13_R5  S14_R1  S14_R2  S14_R3  S14_R4  S14_R5  S15_R1  S15_R2  S15_R3  S15_R4  S15_R5  S16_R1  S16_R2  S16_R3  S16_R4  S16_R5  S17_R1  S17_R2  S17_R3  S17_R4  S17_R5  S18_R1  S18_R2  S18_R3  S18_R4  S18_R5
geneA   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   157 217 77  182 123 67  136 76  60  77  0   10  10  8   0   0   5   5   0   0   0   0   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

normalized counts for geneA:

gene    S01_R1  S01_R2  S01_R3  S01_R4  S01_R5  S02_R1  S02_R2  S02_R3  S02_R4  S02_R5  S03_R1  S03_R2  S03_R3  S03_R4  S03_R5  S04_R1  S04_R2  S04_R3  S04_R4  S04_R5  S05_R1  S05_R2  S05_R3  S05_R4  S05_R5  S06_R1  S06_R2  S06_R3  S06_R4  S06_R5  S07_R1  S07_R2  S07_R3  S07_R4  S07_R5  S08_R1  S08_R2  S08_R3  S08_R4  S08_R5  S09_R1  S09_R2  S09_R3  S09_R4  S09_R5  S10_R1  S10_R2  S10_R3  S10_R4  S10_R5  S11_R1  S11_R2  S11_R3  S11_R4  S11_R5  S12_R1  S12_R2  S12_R3  S12_R4  S12_R5  S13_R1  S13_R2  S13_R3  S13_R4  S13_R5  S14_R1  S14_R2  S14_R3  S14_R4  S14_R5  S15_R1  S15_R2  S15_R3  S15_R4  S15_R5  S16_R1  S16_R2  S16_R3  S16_R4  S16_R5  S17_R1  S17_R2  S17_R3  S17_R4  S17_R5  S18_R1  S18_R2  S18_R3  S18_R4  S18_R5
geneA   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   65.87101129 89.66876161 43.13716558 95.51107793 66.68373543 47.89012269 85.70241589 53.78061654 42.17331857 53.9678294  0   10.34768275 9.560664419 7.853550795 0   0   5.783434837 4.973188666 0   0   0   0   0   4.48195665  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

normalized counts for geneA. Normalization made after adding +1 to raw counts of all genes:

gene    S01_R1  S01_R2  S01_R3  S01_R4  S01_R5  S02_R1  S02_R2  S02_R3  S02_R4  S02_R5  S03_R1  S03_R2  S03_R3  S03_R4  S03_R5  S04_R1  S04_R2  S04_R3  S04_R4  S04_R5  S05_R1  S05_R2  S05_R3  S05_R4  S05_R5  S06_R1  S06_R2  S06_R3  S06_R4  S06_R5  S07_R1  S07_R2  S07_R3  S07_R4  S07_R5  S08_R1  S08_R2  S08_R3  S08_R4  S08_R5  S09_R1  S09_R2  S09_R3  S09_R4  S09_R5  S10_R1  S10_R2  S10_R3  S10_R4  S10_R5  S11_R1  S11_R2  S11_R3  S11_R4  S11_R5  S12_R1  S12_R2  S12_R3  S12_R4  S12_R5  S13_R1  S13_R2  S13_R3  S13_R4  S13_R5  S14_R1  S14_R2  S14_R3  S14_R4  S14_R5  S15_R1  S15_R2  S15_R3  S15_R4  S15_R5  S16_R1  S16_R2  S16_R3  S16_R4  S16_R5  S17_R1  S17_R2  S17_R3  S17_R4  S17_R5  S18_R1  S18_R2  S18_R3  S18_R4  S18_R5
geneA   1.635774184 1.719785637 1.702839298 1.703605039 1.903465796 1.430847931 1.421261525 1.407608774 1.380372662 1.311115729 1.079183208 1.09682498  1.051560729 1.090648727 1.084936567 162.9431838 223.0955085 84.35498283 195.9930043 132.978557  70.12744619 138.2285046 79.40901995 63.31627122 80.60083613 1.118879863 12.43936071 12.04617822 9.99266181  1.203508913 1.267958091 7.195204691 6.683102755 1.136002475 1.169048978 1.146531926 1.127530336 1.159288781 5.82279321  1.194192964 0.987346071 1.031285973 1.039259226 0.964340819 1.007731369 1.042617247 1.020616152 0.990375684 1.00272009  1.007731369 1.015522513 0.996295515 0.858884969 0.992064448 0.861797818 1.022674459 1.046543383 0.799627887 0.856250253 1.048665614 1.035942571 1.081474139 1.023373892 1.023373892 1.002792029 1.007731369 0.891690055 1.023373892 1.015522513 1.039259226 1.023373892 1.031285973 1.007731369 1.007731369 1.007731369 0.895166785 0.925114122 0.831105998 0.839899029 0.929809369 0.797988453 0.87220481  0.716255333 0.818337073 0.828413752 0.944984884 0.932353706 0.880071705 0.973851877 1.007731369
deseq2 • 92 views
ADD COMMENTlink modified 28 days ago by Michael Love25k • written 28 days ago by Fred0
Answer: high lcfSE for genes with zero counts in the majority of conditions
0
gravatar for Michael Love
28 days ago by
Michael Love25k
United States
Michael Love25k wrote:

Thanks for the report. Is this the latest version of DESeq2?

I wouldn't add 1 to the data, but I agree it's not giving an accurate SE estimate here.

Another two checks I'd like to see are that LRT and also the shrinkage approaches may give more appropriate test statistics.

The LRT can be performed by making S03 the reference level, and re-running DESeq():

dds$condition <- relevel(dds$condition, "S03")
mm <- model.matrix(design(dds), colData(dds))
dds <- DESeq(dds, test="LRT", full=mm, reduced=mm[,-idx])
res <- results(dds)

Where idx is the number of the column of mm that is the S04 vs S03 comparison.

I'd also be curious about this line:

lfc <- lfcShrink(dds, coef="condition_S04_vs_S03", type="apeglm", svalue=TRUE)

And whether the Bayesian estimation and s-values do a better job here.

ADD COMMENTlink written 28 days ago by Michael Love25k

Thanks Michael for your suggestions.

First of all, the DEseq2 version I'm using is the DESeq21.22.2 and thus not the last one (my fault!). I've updated my DESeq2 version but nothing changes. (Rversion 3.6.1, Platform: x8664-apple-darwin15.6.0 (64-bit), Running under: macOS High Sierra 10.13.6, DESeq2_1.24.0)

I then tried with LTR using the following commands:

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = col_final,
                              design = ~stage)
dds$stage <- relevel(dds$stage,"S03")
mm <- model.matrix(design(dds), colData(dds))
dds2 <- DESeq(dds, test="LRT", full=mm, reduced=mm[,-4])
out <- results(dds2, name="S04")

For geneA I still obtain high lfcSE:

log2 fold change (MLE): stageS04
LRT p-value: full vs reduced 
DataFrame with 1 row and 6 columns
       baseMean   log2FoldChange    lfcSE      stat     pvalue      padj
       <numeric>   <numeric>      <numeric>   <numeric> <numeric>   <numeric>
geneA   7.6376            9.5624    4.5052   2.5804    0.1081       0.2741

ThenI tried with Baesyan estimation using the following commands:

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = col_final,
                              design = ~stage)
dds$stage <- relevel(dds$stage,"S03")
dds2 <- DESeq(dds)
out <- results(dds2, name="stage_S04_vs_S03")
lfc <- lfcShrink(dds2, coef="stage_S04_vs_S03", type="apeglm", svalue=TRUE)

For geneA I obtain the following result:

log2 fold change (MAP): stage S04 vs S03 

DataFrame with 1 row and 4 columns
           baseMean     log2FoldChange  lfcSE      svalue
           <numeric>      <numeric>     <numeric>   <numeric>
geneA      7.6376         0.0455         0.3201     0.1709

Did I do everything fine? Any other idea?

I forgot to mention that the DESeq() is giving the following warning:

fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

Thanks.

Fred

ADD REPLYlink written 27 days ago by Fred0
1

hi Fred,

I took a look and two things were going on that lead to bad estimates of dispersion. One was that there were many genes with all samples having 0 or very low counts, and these were creating a bad dispersion fit. The fit was improved by removing genes where there weren't 5 samples with a count of 10 or higher.

keep <- rowSums(counts(dds) >= 10) >= 5
dds <- dds[keep,]

The other factor was that with so many samples, the standard minimum estimated count was giving overly high estimates of dispersion for these genes that have a low mean across all samples. This also happens in scRNA-seq for example, and the solution to improve DESeq2's dispersion estimates is to lower this value to 1e-6:

dds <- DESeq(dds, minmu=1e-6)

These two alterations lead to this gene getting the expected SE:

res <- results(dds, contrast=c("condition","S04","S03"))
       baseMean log2FoldChange          lfcSE
           7.64          26.55           1.02
ADD REPLYlink written 27 days ago by Michael Love25k

Great, it worked! Thanks Michael.

Fred

ADD REPLYlink written 26 days ago by Fred0

Sorry to bother you again.

I think something is still missing. Using the command below:

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = col1,
                              design = ~condition)
keep <- rowSums(counts(dds) >= 10) >= 5
dds <- dds[keep,]
dds <- DESeq(dds, minmu=1e-6)
res <- results(dds, contrast=c("condition","S04","S03"))

The gene called "gene25047" results significantly upregulated:

res["gene25047",]

                  baseMean   log2FoldChange   lfcSE     stat       pvalue     padj
    gene25047     46.1821      6.6849         0.9968    6.7063   1.995e-11   6.0765e-10

However it has 0 raw counts in all the S03 and a raw counts > 0 just in 1 replicate

(counts(dds)["gene25047",])[11:20]

S03_R1_1 S03_R2_1 S03_R3_1 S03_R4_1 S03_R5_1 S04_R1_1 S04_R2_1 S04_R3_1 S04_R4_1 S04_R5_1 
       0        0        0        0        0        0       21        0        0        0

Did I do something wrong?

Thanks. Fred

ADD REPLYlink written 26 days ago by Fred0

One thing is that, I don't think this would be significant with LRT or with apeglm code from above. I've seen that both of these approaches do better for these comparisons of many groups where some number of the groups have all 0's. Could you check that?

What I think is happening here is again bad SE estimation, but now it's because the other groups for this gene are informing the model of a low dispersion. The Wald requires well estimated SE and these are difficult cases for it because they lie on the parameter boundary.

Another approach -- given that you informed me of the group relationships over email -- instead of per group comparisons would be to model the expression across the groups using a spline, e.g. design = ~ns(x) where x is a numeric specifying the relationship of the groups to each other (e.g. a grid).

ADD REPLYlink written 26 days ago by Michael Love25k

Hi Michael, I confirm that the gene25047 is not significant using the other approaches you suggested me. I will also try to use design = ~ns(x).

Thanks Fred

ADD REPLYlink written 24 days ago by Fred0

Could i take a look at the dds object? Something seems strange. I wonder if the dispersion estimation is being thrown off by some subset of features.

Could you email to maintainer(“DESeq2”)?

ADD REPLYlink written 27 days ago by Michael Love25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 199 users visited in the last hour