DESeq2: Significant padj and LFC in a 0 vs. 20 counts comparison?
1
0
Entering edit mode
vanbelj ▴ 30
@vanbelj-21216
Last seen 10 months ago
United States

A gene in my dataset has average counts of 0 (control) vs. 13.3 (treated). My DESeq2 analysis determines the gene to have a padj = 5.8E-6 and an average LFC of 6.6.

1) How can a LFC value be calculated in this scenario?

2) While I can see how a statistical test could potentially yield padj ≤ 0.05, this isn't a hit I would follow up on. Anyone have first-hand experience or know of literature that determines whether such small shifts in expression (e.g. 0 to 15 counts) correlate with meaningful biology? Am I naive to discredit "hits" arising from such low counts?

Thanks for any advice! Code with data below.

First, I took the 25 genes with highest LFC and padj ≤ 0.05. Plotted the rlog values in a heat map and noticed YDL022C-A was an outlier.

sigGenes <- as.data.frame(results_shrunk) %>%
  tibble::rownames_to_column("GeneID") %>% 
  filter(padj <= 0.05)

sigTop25 <- sigGenes %>%
  slice_max(log2FoldChange, n=25) %>%
  pull("GeneID")

rld <- rlog(dds_filtered, blind=FALSE)
plotDat <- assay((rld)[sigTop25,])
Heatmap(plotDat)

Checked values in rlog transformation:

> assay((rld)["YDL022C-A",])

           sample1  sample2  sample3  sample4  sample5  sample6
YDL022C-A 1.723071 3.014438 1.724505 3.134384 1.730045 2.827426

Seemed fairly low, so I checked results values, where "sigTop25" were pulled from, above.

>results_shrunk["YDL022C-A",]

log2 fold change (MAP): Treatment 0.4M NaCl 15min vs 0.4M NaCl 0min 
Wald test p-value: Treatment 0.4M NaCl 15min vs 0.4M NaCl 0min 
DataFrame with 1 row and 5 columns
           baseMean log2FoldChange     lfcSE      pvalue        padj
          <numeric>      <numeric> <numeric>   <numeric>   <numeric>
YDL022C-A   7.14097         6.6092   2.59133 2.84041e-06 5.81039e-06

The lfcSE was high, so took a look at the raw count data.

> counts(dds_filtered)["YDL022C-A",]

sample1 sample2 sample3 sample4 sample5 sample6 
      0      13       0      16       0      11 

> sum(counts(dds_filtered))
[1] 229073425
`

The odd-numbered samples belong to the control group - all have 0 counts.

DESeq2 • 2.0k views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 1 day ago
United States

1) The LFC is infinite, but we bound the E(Y) and also limit the coefficient from going to +/- infinity during GLM fitting.

2) I'd recommend using lfcShrink to prioritize genes. You can pick those with a large shrunken LFC.

Alternatively you could filter out genes that don't satisfy:

keep <- rowSums(counts(dds) >= X) >= 3

You could raise X to 10 or 20 depending on which genes you don't prefer to include in results.

ADD COMMENT
0
Entering edit mode

Thanks Michael! I pulled top hits from results_shrunk, which has been run with lfcShrink using apeglm. In your keep recommendation, is this saying that that 3 or more samples must have counts with X or more? I do think I'd like using that as a cutoff rather than the sum of counts from all samples.

ADD REPLY
1
Entering edit mode

Oh I see now, sorry I missed that. Was reading too fast. Then the answer to (1) should have been that we use Bayes formula to provide finite shrunken LFC here.

Well arguably it has very low dispersion and an infinite LFC. The posterior distribution has the mode about 2.5 posterior SDs from 0. So it the uncertainty is coming through in my opinion.

Re: Q2, yes that's what the keep code is saying.

ADD REPLY
0
Entering edit mode

Thanks again, Michael. Do you know of any literature that has addressed biological relevance in DE hits such as this? If they aren't relevant, then best to have more stringent filtering so they are removed before analysis. My worry is that too many of these types of hits will inflate my gene pool for downstream analysis - such as GOTerm Enrichment.

ADD REPLY
1
Entering edit mode

I don't know of any literature discussing the biological relevance of lowly expressed DE genes vs highly expressed DE genes.

ADD REPLY
0
Entering edit mode

We have data sets with more samples and replicates (6 samples, 5 replicates of each) than the above example. We have issues with very high LFC and padj in some transcripts with only 5-6 samples having non-zero counts (sometimes 100s of them), while DEseq2 ignores very similar values in others. We are arbitrarily using the "keep" filter at various levels. Are you aware of a non-arbitrary, data based method of determining these filter values?

ADD REPLY
0
Entering edit mode

Are you using lfcShrink, or are you referring to MLE LFC? We've written two papers about moderation of LFC improving gene ranking (original DESeq2, and the new apeglm shrinkage estimator).

ADD REPLY
0
Entering edit mode

Thank you for your quick reply. I was not aware of the apeglm paper. We had been trying the approaches suggested in the limma articles, of a cpm based cutoff and a non-zero sample cutoff, but could not find an objective approach to determining these values based on each data set. As mentioned above we can routinely see transcripts where there are 0 values in 80-90% of samples.

Our data is also different from the examples here, in that they are 5 replicates of 6 different experimental conditions (1 control, 5 sequential dilutions). We compare each dilution to the control. Will DEseq2 look at the variation in all 30 samples to estimate the shrinkage values, or only the 2 being compared?

ADD REPLY
0
Entering edit mode

DESeq2 uses all the samples for dispersion estimation and lfcShrink-based shrinkage of a pairwise comparison essentially just takes into account the count distribution of the two groups.

You can try it out, and see if it looks reasonable for your purposes. We also implemented the ability to use ashr-based shrinkage, there is a section of the DESeq2 vignette that covers shrinkage estimators.

ADD REPLY
0
Entering edit mode

Thank you, we will try both apeglm and the ashr shrinkage methods.

If they aren't successful are there row sum cutoffs, or sample number cutoffs that can be applied through DEseq2? I have posted a small example of what we are seeing, where the first 2 transcripts are listed as ~0 LFC, p=N/A, but the 3rd transcript as fold change 31 and p = 0.000006.

     C1 C2  C3 C4 C5 t1 t2 t3 t4 t5

transcript1 0 1 0 0 0 0 0 1 2 0

transcript2 597 686 416 370 835 5 606 546 483 789

transcript3 2 0 0 0 0 0 0 0 0 0

ADD REPLY
0
Entering edit mode

Ok, so aside from lfcShrink methods being a solution, if part of a larger group of samples, the last row may have a low dispersion overall (due to other sample count distributions) and then the infinite MLE LFC. I recommend for such datasets to just filter out genes with low counts (e.g. < 10) for nearly all samples in a particular comparison.

E.g. the following code (and you can use square bracket indexing to enforce more than X samples with count of 10 or more in a comparison of interest before running DESeq).

keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep,]
ADD REPLY
0
Entering edit mode

I really appreciate your help with this, we've been stuck on this for some time. The data I showed you comes from a full (~25k transcripts, 30 samples) experiment.

Is there a non-arbitrary way to select the count of 10 or sample filter using a a data-based algorithm? The limma help article suggests some values but they don't justify them.

ADD REPLY
0
Entering edit mode

My preferred non-arbitrary solution is to use lfcShrink.

ADD REPLY
0
Entering edit mode

Very good, I will have lfcShrink run and then let you know how it goes. Thanks again.

ADD REPLY

Login before adding your answer.

Traffic: 592 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6