A gene in my dataset has average counts of 0 (control) vs. 13.3 (treated). My DESeq2 analysis determines the gene to have a padj = 5.8E-6 and an average LFC of 6.6.
1) How can a LFC value be calculated in this scenario?
2) While I can see how a statistical test could potentially yield padj ≤ 0.05, this isn't a hit I would follow up on. Anyone have first-hand experience or know of literature that determines whether such small shifts in expression (e.g. 0 to 15 counts) correlate with meaningful biology? Am I naive to discredit "hits" arising from such low counts?
Thanks for any advice! Code with data below.
First, I took the 25 genes with highest LFC and padj ≤ 0.05. Plotted the rlog values in a heat map and noticed YDL022C-A was an outlier.
sigGenes <- as.data.frame(results_shrunk) %>%
tibble::rownames_to_column("GeneID") %>%
filter(padj <= 0.05)
sigTop25 <- sigGenes %>%
slice_max(log2FoldChange, n=25) %>%
pull("GeneID")
rld <- rlog(dds_filtered, blind=FALSE)
plotDat <- assay((rld)[sigTop25,])
Heatmap(plotDat)
Checked values in rlog transformation:
> assay((rld)["YDL022C-A",])
sample1 sample2 sample3 sample4 sample5 sample6
YDL022C-A 1.723071 3.014438 1.724505 3.134384 1.730045 2.827426
Seemed fairly low, so I checked results values, where "sigTop25" were pulled from, above.
>results_shrunk["YDL022C-A",]
log2 fold change (MAP): Treatment 0.4M NaCl 15min vs 0.4M NaCl 0min
Wald test p-value: Treatment 0.4M NaCl 15min vs 0.4M NaCl 0min
DataFrame with 1 row and 5 columns
baseMean log2FoldChange lfcSE pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric>
YDL022C-A 7.14097 6.6092 2.59133 2.84041e-06 5.81039e-06
The lfcSE was high, so took a look at the raw count data.
> counts(dds_filtered)["YDL022C-A",]
sample1 sample2 sample3 sample4 sample5 sample6
0 13 0 16 0 11
> sum(counts(dds_filtered))
[1] 229073425
`
The odd-numbered samples belong to the control group - all have 0 counts.
Thanks Michael! I pulled top hits from
results_shrunk
, which has been run with lfcShrink using apeglm. In yourkeep
recommendation, is this saying that that 3 or more samples must have counts with X or more? I do think I'd like using that as a cutoff rather than the sum of counts from all samples.Oh I see now, sorry I missed that. Was reading too fast. Then the answer to (1) should have been that we use Bayes formula to provide finite shrunken LFC here.
Well arguably it has very low dispersion and an infinite LFC. The posterior distribution has the mode about 2.5 posterior SDs from 0. So it the uncertainty is coming through in my opinion.
Re: Q2, yes that's what the
keep
code is saying.Thanks again, Michael. Do you know of any literature that has addressed biological relevance in DE hits such as this? If they aren't relevant, then best to have more stringent filtering so they are removed before analysis. My worry is that too many of these types of hits will inflate my gene pool for downstream analysis - such as GOTerm Enrichment.
I don't know of any literature discussing the biological relevance of lowly expressed DE genes vs highly expressed DE genes.
We have data sets with more samples and replicates (6 samples, 5 replicates of each) than the above example. We have issues with very high LFC and padj in some transcripts with only 5-6 samples having non-zero counts (sometimes 100s of them), while DEseq2 ignores very similar values in others. We are arbitrarily using the "keep" filter at various levels. Are you aware of a non-arbitrary, data based method of determining these filter values?
Are you using lfcShrink, or are you referring to MLE LFC? We've written two papers about moderation of LFC improving gene ranking (original DESeq2, and the new apeglm shrinkage estimator).
Thank you for your quick reply. I was not aware of the apeglm paper. We had been trying the approaches suggested in the limma articles, of a cpm based cutoff and a non-zero sample cutoff, but could not find an objective approach to determining these values based on each data set. As mentioned above we can routinely see transcripts where there are 0 values in 80-90% of samples.
Our data is also different from the examples here, in that they are 5 replicates of 6 different experimental conditions (1 control, 5 sequential dilutions). We compare each dilution to the control. Will DEseq2 look at the variation in all 30 samples to estimate the shrinkage values, or only the 2 being compared?
DESeq2 uses all the samples for dispersion estimation and
lfcShrink
-based shrinkage of a pairwise comparison essentially just takes into account the count distribution of the two groups.You can try it out, and see if it looks reasonable for your purposes. We also implemented the ability to use ashr-based shrinkage, there is a section of the DESeq2 vignette that covers shrinkage estimators.
Thank you, we will try both apeglm and the ashr shrinkage methods.
If they aren't successful are there row sum cutoffs, or sample number cutoffs that can be applied through DEseq2? I have posted a small example of what we are seeing, where the first 2 transcripts are listed as ~0 LFC, p=N/A, but the 3rd transcript as fold change 31 and p = 0.000006.
transcript1 0 1 0 0 0 0 0 1 2 0
transcript2 597 686 416 370 835 5 606 546 483 789
transcript3 2 0 0 0 0 0 0 0 0 0
Ok, so aside from lfcShrink methods being a solution, if part of a larger group of samples, the last row may have a low dispersion overall (due to other sample count distributions) and then the infinite MLE LFC. I recommend for such datasets to just filter out genes with low counts (e.g. < 10) for nearly all samples in a particular comparison.
E.g. the following code (and you can use square bracket indexing to enforce more than X samples with count of 10 or more in a comparison of interest before running
DESeq
).I really appreciate your help with this, we've been stuck on this for some time. The data I showed you comes from a full (~25k transcripts, 30 samples) experiment.
Is there a non-arbitrary way to select the count of 10 or sample filter using a a data-based algorithm? The limma help article suggests some values but they don't justify them.
My preferred non-arbitrary solution is to use lfcShrink.
Very good, I will have lfcShrink run and then let you know how it goes. Thanks again.