Question

DE genes with zero counts

0

Entering edit mode

pt • 0

@03a8fba2

Last seen 18 months ago

South Africa

I am performing a DGE analysis on plant tissues under different stresses with or without a separate treatment: moderate (D) and severe drought (S), with or without a separate treatment (T). Each combination has an unstressed/untreated control, e.g. MUU (moderate drought control, unstressed and untreated), MDU (moderate drought stressed, no treatment), MDT etc., for both levels of stress and different tissues. The analysis is simple in the sense that we are only analysing stressed/treated samples versus their control in a particular tissue: MDUL vs. MUUL, MDTL vs. MUUL in leaf, for example (each sample has three biological replicates).

In order to later use lfcShrink() with apeglm it is necessary to use the coefficient name ("group_MDTL_vs_MUUL") as opposed to a contrast ("group","MDTL","MUUL"). Thus, I releveled the data for each comparison as discussed in the vignette and elsewhere:

dds=DESeqDataSetFromTximport(txi,colData=sampletable,design=~group)
rdds=DESeq(dds)
rdds$group=relevel(rdds$group,ref="MUUL")
rdds=nbinomWaldTest(rdds)
resultsNames(rdds)
[1] "Intercept" ... "group_MDTL_vs_MUUL"
res=results(rdds,name="group_MDTL_vs_MUUL",alpha=0.05,lfcThreshold=1,altHypothesis="greaterAbs")
res_shrink=lfcShrink(dds=rdds,coef="group_MDTL_vs_MUUL",res=res,type="apeglm")

When I was analysing the output for the comparisons I noticed several instances where a gene was called DE despite intuitively seeming like the counts/number of samples with non-zero counts was too low. This included several instances where a gene had literally zero counts in both of the compared samples (bearing in mind the 0.5 pseudocount from plotCounts):

$ res["Solyc00g500050.1.1",]
log2 fold change (MLE): group MDTL vs MUUL 
Wald test p-value: group MDTL vs MUUL 
DataFrame with 1 row and 6 columns
                    baseMean log2FoldChange     lfcSE      stat      pvalue
                   <numeric>      <numeric> <numeric> <numeric>   <numeric>
Solyc00g500050.1.1   31.6121       -20.7792   5.10032  -3.87803 0.000105307
                         padj
                    <numeric>
Solyc00g500050.1.1 0.00381324

$ plotCounts(rdds,"Solyc00g500050.1.1",intgroup="group",return=TRUE) %>% filter(group == "MDTL" | group == "MUUL")
      count group
MDTL1   0.5  MDTL
MDTL2   0.5  MDTL
MDTL3   0.5  MDTL
MUUL1   0.5  MUUL
MUUL2   0.5  MUUL
MUUL3   0.5  MUUL

Using contrast rather than name seemed to prevent these genes being given significant p-values, as did using lfcShrink and filtering genes by s-value instead. I suspect the issue may be related to the highly variable expression of genes across the tissues and conditions, since a gene may be highly expressed in some tissue/condition and zero in many others? This also makes it difficult to pre-filter such genes.

1) Are there any major errors in the pipeline above?

2) Is there any way to avoid identifying such genes, other than a post-hoc filter?

Thanks!

DESeq2 • 910 views

ADD COMMENT • link updated 23 months ago by Michael Love 41k • written 23 months ago by pt • 0

1

Entering edit mode

Did you perform pre-filtering of low counts genes before differential analysis ? Have a look at the DESeq2 vignette on the pre-filtering : http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

ADD REPLY • link 23 months ago Basti ▴ 780

0

Entering edit mode

Yes, I have done it without and without pre-filtering. The problem, which I don't really make clear above, is that these genes are robustly expressed in other conditions and tissues and so have a reasonable baseMean. I would not expect them to be filtered out of the dataset entirely, but I would have expected a gene with no counts in two samples to not be then measured as DE between those two samples (at least, which is what appears to be happening when using the coefficient name in results rather than contrast).

ADD REPLY • link 23 months ago pt • 0

score 1 · Accepted Answer · 2022-06-08

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 8 hours ago

United States

The LFC is unreliable when both groups are 0, and likewise the SE (both are undefined). The solution is to either use contrast or use lfcShrink.

ADD COMMENT • link 23 months ago Michael Love 41k

0

Entering edit mode

Hi Michael, thanks!

I'm less concerned with the LFC per se - indeed, the shrunken LFCs for these genes as produced by lfcShrink are ~0, as expected (unless you mean attempting to test against LFC using lfcThreshold would have a weakness to this?). I was more worried that the fact that they were given a significant padj value at all, and whether this was indicative of an underlying issue that I might miss if simply switching to the alternative lfcShrink method producing s-values.

If not I can just do that, as suggested.

ADD REPLY • link 23 months ago pt • 0

0

Entering edit mode

The Wald p-value breaks down at the boundary of the parameter space 0 vs 0 -- you can either use contrast, lfcShrink or test="LRT" to avoid this, although we don't implement contrast-based LRT in DESeq2, so you can use the first two solutions.

ADD REPLY • link 23 months ago Michael Love 41k