Question: ZINB-WaVE + DESeq2 misses low-expressed, single-condition genes that look like they should be markers
4
7 months ago by
rory.kirchner60 wrote:

Hey all,

We've been running ZINB-WaVE and DESeq2 on an inDrop dataset to look for differentially expressed genes across subsets of cells. This does a nice job for genes that are more highly expressed, but it looks like this can miss genes that look like they are markers but are expressed at a low level in just one condition.

If I select genes with a log2FoldChange > 10 and plot the baseMean distribution, there's a long tail of genes with a higher baseMean. For example:

Selecting the long-tail genes there (baseMean > 0.3) and jitter-plotting the counts for the two conditions I am testing:

Turning off the Cook's distance filtering doesn't help any, the p-values for these samples are ~1. I get why these aren't being called, the lfcSE is like 10x the log2FoldChange:

        baseMean log2FoldChange     lfcSE       stat    pvalue      padj
<numeric>      <numeric> <numeric>  <numeric> <numeric> <numeric>
Sulf2  0.3104184       13.87955  292.0464 0.04752516 0.9622228        NA
Atp1a2 0.4537823       14.11711  290.0398 0.04867300 0.9612700         1
Csf1   0.3267646       15.11244  282.0484 0.05358100 0.9573912         1
H2afy  0.3259195       16.07577  274.9044 0.05847767 0.9535017         1
Mknk2  0.4025375       17.26996  266.7222 0.06474887 0.9484551         1
...          ...            ...       ...        ...       ...       ...
Dner   0.6504387       14.05085  290.5956 0.04835190 0.9615246         1
Akap12 0.8202702       13.61820  294.3050 0.04627242 0.9631704         1
Col6a3 0.3008975       11.87571  310.7945 0.03821081 0.9696493        NA
Pi15   0.7426169       14.92787  283.4807 0.05265920 0.9580888         1
Marcks 0.3847278       14.81474  284.3462 0.05210108 0.9585412         1

These are also pretty unbalanced and small groups, there's only 54 cells in the mutant_anterior group and 17 in the wt_anterior group and on average ~25% of the cells in the mutant_anterior group express the gene. Despite all that, these do kind of look like they should be called to me. What do you all think?

modified 7 months ago by Michael Love22k • written 7 months ago by rory.kirchner60
Answer: ZINB-WaVE + DESeq2 misses low-expressed, single-condition genes that look like t
5
7 months ago by
Michael Love22k
United States
Michael Love22k wrote:

hi Rory,

Thanks, this is really interesting.

It's funny from dataset to dataset, and depending on application, these kinds of rows will either be desired to be detected or not detected. The other week someone was saying that DESeq2 was too sensitive to DE when it was only a subset of samples (but too many to call them outliers) in one of the groups that had much higher expression, not the entire group.

Are you using LRT? That's recommended for scRNA-seq data. I'm curious if it's to do with zinbwave weighting or it happens for DESeq2 alone. Could you try that? The weights for the all zero group could all be close to zero and causing a problem (but not low enough to trigger the weights check).

Also curious if row-wise Wilcoxon would get these.

At Bioc2018, someone was saying that UMI counts might not require zero-inflated component, but I forget who said this and what was the evidence or paper that backed it up.

1

I think Wolfgang Huber mentioned that about UMI counts right? I'm running a DE analysis using zimbwave and will look out for this too. Thanks Rory.

2

I agree with Mike's suggestions, and would check those first. In addition, could you also let us know how the weights look like for the zeros of those specific genes? Can you share some of your code, e.g. did you include the treatment effect in the ZINB-WaVE model?

Note that in our simulations, we also did not find substantial benefits to account for zero inflation in UMI data: "In general, bulk RNA-seq methods perform well in this simulation, probably because the extremely high zero abundance in combination with low counts can be reasonably accommodated by the negative binomial distribution".
In my experience, accounting for zero inflation is particularly useful for data from full-length protocols like SMART-Seq2.

EDIT: external research (https://www.biorxiv.org/content/early/2018/08/17/225177) seems to suggest that the benefit depends on the dataset, also for UMI data, and we will be looking into diagnostics for this.

Hi everyone, to answer everyone's questions:

I ran ZINB-WaVE with all ~1700 cells in the experiment, and then subset down to a group I was interested in to do DE across conditions within the different subgroups. I fit it like this:

design = model.matrix(~1+genoregion, data=colData(se))
epsilon_de = 1e12
zinb = zinbFit(se, K=0, epsilon=epsilon_de, BPPARAM=BiocParallel::SerialParam())
counts_zinb = zinbwave(se, fitted_model = zinb, K = 2, epsilon=1000)

I ran with LRT and doing the Wald test, and these genes aren't captured either way, the same low expressing but condition-specific genes that look like they should be hits to me get missed. I also tried doing the LRT but not using the weights at all, and the same genes get missed. For all of these, I subset down the DESeq2 object like this to do the group-wise tests:

group0 = dds[, rownames(subset(colData(dds), res.0.8 == 0))]
group0$genoregion = droplevels(group0$genoregion)
table(colData(group0)\$genoregion)
group0 = DESeq(group0, sfType="poscounts", useT=TRUE, minmu=1e-6)
group0anterior = results(group0, contrast=c("genoregion", "mutant_anterior", "wt_anterior"))

which ends up having the dispersions recalculated using only the subset of cells, which may be where I am losing some power. I can provide all the data and the reports if that would be helpful.

I'm surprised that LRT doesn't capture this without the weights, but can't figure out why that would be.

What if you do, after subsetting:

dds.sub <- nbinomLRT(dds.sub, reduced=~1)
res <- results(dds.sub)

Thanks-- I'm an idiot, I was grabbing from the weighted dds object when doing the test instead of the one I created from just the counts. You're right that DESeq2 without the ZINB-WaVE weights catches most of these.

ddsraw = DESeqDataSet(se, design=~1+genoregion)
group0NW = ddsraw[, rownames(subset(colData(ddsraw), res.0.8 == 0))]
group0NW = DESeq(group0NW, test="LRT", full=~1+genoregion, reduced=~1,
sfType="poscounts", minmu=1e-6, minRep=Inf)
group0NWanterior = results(group0NW,
contrast=c("genoregion", "mutant_anterior", "wt_anterior"),
cooksCutoff=FALSE)



No worries. I catch myself doing a dozen slips like this every day before noon. How can you characterize the ones that still aren't caught (and are these missed by Wilcoxon also)?

Ok, it's interesting that the weights trip up the inference on these particular genes, and I'll be interested to see how the treatment design approach that Koen mentions works.

row-wise Wilcoxon does get most of these but only if I grab all the high LFC change genes with the baseMean cutoff in order to not lose them during multiple hypothesis testing correction. True here is padj < 0.05, using BH correction.

Thank you for checking this. In the zinbFit function, could you specify your treatment design as well? This will allow the zero excess probability to systematically vary with the treatment. Also, we recommend setting commondispersion=TRUE.

In the meanwhile, I would be happy to take a look at the data, if you could upload it somewhere or send it via email to koen.vandenberge@ugent.be

3

Thanks everyone for the awesome discussion. I put up a github repository here (https://github.com/roryk/zinbwave-deseq2-indrop) that has an rmarkdown report summarizing all of this. In the report are links to the raw data, I put up Seurat object I started with, and a script to calculate the weights and save a DDS object with and without the weights.