modeling zero-dominated RNA-seq with voom/limma and hurdle models (pscl)
6
1
Entering edit mode
osa.zuta ▴ 40
@osazuta-8625
Last seen 4.4 years ago
United States

Hello,

Two questions

a) can one reasonably use voom/limma to handle single cell RNA-seq data, which are zero-dominated? Naive application seems to yield reasonable results.

b) it seems more appropriate to try to model the process as a hurdle (logit zero  counts, negative binomial finite counts) or zero -dominated negative binomial process, however the pscl package seems to not really handle a full count *matrix*. If anyone has a small working example of hurdle pscl functions in this context that'd be great.

thanks to all,

zo

limma voom hurdle single cell • 4.6k views
0
Entering edit mode

Have any of you looked at the prior weights when limma is applied to single cell data? In other words, how different are the results if you just run a Lognormal gene-specific model w/o shrinkage? Single cell has a larger sample size than bulk RNA, so I wonder if shrinkage becomes less necessary or maybe even redundant for single cell.

0
Entering edit mode

I suggest you post a new question.

5
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 21 hours ago
The city by the bay

The presence of many zero counts means that the normal model used by voom and limma will not be appropriate. In addition, the mean-variance trend tends to be rather strange when counts are low and discrete (e.g., a sharp dip in the variance at the left of the plot). This usually isn't a problem for routine applications of voom, because such counts are filtered out in bulk RNA-seq analyses. However, they can't be avoided for single-cell data. As a result, I don't think you'd get sensible precision weights for those observations.

Anecdotes from colleagues suggest that the quasi-likelihood (QL) framework in edgeR is more appropriate for single-cell RNA-seq data. The negative binomial (NB) distribution explicitly accounts for zero counts, while the QL framework handles the variability of single cell data. Try the glmQLFit and glmQLFTest functions and see if they work for you.

As a side note, I've also tried to model some single-cell data with a zero-inflated NB distribution. In my experience, though, the estimated NB dispersion is so large that the simple NB distribution directly generates more than enough zeroes without the need for further inflation.

3
Entering edit mode
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia

You might well get reasonable results from limma/voom with single cell data, but I am uneasy about it. My lab uses edgeR quasi-likelihood for single cell data, via the functions that Aaron mentions. This has a lot of similarities with limma, but has more integrated treatment of zero counts.

5 months later. We will be submitting our first single-cell RNA-seq article in a month or two, and we have used edgeR-glm rather than voom or edgeR-QL.

0
Entering edit mode

0
Entering edit mode

Would you recommend using filterByExpr() for single-cell data?

0
Entering edit mode

Depends on the purpose of the DE analysis. I use filterByExpr() when I'm testing for differences between conditions in a replicated multi-condition scRNA-seq study, as detailed in https://osca.bioconductor.org/multi-sample-comparisons.html. This mimics a formal DE analysis of bulk RNA-seq data so the same conventions apply. However, if I'm just looking for markers, I don't bother performing any gene filtering prior to the DE, as the benefits of filtering are much less relevant. (I'm not doing any empirical Bayes, I've already normalized, and I only care about the ranking and not about the size of the p-values.) It's also much harder to decide upon a "reasonable" filter threshold for single-cell data, e.g., an average count of 1 in droplet data is pretty high, but would get immediately discarded by filterByExpr() defaults.

1
Entering edit mode
osa.zuta ▴ 40
@osazuta-8625
Last seen 4.4 years ago
United States

Thanks alot Aaron and Gordon. I've actually tried the QL functions you cite and have found reasonable counterexamples to both the voom and QL DEG calls, i.e. cases where both the QL edgeR calls and voom/limma calls are clearly wrong, but also very high overlap in DEG calls at most FDR cutoffs (though trending towards worse overlap with lower FDR cutoff). This is puzzling. I agree that voom/limma can't be reasonably expected to model a nonlinear bimodal expression process as that in the sc data, but I've even validated some logFC predictions made v/l via qPCR.

1
Entering edit mode

Make sure you set robust=TRUE in glmQLFit. This ensures that genes with high variability between replicates will have high QL dispersions. By default, empirical Bayes shrinkage will drag the QL dispersions downwards towards the mean-dispersion trend, which is not appropriate for genes with highly variable counts.

0
Entering edit mode

Also, what's your definition of "clearly wrong"?

0
Entering edit mode

logFC verified or rejected by qPCR

0
Entering edit mode

Hmm... well, the log-fold change calculation is pretty simple in both limma and edgeR (addition of a prior count aside), so there's not really a great deal that can go wrong. One possibility is that the discrepancy with the qPCR data is caused by the high technical variability of low scRNA-seq counts. This results in an imprecise estimate of the log-FC (which, to split some hairs, isn't technically an incorrect estimate). If such imprecision is the cause of the discrepancy, then I don't see how any calculation would be able to reliably yield an log-fold change estimate close to the true value. A better assessment of limma and edgeR would involve checking whether the methods control the error rate properly, i.e., whether the significant genes are actually DE. After all, that's what all the variance modelling is for.

0
Entering edit mode

Hi Aaron,

I'm modifying an earlier query about AveLogCPM reported by estimateDisp -- I was confused about how aveLogCPM and the lower-level mglmOneGroup function worked and whether or not they are appropriate for scRNA data, which are zero dominated. The computed average abundances for most genes in most cells are quite large (AveLogCPM > 10, etc.), which makes it tough to really tell what is lowly expressed and what is highly expressed.

zo

0
Entering edit mode

AveLogCPM doesn't have any particular problem with zeros. In any case, it is hard to think of any other measure that could be better.

I think you may have mis-understood. AveLogCPM computes a average value for each gene, so it does not give values "for most genes in most cells" as your question refers to.

I don't see any problem with most of the values being > 10, although that doesn't match sc data that I have seen.

0
Entering edit mode

hi gordon,

thanks! I'll make a new question for edgeR, but I do have one in trying to compare AveLogCPM computed with the QL DEG calls of edgeR with AveExpr (topTable) from a naive application of voom on sc RNA data. Like I said at the beginning, there is good overlap, a reasonable MA plot from plotMA, and reasonable DEG results from the voom/limma pipeline, which is counterintuitive.  However, the AveExpr values are just the average of v$E across all samples for a given gene, which is roughly the average of log2(cpms) (mod dge$samples$norm.factors), and thus has negative values and is general much smaller than aveLogCPM ~ log2(rowMeans(dge$counts)) for a given gene. This confused me, and so I was kind of asking which scale makes more sense in the sc context since plotSmear working off of glmQLFTest objects uses aveLogCPM values in its MA plot. What confused me is what dispersion aveLogCPM should be working with in the sc context, since you'd think it should be high and thus aveLogCPM return roughly rowMeans(cpm(dge, log = T)). I'd show you my

plot( voom_scRNA[idx,]$AveExpr, edgeR_scRNA[idx,]$logCPM)

but don't know how to attach....it's concave in form.

0
Entering edit mode

I'm not convinced that you can so easily verify sc RNA-seq results by qPCR, since you can't qPCR the same cells.

qPCR also has its own innaccuracies.

1
Entering edit mode
osa.zuta ▴ 40
@osazuta-8625
Last seen 4.4 years ago
United States

Yes, I had set robust = T in glmQLFit already.  In fact, I find it difficult to understand why this setting doesn't affect the MA plot all that much. Would it be too much trouble for you to post a typical MA plot you obtain from scRNA data so I can compare with mine? I'd attach mine but don't know how (new to this forum with bad connection).

1
Entering edit mode

Setting robust=TRUE shouldn't affect the MA plot at all, as it won't change the log-fold change or average abundances of the genes; this parameter only affects the QL dispersion estimates (specifically, the EB shrinkage thereof).

0
Entering edit mode
osa.zuta ▴ 40
@osazuta-8625
Last seen 4.4 years ago
United States

I guess I don't completely understand your comment so I'll have to think about it  some more. It's difficult to understand the regime of very high tagwise and common dispersion -- that's part of the problem.

0
Entering edit mode

I don't know why you're talking about the log-fold change for plotQLDisp which, as its name suggests, plots the QL dispersions, not the log-fold change. In any case, moving on; a decrease in the NB dispersion with increasing abundance is observed for most RNA-seq data analyses. This is because large counts tend to be more stable/precise (it's got nothing to do with logarithms, as we don't log-transform the counts). In most analyses, this trend is fully modelled by the mean-dispersion trend that is fitted in estimateDisp. As such, the QL dispersion trend from glmQLFit is expected to be flat and around unity.

0
Entering edit mode
osa.zuta ▴ 40
@osazuta-8625
Last seen 4.4 years ago
United States

I realize my comment was confusing and revised it as you answered -- all I meant is that the QL estimates inform which logFC are significant.

( All I mean with the log was that for NB counts c, var(c) ~ mu + d*mu^2, where mu  = average count over biological replicates, d dispersion. Then we can write var(log(cpm)) as approximatelyT/mu + B, where T is technical and B is biological variance. So, yes, the log overdamps the cpm to B for large mu. I think we're talking past each other. )

Anyway, I'm just stuck interpreting the plotBCV and plotQLDisp and the huge common and tagwise trends they display....i.e. my plotQLDisp with robust = T is a linear function ranging from 0.6 to 1.0 in deviance over 0 to 10 average log2cpm. So just to be concrete, here's what I'm doing:

d_QL <- estimateDisp(dge, design)

#robust = T
fit_QL <- glmQLFit(d_QL, design, robust = T)

#replace likelihood tests with quasy likelihood F-tests for coefficients in the linear model (so just like glmLRT)

res_QL <- glmQLFTest(fit_QL)

plotBCV(d_QL)

plotQLDisp(fit_QL)