Question: modeling zero-dominated RNA-seq with voom/limma and hurdle models (pscl)
osa.zuta (United States), 2.2 years ago, wrote:

Hello, 

Two questions:

a) Can one reasonably use voom/limma to handle single-cell RNA-seq data, which are zero-dominated? Naive application seems to yield reasonable results.

b) It seems more appropriate to model the process as a hurdle model (a logit for the zero counts, a negative binomial for the positive counts) or a zero-inflated negative binomial process; however, the pscl package does not seem to handle a full count *matrix*. If anyone has a small working example of the pscl hurdle functions in this context, that would be great.
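For what it's worth, a minimal sketch of fitting pscl's hurdle model gene-by-gene over a count matrix might look like the following. The simulated counts, group factor, and variable names are placeholders for illustration, not a validated scRNA-seq workflow:

```r
library(pscl)

# Placeholder data: a gene-by-cell count matrix and a two-group factor.
set.seed(1)
counts <- matrix(rnbinom(2000, mu = 5, size = 0.2), nrow = 20)
group  <- factor(rep(c("A", "B"), each = 50))
offs   <- log(colSums(counts))  # library-size offset

# pscl::hurdle() models one response vector at a time, so loop over genes.
fits <- apply(counts, 1, function(y) {
    dat <- data.frame(y = y, group = group, offs = offs)
    tryCatch(
        hurdle(y ~ group + offset(offs), data = dat,
               dist = "negbin", zero.dist = "binomial"),
        error = function(e) NULL  # e.g. genes with no zeros or no positives
    )
})

# p-value for the group effect in the (truncated) count component
pvals <- vapply(fits, function(f) {
    if (is.null(f)) return(NA_real_)
    summary(f)$coefficients$count["groupB", "Pr(>|z|)"]
}, numeric(1))
```

Note the per-gene loop is slow for large matrices, and there is no information sharing across genes, unlike the empirical Bayes machinery in limma/edgeR.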

thanks to all,

zo


Have any of you looked at the prior weights when limma is applied to single-cell data? In other words, how different are the results if you just run a lognormal gene-specific model without shrinkage? Single-cell data have a larger sample size than bulk RNA-seq, so I wonder whether shrinkage becomes less necessary, or maybe even redundant, for single cell.

(reply by Nik Tuzov, 12 weeks ago)

I suggest you post a new question.

(reply by Aaron Lun, 12 weeks ago)
Aaron Lun (Cambridge, United Kingdom), 2.2 years ago, wrote:

The presence of many zero counts means that the normal model used by voom and limma will not be appropriate. In addition, the mean-variance trend tends to be rather strange when counts are low and discrete (e.g., a sharp dip in the variance at the left of the plot). This usually isn't a problem for routine applications of voom, because such counts are filtered out in bulk RNA-seq analyses. However, they can't be avoided for single-cell data. As a result, I don't think you'd get sensible precision weights for those observations.

Anecdotes from colleagues suggest that the quasi-likelihood (QL) framework in edgeR is more appropriate for single-cell RNA-seq data. The negative binomial (NB) distribution explicitly accounts for zero counts, while the QL framework handles the variability of single cell data. Try the glmQLFit and glmQLFTest functions and see if they work for you.
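A minimal sketch of that QL pipeline (the simulated counts and group factor below are stand-ins for real single-cell data):

```r
library(edgeR)

# Stand-in data: a gene-by-cell count matrix and a two-group factor.
set.seed(1)
counts <- matrix(rnbinom(20000, mu = 5, size = 0.5), nrow = 200)
group  <- factor(rep(c("A", "B"), each = 50))

dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge)
design <- model.matrix(~group)
dge <- estimateDisp(dge, design)

fit <- glmQLFit(dge, design, robust = TRUE)
res <- glmQLFTest(fit, coef = 2)  # test the group effect
topTags(res)
```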

As a side note, I've also tried to model some single-cell data with a zero-inflated NB distribution. In my experience, though, the estimated NB dispersion is so large that the simple NB distribution directly generates more than enough zeroes without the need for further inflation.
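To illustrate with a quick calculation (the mean and dispersion values here are made up, but of a magnitude plausible for scRNA-seq):

```r
# P(count = 0) under a plain NB is (1 + phi*mu)^(-1/phi),
# where phi is the NB dispersion (size = 1/phi in dnbinom).
mu  <- 5
phi <- 3
dnbinom(0, mu = mu, size = 1/phi)  # ~0.4, i.e. roughly 40% zeros with no inflation
```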

Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia), 2.2 years ago, wrote:

To reinforce Aaron's answer:

You might well get reasonable results from limma/voom with single-cell data, but I am uneasy about it. My lab uses edgeR quasi-likelihood for single-cell data, via the functions that Aaron mentions. This has a lot of similarities with limma, but has a more integrated treatment of zero counts.

Edit, 5 months later: we will be submitting our first single-cell RNA-seq article in a month or two, and we have used edgeR-glm rather than voom or edgeR-QL.


Could you link the paper please?

(reply by Nik Tuzov, 12 weeks ago)
osa.zuta (United States), 2.2 years ago, wrote:

Thanks a lot, Aaron and Gordon. I've actually tried the QL functions you cite and have found reasonable counterexamples to both the voom and QL DEG calls, i.e. cases where both the QL edgeR calls and the voom/limma calls are clearly wrong, but also very high overlap in DEG calls at most FDR cutoffs (though trending towards worse overlap at lower FDR cutoffs). This is puzzling. I agree that voom/limma can't reasonably be expected to model a nonlinear, bimodal expression process like that in the sc data, but I've even validated some logFC predictions made by voom/limma via qPCR.


Make sure you set robust=TRUE in glmQLFit. This ensures that genes with high variability between replicates will have high QL dispersions. By default, empirical Bayes shrinkage will drag the QL dispersions downwards towards the mean-dispersion trend, which is not appropriate for genes with highly variable counts.

(reply by Aaron Lun, 2.2 years ago)

Also, what's your definition of "clearly wrong"?

(reply by Aaron Lun, 2.2 years ago)

logFC verified or rejected by qPCR

(reply by osa.zuta, 2.2 years ago)

Hmm... well, the log-fold change calculation is pretty simple in both limma and edgeR (addition of a prior count aside), so there's not really a great deal that can go wrong. One possibility is that the discrepancy with the qPCR data is caused by the high technical variability of low scRNA-seq counts. This results in an imprecise estimate of the log-FC (which, to split some hairs, isn't technically an incorrect estimate). If such imprecision is the cause of the discrepancy, then I don't see how any calculation would be able to reliably yield a log-fold change estimate close to the true value. A better assessment of limma and edgeR would involve checking whether the methods control the error rate properly, i.e., whether the significant genes are actually DE. After all, that's what all the variance modelling is for.

(reply by Aaron Lun, 2.2 years ago)

Hi Aaron, 

I'm modifying an earlier query about the AveLogCPM reported by estimateDisp -- I was confused about how aveLogCPM and the lower-level mglmOneGroup function work and whether or not they are appropriate for scRNA data, which are zero-dominated. The computed average abundances for most genes in most cells are quite large (AveLogCPM > 10, etc.), which makes it tough to tell what is lowly expressed and what is highly expressed.

zo

(reply by osa.zuta, 2.2 years ago)

AveLogCPM doesn't have any particular problem with zeros. In any case, it is hard to think of any other measure that could be better.

I think you may have misunderstood. aveLogCPM computes an average value for each gene, so it does not give values "for most genes in most cells" as your question suggests.

I don't see any problem with most of the values being > 10, although that doesn't match sc data that I have seen.

ANYWAY: if you want to ask us so many questions about edgeR, could you please start a new question with an appropriate title instead of adding to this thread? This thread is about limma/voom.

(reply by Gordon Smyth, 2.2 years ago)

hi gordon, 

Thanks! I'll make a new question for edgeR, but I do have one more here: I'm trying to compare the aveLogCPM values computed with the QL DEG calls of edgeR against AveExpr (from topTable) from a naive application of voom on sc RNA data. As I said at the beginning, there is good overlap, a reasonable MA plot from plotMA, and reasonable DEG results from the voom/limma pipeline, which is counterintuitive. However, the AveExpr values are just the average of v$E across all samples for a given gene, which is roughly the average of log2(cpm) (modulo dge$samples$norm.factors), and so can be negative and is generally much smaller than aveLogCPM ~ log2(rowMeans(dge$counts)) for a given gene. This confused me, so I was asking which scale makes more sense in the sc context, since plotSmear, working off glmQLFTest objects, uses aveLogCPM values in its MA plot. What also confused me is which dispersion aveLogCPM should be working with in the sc context, since you'd think it should be high, in which case aveLogCPM would return roughly rowMeans(cpm(dge, log = TRUE)). I'd show you my

plot( voom_scRNA[idx,]$AveExpr, edgeR_scRNA[idx,]$logCPM) 

but I don't know how to attach it... it's concave in form.
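The concave gap between the two scales can be reproduced on toy data (the counts below are simulated, purely illustrative): aveLogCPM() roughly logs the average CPM, while AveExpr averages the per-sample log-CPMs, and by Jensen's inequality the mean of logs sits below the log of the mean, increasingly so for variable genes.

```r
library(edgeR)

# Simulated, highly dispersed counts (illustrative only)
set.seed(1)
y <- matrix(rnbinom(5000, mu = 5, size = 0.2), nrow = 100)
dge <- DGEList(counts = y)

log_of_mean  <- aveLogCPM(dge)                  # edgeR's per-gene abundance
mean_of_logs <- rowMeans(cpm(dge, log = TRUE))  # what voom's AveExpr resembles
summary(log_of_mean - mean_of_logs)  # differences are typically positive
```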

(reply by osa.zuta, 2.2 years ago)

I'm not convinced that you can so easily verify sc RNA-seq results by qPCR, since you can't qPCR the same cells.

qPCR also has its own inaccuracies.

(reply by Gordon Smyth, 2.2 years ago)
osa.zuta (United States), 2.2 years ago, wrote:

Yes, I had already set robust = TRUE in glmQLFit. In fact, I find it difficult to understand why this setting doesn't affect the MA plot all that much. Would it be too much trouble for you to post a typical MA plot you obtain from scRNA data so I can compare with mine? I'd attach mine but don't know how (new to this forum, with a bad connection).


Setting robust=TRUE shouldn't affect the MA plot at all, as it won't change the log-fold change or average abundances of the genes; this parameter only affects the QL dispersion estimates (specifically, the EB shrinkage thereof).

(reply by Aaron Lun, 2.2 years ago)
osa.zuta (United States), 2.2 years ago, wrote:

I guess I don't completely understand your comment, so I'll have to think about it some more. It's difficult to understand the regime of very high tagwise and common dispersion -- that's part of the problem.


I don't know why you're talking about the log-fold change for plotQLDisp which, as its name suggests, plots the QL dispersions, not the log-fold change. In any case, moving on; a decrease in the NB dispersion with increasing abundance is observed for most RNA-seq data analyses. This is because large counts tend to be more stable/precise (it's got nothing to do with logarithms, as we don't log-transform the counts). In most analyses, this trend is fully modelled by the mean-dispersion trend that is fitted in estimateDisp. As such, the QL dispersion trend from glmQLFit is expected to be flat and around unity.

(reply by Aaron Lun, 2.2 years ago)
osa.zuta (United States), 2.2 years ago, wrote:

I realize my comment was confusing and revised it as you answered -- all I meant is that the QL estimates inform which logFCs are significant.

(All I meant with the log is that for NB counts c, var(c) ~ mu + d*mu^2, where mu is the average count over biological replicates and d the dispersion. Then we can write var(log(cpm)) as approximately T/mu + B, where T is the technical and B the biological variance. So, yes, the log damps the cpm down to B for large mu. I think we're talking past each other.)
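For reference, the delta-method step behind this decomposition (using the thread's notation, with c a count of mean mu and NB dispersion d) is:

```latex
\mathrm{var}(c) = \mu + d\mu^{2}
\quad\Longrightarrow\quad
\mathrm{var}(\log c) \approx \left(\tfrac{1}{\mu}\right)^{2}\mathrm{var}(c)
 = \frac{1}{\mu} + d
```

so the 1/mu technical term vanishes for large mu, leaving the biological term d, matching the T/mu + B form.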

 

Anyway, I'm just stuck interpreting plotBCV and plotQLDisp and the huge common and tagwise trends they display, i.e. my plotQLDisp with robust = TRUE is a linear function ranging from 0.6 to 1.0 in deviance over 0 to 10 average log2 CPM. So, just to be concrete, here's what I'm doing:

 

d_QL <- estimateDisp(dge, design)

# robust = TRUE
fit_QL <- glmQLFit(d_QL, design, robust = TRUE)

# replace likelihood-ratio tests with quasi-likelihood F-tests for
# coefficients in the linear model (used just like glmLRT)
res_QL <- glmQLFTest(fit_QL)

plotBCV(d_QL)
plotQLDisp(fit_QL)
