Question: single cell data: adding random effects for subject with multilevel experiment design
1
6 months ago by
University of Cambridge, UK | NIH Bethesda, MD
Matt Mulé0 wrote:

Hi, wondering if anyone has thought about the best way to do single cell DE testing including subject as a random effect in a multilevel design including many subjects contrasting timpoint and disease state effects within a single "cluster"

For example, consider data similar to the design in the Limma guide section 9.7, except observations are cells, there are n cells per timepoint for each subject. I'm only looking at cells from a single cell type ("cluster" if you like), within that cluster I have many subjects with 2 timepoints and 2 states, say, diseased and healthy.

If I convert my SingleCellExperiment object into a DGElist, using y = convertTo(sce, type = "edgeR") I can test the contrast of e.g. the delta between timepoint 1 and 2 in diseased vs the delta between timepoint 1 and 2 in healthy subjects without including the intra-subject correlation in the model as a random effect e.g.

subject = c("sub1", "sub1", "sub2", "sub2", "sub3", "sub3",
"sub4", "sub4", "sub5", "sub5", "sub6", "sub6")
condition = c(rep("diseased", 6), rep("normal", 6))
timepoint = c(rep(c("1","2"),6))

model = cbind(subject,condition,timepoint) %>%
as.data.frame() %>%
mutate_if(is.character, as.factor) %>%
unite("condition_time", c("condition", "timepoint"))

ct.matrix = model.matrix(~0 + model$condition_time + model$subject)

y = convertTo(sce, type = "edgeR")
y$samples = cbind(y$samples, model)

v <- voom(y, ct.matrix)
fit <- lmFit(v, ct.matrix)

contrast_matrix  =
makeContrasts(t1_disease_vs_normal =
(condition_timediseased_2 - condition_timediseased_1) -condition_timenormal_2 - condition_timenormal_1))

c_fit = contrasts.fit(fit = fit, contrasts = contrast_matrix)
eb_c_fit = eBayes(c_fit)


In this case though, the sce object has many replicates for each subject-timepoint combination where the replicates are cells of a single cell type.

Question 1 It seems that it's not possible to add subject as a random effect in edgeR https://support.bioconductor.org/p/87292/
So I would have to embed single cell data into an elist object...and not use voom / edgeR in order to accomodate subject as a random effect? For example I would have to create an list "y" with single cells as observations and use :

corfit = duplicateCorrelation(y, response_time_matrix, block=design$subject) corfit$consensus


I'm not sure this is advisable as I'd like to include the size factors calculated by scran as normalization factors for use in voom. One alternate idea is to summarize the scran normalized library by averaging each subject's cells at a given timepoint. Thus you would get a "meta-cell" from cluster k, subject j, timepoint t, of size n after normalizing out cell-specific differences like cell size, capture efficiency differences between cells with scran... this is a slightly different philosophy than using a "pseudobulk" library pool (as in section 6 of the DE testing vignette in the SimpleSingleCell workflow) of all cells as it avoids overrepresentation biases from e.g. large cells in the sample pool.

Question 2 Even if I did this, is there a way I could use such averaged summarized data into a linear model with random effects in limma by embedding the data into a elist object for limma?

This level of model complexity is uncommon in single cell analysis but I thought it would be interesting to look at.

-matt

modified 6 months ago by Aaron Lun25k • written 6 months ago by Matt Mulé0
Answer: single cell data: adding random effects for subject with multilevel experiment d
2
6 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Question 1: With respect to duplicateCorrelation, we explored this approach (among others) some time ago. I would say that it wasn't worth it. Running duplicateCorrelation() across hundreds of samples takes a long time without providing clear benefits with respect to type I error control.

With respect to making your suggested "meta-cells": it is debatable whether or not you want to normalize before summing. On one hand, it ensures that each cell contributes equally to the resulting meta-cell; but on the other hand, large cells already contribute more RNA in real bulk libraries, and that doesn't cause any real dramas in interpretation. I can't say one is better than the other, only that they are different. What I will say, though, is that summing normalized counts has less predictable statistical properties than summing the raw counts. The latter sum still is a count, has a lower bound of Poisson variance, and can be reasonably modeled with a NB distribution; while anything goes with the former.

For the record, there is a third option of weighting the pseudo-bulk samples by the number of cells used in its summation. These weights can, in theory, be used in edgeR's GLM fitting to account for differences in precision for each pseudo-bulk sample. However, weighting assumes that all cells were i.i.d. NB distributed within each group, which is obviously untrue. As the number of cells increase, the inter-sample variation dominates over the imprecision of each sample, and the violations of the i.i.d. assumption become more obvious.

Question 2: Well, you could just pretend it's a matrix from bulk RNA-seq, and get limma to treat it as such. Don't take this as a recommendation, though.

I would like to clarify one misunderstanding I had about the de testing ecosystem in bioconductor to help other users: In order to use limma with count data, one can create a DGElist object from the edgeR package, which uses limma as a dependency. You can convert a singleCellExperiment object into a DGElist with the convertTo function from scran. 3 common options for DE testing are limma-voom, DESeq and edgeR. One can implement limma-voom on an edgeR object; rather than using a NB model for the count data like DESeq and edgeR, voom estimates the mean variance relationship on log2 CPM transformed data calculated from normalization factors (e.g. size factors from scran if using single cell data).

To implement a random effects model with multilevel experiment designs in limma, one must use the duplicatecorrelation function (see p 49 of limma manual). The duplicatecorrelation function does not run directly on the edgeR object. Reading this post: https://support.bioconductor.org/p/87292/ I thought you could not run random effect models with the count data in limma, but I misunderstood, it's not possible to use duplicatecorrelation within the edgeR NB model fitting framework. This is explained well here: https://support.bioconductor.org/p/60460/

One can add random effects within the voom model fitting framework by running duplicateCorrelation on the object returned by voom To my knowledge, this would be the only way to include random effects in single cell de testing. The complexity actually arises in when how many times you should implement duplicatecorrelation which is explained well here: https://support.bioconductor.org/p/59700/

For a comparison or performance of different DE strategies I refer other users to this nice mark Robinson paper https://www.nature.com/articles/nmeth.4612 limma-voom seemed to perform well on single cell data especially after appropriate gene filtering and I am seeing the same thing with my data. Heres a quick snippet to only include genes with a count of 1 in at least 10% of cells in a log normalized matrix "lncount" similar to what they did in the paper.

genes_keep = apply(lncount, 1, function(x) {
length(which(x >= 1)) / length(x)})
genes_use = genes_keep[genes_keep > 0.1] %>% names


Thanks for the reference to your paper. I will need more time to digest the supplementary material, but I think my experiment is slightly different in that individual variation which in your paper is analogous to the plate effect is (possibly) larger than the group effect being tested; I'm not sure we can get a feel for type I error in our system without seeing how the biological heterogeneity between individuals impacts the model by including it in some test cases. To your point though, I see what you mean about the time needed to run duplicateCorrelation. Thanks for the thoughts on the averaging vs summation approaches. My colleagues in the lab have had success using an average approach with single cell qPCR data see here:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005016 https://www.cell.com/fulltext/S2405-4712(17)30087-X

Your point about count data precision increasing as a function of number of cells summed is nicely illustrated in the paper you sent. As you mentioned I think whether or not to have equal contribution of each cell by averaging will depend on the biological question at hand, with summation of bulk counts into a pseudo library being more analagous to an actual RNAseq library of the pooled cells. thanks again -matt