Hi, wondering if anyone has thought about the best way to do single cell DE testing including subject as a random effect in a multilevel design including many subjects contrasting timpoint and disease state effects within a single "cluster"

For example, consider data similar to the design in the Limma guide section 9.7, except observations are cells, there are n cells per timepoint for each subject. I'm only looking at cells from a single cell type ("cluster" if you like), within that cluster I have many subjects with 2 timepoints and 2 states, say, diseased and healthy.

If I convert my SingleCellExperiment object into a DGElist, using `y = convertTo(sce, type = "edgeR")`

I can test the contrast of e.g. the delta between timepoint 1 and 2 in diseased vs the delta between timepoint 1 and 2 in healthy subjects *without including the intra-subject correlation in the model as a random effect* e.g.

```
subject = c("sub1", "sub1", "sub2", "sub2", "sub3", "sub3",
"sub4", "sub4", "sub5", "sub5", "sub6", "sub6")
condition = c(rep("diseased", 6), rep("normal", 6))
timepoint = c(rep(c("1","2"),6))
model = cbind(subject,condition,timepoint) %>%
as.data.frame() %>%
mutate_if(is.character, as.factor) %>%
unite("condition_time", c("condition", "timepoint"))
ct.matrix = model.matrix(~0 + model$condition_time + model$subject)
y = convertTo(sce, type = "edgeR")
y$samples = cbind(y$samples, model)
v <- voom(y, ct.matrix)
fit <- lmFit(v, ct.matrix)
contrast_matrix =
makeContrasts(t1_disease_vs_normal =
(condition_timediseased_2 - condition_timediseased_1) -condition_timenormal_2 - condition_timenormal_1))
c_fit = contrasts.fit(fit = fit, contrasts = contrast_matrix)
eb_c_fit = eBayes(c_fit)
```

In this case though, the sce object has many replicates for each subject-timepoint combination where the replicates are cells of a single cell type.

Question 1
It seems that it's not possible to add subject as a random effect in edgeR https://support.bioconductor.org/p/87292/

So I would have to embed single cell data into an elist object...and not use voom / edgeR in order to accomodate subject as a random effect?
For example I would have to create an list "y" with single cells as observations and use :

```
corfit = duplicateCorrelation(y, response_time_matrix, block=design$subject)
corfit$consensus
```

I'm not sure this is advisable as I'd like to include the size factors calculated by scran as normalization factors for use in voom. One alternate idea is to summarize the scran normalized library by averaging each subject's cells at a given timepoint. Thus you would get a "meta-cell" from cluster k, subject j, timepoint t, of size n after normalizing out cell-specific differences like cell size, capture efficiency differences between cells with scran... this is a slightly different philosophy than using a "pseudobulk" library pool (as in section 6 of the DE testing vignette in the SimpleSingleCell workflow) of all cells as it avoids overrepresentation biases from e.g. large cells in the sample pool.

Question 2 Even if I did this, is there a way I could use such averaged summarized data into a linear model with random effects in limma by embedding the data into a elist object for limma?

This level of model complexity is uncommon in single cell analysis but I thought it would be interesting to look at.

thanks for reading.

-matt

Hi Aaron, thanks very much for your reply.

I would like to clarify one misunderstanding I had about the de testing ecosystem in bioconductor to help other users: In order to use limma with count data, one can create a DGElist object from the edgeR package, which uses limma as a dependency. You can convert a singleCellExperiment object into a DGElist with the

`convertTo`

function from scran. 3 common options for DE testing are limma-voom, DESeq and edgeR. One can implement limma-voom on an edgeR object; rather than using a NB model for the count data like DESeq and edgeR, voom estimates the mean variance relationship on log2 CPM transformed data calculated from normalization factors (e.g. size factors from scran if using single cell data).To implement a random effects model with multilevel experiment designs in limma, one must use the duplicatecorrelation function (see p 49 of limma manual). The duplicatecorrelation function does not run directly on the edgeR object. Reading this post: https://support.bioconductor.org/p/87292/ I thought you could not run random effect models with the count data in limma, but I misunderstood, it's not possible to use duplicatecorrelation

within the edgeR NB model fitting framework. This is explained well here: https://support.bioconductor.org/p/60460/One

canadd random effects within the voom model fitting framework by running`duplicateCorrelation`

on the object returned by`voom`

To my knowledge, this would be the only way to include random effects in single cell de testing. The complexity actually arises in when how many times you should implement duplicatecorrelation which is explained well here: https://support.bioconductor.org/p/59700/For a comparison or performance of different DE strategies I refer other users to this nice mark Robinson paper https://www.nature.com/articles/nmeth.4612 limma-voom seemed to perform well on single cell data especially after appropriate gene filtering and I am seeing the same thing with my data. Heres a quick snippet to only include genes with a count of 1 in at least 10% of cells in a log normalized matrix "lncount" similar to what they did in the paper.

Thanks for the reference to your paper. I will need more time to digest the supplementary material, but I think my experiment is slightly different in that individual variation which in your paper is analogous to the plate effect is (possibly) larger than the group effect being tested; I'm not sure we can get a feel for type I error in our system without seeing how the biological heterogeneity between individuals impacts the model by including it in some test cases. To your point though, I see what you mean about the time needed to run

`duplicateCorrelation`

. Thanks for the thoughts on the averaging vs summation approaches. My colleagues in the lab have had success using an average approach with single cell qPCR data see here:https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005016 https://www.cell.com/fulltext/S2405-4712(17)30087-X

Your point about count data precision increasing as a function of number of cells summed is nicely illustrated in the paper you sent. As you mentioned I think whether or not to have equal contribution of each cell by averaging will depend on the biological question at hand, with summation of bulk counts into a pseudo library being more analagous to an actual RNAseq library of the pooled cells. thanks again -matt