Differential expression for single-cell RNA-seq data using edgeR
mbenegas
Hi! I'm new to edgeR and to differential expression analysis in general, so I'm having some trouble designing the analysis for my single-cell RNA-seq data.

First of all, my input is a count matrix in a txt file with genes in rows, cells in columns, and expression level as values (number of mapped reads). With this, I've performed a clustering analysis using Seurat so now I have groups of cells that potentially are of the same cell type. Moreover, I have cells coming from two samples from different conditions (WT and mutant). So, to sum up, I have something similar to this toy experimental design:

> exp.design
        cluster condition
cell1  cluster1        WT
cell2  cluster1        WT
cell3  cluster1    mutant
cell4  cluster1    mutant
cell5  cluster2        WT
cell6  cluster2        WT
cell7  cluster2    mutant
cell8  cluster2    mutant
cell9  cluster3        WT
cell10 cluster3        WT
cell11 cluster3    mutant
cell12 cluster3    mutant


And now I want to use edgeR (since it is highly recommended by the scientific community) to perform a DE analysis that answers the following questions:

• Which genes are differentially expressed for each cluster (cell type) independently of the condition? That is, which genes help differentiate each one of the clusters from the others? That would help me to find genes that characterize the cluster so I can infer the cell type and potentially identify marker genes.
• Within one cluster, which genes are differentially expressed between conditions? That would help to see if one condition affects the gene expression in one particular cell type.

Now that I've explained the context, a few questions have come to my mind when designing the steps for the analysis:

1. With the experimental design provided above, each cell will be treated as a replicate. Is this okay? I've read in some blogs that it's recommendable to perform a "pseudobulk approach", that is, to aggregate the counts coming from cells grouped in the same cluster. They say so because they consider that each cell is not truly a replicate since it comes from the same biological sample (it's the case of cell1 and cell2, for example). However, in other sites, they don't recommend this practice since it doesn't take advantage of the power of single-cell data. What are your opinions? What's the best way to proceed?

And regarding the analysis with edgeR specifically:

2. Which would be the best way to proceed to answer the first question? Is it okay to block the "condition" factor as it is explained in section 3.4.2 of the manual?

3. Regarding the first question as well, what would be the best procedure to test one cluster against the others (once we control the "condition" factor)?

3.1. Do all the pairwise comparisons between clusters (e.g. cluster1 vs cluster2, and cluster1 vs cluster3) and keep the DE genes in all comparisons?
3.2. Compare the expression of the desired cluster against the average of the rest? With:

average.model <- makeContrasts(clust1 - (clust2 + clust3) / 2, levels = designmatrix)
de.average <- glmLRT(fit, contrast = average.model)

4. To address the second question, it would be fine to combine cluster+condition into one group and perform DE for each cluster? For example, clust1.WT vs clust1.mutant. Something like:

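> design.matrix <- model.matrix(~0+cluster.condition)
> # Rename columns so they match the contrast names (assumes cluster.condition is a factor)
> colnames(design.matrix) <- levels(cluster.condition)
> contrast.c1 <- makeContrasts(cluster1.WT - cluster1.mutant, levels = design.matrix)
> de.cluster1 <- glmLRT(fit, contrast = contrast.c1)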


Sorry for the long questions, let me know if I didn't explain myself well at any point. Thanks in advance!

Yunshun Chen
I would recommend the pseudo-bulk approach for your first comparison (DE between clusters regardless of the condition). In my own experience, pseudo-bulk works better than single-cell-level DE analysis. Also, according to Crowell et al. (2020), pseudo-bulk methods generally outperform single-cell-level methods.
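As a rough sketch of what the aggregation step could look like, the snippet below sums counts over all cells that share a cluster and a sample, using only base R. The toy count matrix, the gene names, and the `pb_id` grouping label are made up for illustration; in this toy layout the condition doubles as the sample label because there is one sample per condition, whereas real data would normally group by biological sample and cluster.

```r
set.seed(42)

# Toy data mirroring the layout in the question (values are simulated)
cells <- paste0("cell", 1:12)
counts <- matrix(rpois(5 * 12, lambda = 5), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), cells))
exp.design <- data.frame(
  cluster   = rep(paste0("cluster", 1:3), each = 4),
  condition = rep(rep(c("WT", "mutant"), each = 2), times = 3),
  row.names = cells
)

# One pseudo-bulk label per cell (hypothetical name: pb_id)
pb_id <- paste(exp.design$cluster, exp.design$condition, sep = ".")

# rowsum() adds up rows by group, so transpose to put cells in rows,
# sum per group, then transpose back to genes x pseudo-bulk samples
pb <- t(rowsum(t(counts), group = pb_id))
pb
```

The resulting `pb` matrix has one column per cluster-by-sample combination and can be passed to `DGEList()` like an ordinary bulk count matrix.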

For your edgeR specific questions:

Which would be the best way to proceed to answer the first question? Is it okay to block the "condition" factor as it is explained in section 3.4.2 of the manual?

Yes, it is okay.
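A minimal sketch of such an additive design, assuming six pseudo-bulk samples (one per cluster and condition) and using base R's `model.matrix()`. The shortened column names are my own choice for readability, not something edgeR requires.

```r
# Hypothetical pseudo-bulk sample annotation: 3 clusters x 2 conditions
cluster   <- factor(rep(paste0("cluster", 1:3), each = 2))
condition <- factor(rep(c("WT", "mutant"), times = 3),
                    levels = c("WT", "mutant"))  # WT as reference level

# One coefficient per cluster plus one additive condition (blocking) effect
design <- model.matrix(~ 0 + cluster + condition)
colnames(design) <- c("clust1", "clust2", "clust3", "mutant")
design
```

With this layout, differences between the `clust*` coefficients compare clusters after adjusting for the condition effect.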

Regarding the first question as well, what would be the best procedure to test one cluster against the others (once we control the "condition" factor)?

That depends on what you are interested in. The pairwise comparison approach is more stringent, which ensures the DE genes are unique to the specific cluster. On the other hand, the "one-vs-average" approach is less conservative. If you are looking for cluster-specific marker genes, then the pairwise comparison is probably the one to go with.
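To make the two options concrete, here is a small sketch of how both kinds of contrast could be set up with `makeContrasts()`. The coefficient names (`clust1`, `clust2`, `clust3`, `mutant`) and the two gene lists are hypothetical placeholders, not real results.

```r
library(edgeR)  # attaches limma, which provides makeContrasts()

# Hypothetical coefficient names for a ~ 0 + cluster + condition design
coefs <- c("clust1", "clust2", "clust3", "mutant")

con <- makeContrasts(
  c1_vs_c2  = clust1 - clust2,                 # pairwise
  c1_vs_c3  = clust1 - clust3,                 # pairwise
  c1_vs_avg = clust1 - (clust2 + clust3) / 2,  # one-vs-average
  levels = coefs
)
con

# Stringent marker set: keep genes called DE in every pairwise comparison.
# These gene lists are made-up stand-ins for real test results.
de_c1_vs_c2 <- c("geneA", "geneB", "geneC")
de_c1_vs_c3 <- c("geneB", "geneC", "geneD")
markers_c1 <- Reduce(intersect, list(de_c1_vs_c2, de_c1_vs_c3))
markers_c1
```

Each column of `con` can then be passed as the `contrast` argument of `glmLRT()` or `glmQLFTest()`.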

To address the second question, it would be fine to combine cluster+condition into one group and perform DE for each cluster?

Yes, it would be fine.
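For several clusters, the per-cluster contrasts can be built in one go. This is only a sketch with a made-up sample layout; the object name `contrasts.all` is hypothetical, and the direction (WT minus mutant) follows the example in the question.

```r
library(edgeR)  # attaches limma, which provides makeContrasts()

# Hypothetical annotation: one row per cluster-by-condition group
cluster   <- rep(paste0("cluster", 1:3), each = 2)
condition <- rep(c("WT", "mutant"), times = 3)
cluster.condition <- factor(paste(cluster, condition, sep = "."))

design.matrix <- model.matrix(~ 0 + cluster.condition)
colnames(design.matrix) <- levels(cluster.condition)

# One WT-vs-mutant contrast per cluster, written as text expressions
clusters <- unique(cluster)
contrasts.all <- makeContrasts(
  contrasts = paste0(clusters, ".WT - ", clusters, ".mutant"),
  levels = design.matrix
)
colnames(contrasts.all) <- clusters
contrasts.all
```

Note that in this toy layout every group has a single row, so the design leaves no residual degrees of freedom; a real analysis would need replicate samples within each group to estimate variability.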

With the experimental design provided above, each cell will be treated as a replicate. Is this okay?

No. Or only if you're going to ignore the p-values and just use the gene ranking, because the p-values will be wildly anti-conservative, to the point of being meaningless.
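The effect can be illustrated with a toy simulation. This is not an edgeR analysis: it uses plain t-tests on normally distributed values, and the numbers (three samples per condition, 50 cells per sample, equal sample-level and cell-level variation) are assumptions chosen purely for illustration. There is no true condition effect, so every p-value below 0.05 is a false positive.

```r
set.seed(1)
n_sim     <- 200  # number of simulated "genes"
n_samples <- 3    # biological samples per condition (assumed)
n_cells   <- 50   # cells per sample (assumed)

sim_once <- function() {
  # One random shift per biological sample, shared by all of its cells
  sample_effect <- rnorm(2 * n_samples, sd = 1)
  sample_id <- rep(seq_len(2 * n_samples), each = n_cells)
  condition <- rep(c("WT", "mutant"), each = n_samples * n_cells)
  y <- sample_effect[sample_id] + rnorm(length(sample_id), sd = 1)

  # Cells as replicates: ignores that cells from one sample are correlated
  p_cell <- t.test(y[condition == "WT"], y[condition == "mutant"])$p.value

  # Pseudo-bulk analogue: one averaged value per sample, samples as replicates
  means <- tapply(y, sample_id, mean)
  p_pb <- t.test(means[1:n_samples], means[n_samples + 1:n_samples])$p.value

  c(cell = p_cell, pseudobulk = p_pb)
}

p <- replicate(n_sim, sim_once())
fpr <- rowMeans(p < 0.05)  # false positive rate at the nominal 5% level
fpr
```

Under these assumptions the cell-level false positive rate should come out far above the nominal 5%, while the sample-level test stays close to it.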

I've read in some blogs that it's recommendable to perform a "pseudobulk approach", that is, to aggregate the counts coming from cells grouped in the same cluster. They say so because they consider that each cell is not truly a replicate since it comes from the same biological sample (it's the case of cell1 and cell2, for example).

Yes. While a few methods have been suggested, pseudobulk is currently the only easy-to-use and reliable approach that evaluates DE relative to biological variation between samples. Yunshun and I used pseudobulk in our own paper:

Pal B, Chen Y, Vaillant F, Capaldo BD, Joyce R, Song X, Bryant VL, Penington JS, Di Stefano L, Ribera NT, Wilcox S, Mann GB, kConFab, Papenfuss AT, Lindeman GJ, Smyth GK, Visvader JE (2021). A single cell RNA atlas of human breast spanning normal, preneoplastic and tumorigenic states. EMBO Journal 40(11), e107333. https://doi.org/10.15252/embj.2020107333

However, in other sites, they don't recommend this practice since it doesn't take advantage of the power of single-cell data.

I am not aware of any sites that say that, including the site you link to. On the contrary, the site you link to also recommends pseudo-bulk in the context you are considering.