Question

differential expression for Single-cell RNAseq data using edgeR

1

Entering edit mode

mbenegas ▴ 10

@1deca0b7

Last seen 4.4 years ago

Spain

Hi! I'm new with edgeR and Differential Expression analysis in general so I'm having some trouble designing the analysis for my single-cell RNAseq data.

First of all, my input is a count matrix in a txt file with genes in rows, cells in columns, and expression level as values (number of mapped reads). With this, I've performed a clustering analysis using Seurat so now I have groups of cells that potentially are of the same cell type. Moreover, I have cells coming from two samples from different conditions (WT and mutant). So, to sum up, I have something similar to this toy experimental design:

> exp.design
        cluster condition
cell1  cluster1        WT
cell2  cluster1        WT
cell3  cluster1    mutant
cell4  cluster1    mutant
cell5  cluster2        WT
cell6  cluster2        WT
cell7  cluster2    mutant
cell8  cluster2    mutant
cell9  cluster3        WT
cell10 cluster3        WT
cell11 cluster3    mutant
cell12 cluster3    mutant

And now I wanted to use edgeR (since it is highly recommended by the scientific community) to perform a DE to answer the following questions:

Which genes are differentially expressed for each cluster (cell type) independently of the condition? That is, which genes help differentiate each one of the clusters from the others? That would help me to find genes that characterize the cluster so I can infer the cell type and potentially identify marker genes.
- Within one cluster, which genes are differentially expressed between conditions? That would help to see if one condition affects the gene expression in one particular cell type.

Now that I've explained the context, a few questions have come to my mind when designing the steps for the analysis:

With the experimental design provided above, each cell will be treated as a replicate. Is this okay? I've read in some blogs that it's recommendable to perform a "pseudobulk approach", that is, to aggregate the counts coming from cells grouped in the same cluster. They say so because they consider that each cell is not truly a replicate since it comes from the same biological sample (it's the case of cell1 and cell2, for example). However, in other sites, they don't recommend this practice since it doesn't take advantage of the power of single-cell data. Which are your opinions? what's the best way to proceed?

And regarding the analysis with edgeR specifically:

Which would be the best way to proceed to answer the first question? Is it okay to block the "condition" factor as it is explained in section 3.4.2 of the manual?
Regarding the first question as well, what would be the best procedure to test one cluster against the others (once we control the "condition" factor)?

3.1. Do all the pairwise comparisons between clusters (e.g. cluster1 vs cluster2, and cluster1 vs cluster3) and keep the DE genes in all comparisons? 3.2. Compare the expression of the desired cluster against the average of the rest? With:

average.model<- makeContrasts( clust1 - (clust2 + clust3) / 2 , levels = designmatrix)
de.average<- glmLRT(fit, contrast = average model)

To address the second question, it would be fine to combine cluster+condition into one group and perform DE for each cluster? For example, clust1.WT vs clust1.mutant. Something like:

enter image description here

> design.matrix <- model.matrix(~0+cluster.condition)
> contrast.c1 <- makeContrasts(cluster1.WT - cluster1.mutant, levels = design.matrix)
> de.cluster1 <- glmLRT(fit, contrast = contrast.c1)

Sorry for the long questions, let me know if I didn't explain myself well at any point. Thanks in advance!

edgeR single scRNAseq • 6.4k views

ADD COMMENT • link 4.5 years ago • updated 4.4 years ago mbenegas ▴ 10

score 1 · Answer 1 · 2021-07-22

1

Entering edit mode

Yunshun Chen ▴ 900

@yunshun-chen-5451

Last seen 14 months ago

Australia

I would recommend the pseudo-bulk approach for your first comparison (DE between clusters regardless of the condition). I found pseudo-bulk work better than single-cell level DE analysis from my own experience. Also according to Crowell et al. 2020, pseudo-bulk methods outperform single-cell level methods in general.

For your edgeR specific questions:

Which would be the best way to proceed to answer the first question? Is it okay to block the "condition" factor as it is explained in section 3.4.2 of the manual?

Yes, it is okay.

Regarding the first question as well, what would be the best procedure to test one cluster against the others (once we control the "condition" factor)?

That depends on what you are interested in. The pairwise comparison approach is more stringent, which ensures the DE genes are unique for the specific cluster. On the other hand, the "one-vs-average" approach is less conservative. If you are looking for cluster specific marker genes, then the pairwise comparison is probably the one to go with.

To address the second question, it would be fine to combine cluster+condition into one group and perform DE for each cluster?

Yes, it would be fine.

ADD COMMENT • link 4.5 years ago Yunshun Chen ▴ 900

0

Entering edit mode

Thank you very much for your answer and thanks to Gordon Smyth too! I was going further with the analysis and now I'm stuck again in how to block the "condition" factor. Sorry for the bother, but I don't fully understand how the model.matrix() works.

As I've mentioned in my post, what I want is to know which genes are differentially expressed for each cluster (cell type) independently of the condition. If I go as the manual says I obtain a matrix like this:

clusters <- factor(exp.design$clusters)
conditions <- factor(exp.design$condition)
model.design <- model.matrix(~conditions+clusters)
head(model.design)
  (Intercept) conditionsmutant clusterscluster_10 clusterscluster_2 clusterscluster_3 clusterscluster_4 clusterscluster_5 clusterscluster_6 clusterscluster_7
1           1                0                  0                 0                 0                 0                 0                 0                 0
2           1                0                  0                 0                 0                 0                 0                 0                 1
3           1                0                  0                 0                 0                 0                 0                 0                 0
4           1                0                  0                 0                 0                 0                 0                 0                 0
5           1                0                  0                 0                 1                 0                 0                 0                 0
6           1                0                  0                 0                 0                 0                 0                 1                 0
  clusterscluster_8 clusterscluster_9
1                 0                 0
2                 0                 0
3                 0                 0
4                 0                 0
5                 0                 0
6                 0                 0

As far as I've understood since there's an "Intercept" and the columns "conditionsWT" and "clusterscluster_1" don't appear, the condition "WT" (wild-type) and the cluster "cluster_1" are treated as a reference. I'm fine with the wild-type condition being treated as a reference. What I don't want is to treat "cluster_1" as such, since there's no prior reason to do that.

Eventually, I would like to compare each cluster against the other but handling the "condition" effect somehow in the model matrix. I would like to do something like this:

  y <- estimateDisp(y, design=model.design)
  fit <- glmQLFit(y, model.design)
  clusters <- y$samples$group
  for(clustA in levels(clusters)){
    for (clustB in levels(clusters)){
      if (clustB == clustA) { next }
      print(paste("Computing ", clustA, " vs ", clustB, " ...", sep = ""))
      difference <- paste(clustA, "-", clustB, sep = "")
      contrast <-  makeContrasts(contrasts = difference, levels = model.design)
      de <- glmQLFTest(fit, contrast = contrast)
      topGenes <- topTags(de)
      file.name <- paste("~/DE_test/output/", clustA, "_vs_", clustB, ".txt", sep = "")
      write.table(topGenes, file = file.name, quote=FALSE, sep="\t")
    }                 
  }

How should I build the model.matrix()? Is model.design <- model.matrix(~0+clusters+conditions) correct? That seems to do the trick:

model.design <- model.matrix(~0+clusters+conditions)
head(model.design)
  clusterscluster_1 clusterscluster_10 clusterscluster_2 clusterscluster_3 clusterscluster_4 clusterscluster_5 clusterscluster_6 clusterscluster_7
1                 1                  0                 0                 0                 0                 0                 0                 0
2                 0                  0                 0                 0                 0                 0                 0                 1
3                 1                  0                 0                 0                 0                 0                 0                 0
4                 1                  0                 0                 0                 0                 0                 0                 0
5                 0                  0                 0                 1                 0                 0                 0                 0
6                 0                  0                 0                 0                 0                 0                 1                 0
  clusterscluster_8 clusterscluster_9 conditionsmutant
1                 0                 0                0
2                 0                 0                0
3                 0                 0                0
4                 0                 0                0
5                 0                 0                0
6                 0                 0                0

Any help will be appreciated,

Marta.

ADD REPLY • link 4.4 years ago mbenegas ▴ 10

0

Entering edit mode

Hi all! In case someone gets until here with the same doubts as me, I have encountered this amazing, helpful guide: "A guide to creating design matrices for gene expression experiments". You can jump directly to the section of interest, but I really encourage people who are new to this type of analysis to read it entirely.

Now I will try to answer my own question on how to build the design matrix in my particular case. When the explanatory variable is categorical, the model matrix is equivalent with and without the intercept term. See section 4.4 of the above-mentioned guide for more details. So I shouldn't worry too much about putting the ~0 and treating a cluster or condition as a "reference". However, I prefer to put it for clarity.

Moreover, it is the same to do model.matrix(clusters+conditions) as model.matrix(conditions+clusters). It doesn't matter the order, both models are adding the effects of the two variables. The only difference is that columns in the model matrix will be ordered differently. So what we should care about is to indicate correctly the indices in the coef argument in the glmQLFTest() and glmLRT() functions. Another way to do it avoiding the indices is by using the makeContrasts() function. See section 7.3 of the guide to learn more about accounting for factors that are not of interest.

To sum up, I ended building this matrix and performing the DE like this:

> clusters <- factor(exp.design$clusters)
> conditions <- factor(exp.design$condition)
> conditions <- relevel(conditions, ref="WT")
> model.design <- model.matrix(~0+clusters+conditions)
> colnames(model.design) <- c(levels(clusters),levels(conditions)[2:length(levels(conditions))])
> head(model.design)
  cluster_1 cluster_10 cluster_2 cluster_3 cluster_4 cluster_5 cluster_6 cluster_7 cluster_8 cluster_9 mutant
1         1          0         0         0         0         0         0         0         0         0      0
2         0          0         0         0         0         0         0         1         0         0      0
3         1          0         0         0         0         0         0         0         0         0      0
4         1          0         0         0         0         0         0         0         0         0      0
5         0          0         0         1         0         0         0         0         0         0      0
6         0          0         0         0         0         0         1         0         0         0      0
> y <- estimateDisp(y, design=model.design)
> fit <- glmQLFit(y, model.design)
> clusters <- y$samples$group
> for(clustA in levels(clusters)){
+   for (clustB in levels(clusters)){
+     if (clustB == clustA) { next }
+     print(paste("Computing ", clustA, " vs ", clustB, " ...", sep = ""))
+     difference <- paste(clustA, "-", clustB, sep = "")
+     contrast <-  makeContrasts(contrasts = difference, levels = model.design)
+     de <- glmQLFTest(fit, contrast = contrast)
+     topGenes <- topTags(de)
+     file.name <- paste("~/Downloads/DE_test/output/", clustA, "_vs_", clustB, ".txt", sep = "")
+     write.table(topGenes, file = file.name, quote=FALSE, sep="\t")
+   }                 
+ }

This is my solution so far! If anyone finds an error, I would be very grateful if you could comment on it.

ADD REPLY • link 4.4 years ago mbenegas ▴ 10

score 1 · Answer 2 · 2021-07-24

With the experimental design provided above, each cell will be treated as a replicate. Is this okay?

No. Or only if you're going to ignore the p-values and just use the gene ranking, because the p-values will be wildly anti-conservative, so the point of being meaningless.

I've read in some blogs that it's recommendable to perform a "pseudobulk approach", that is, to aggregate the counts coming from cells grouped in the same cluster. They say so because they consider that each cell is not truly a replicate since it comes from the same biological sample (it's the case of cell1 and cell2, for example).

Yes, while a few methods have been suggested, pseudobulk is the only easy to use and reliable approach at the moment that evaluates DE relative to biological variation between samples. Yunshun and I used pseudobulk in our own paper:

Pal B, Chen Y, Vaillant F, Capaldo BD, Joyce R, Song X, Bryant VL, Penington JS, Di Stefano L, Ribera NT, Wilcox S, Mann GB, kConFab, Papenfuss AT, Lindeman GJ, Smyth GK, Visvader JE (2021). A single cell RNA atlas of human breast spanning normal, preneoplastic and tumorigenic states. EMBO Journal 40(11), e3107333. https://doi.org/10.15252/embj.2020107333

However, in other sites, they don't recommend this practice since it doesn't take advantage of the power of single-cell data.

I am not aware of any sites that say that, including the site you link to. On the contrary, the site you link to also recommends pseudo-bulk in the context you are considering.