Hi! I'm new to edgeR and to differential expression (DE) analysis in general, so I'm having some trouble designing the analysis for my single-cell RNA-seq data.
First of all, my input is a count matrix in a txt file with genes in rows, cells in columns, and expression levels as values (number of mapped reads). With this, I've performed a clustering analysis using Seurat, so I now have groups of cells that potentially belong to the same cell type. Moreover, my cells come from two samples from different conditions (WT and mutant). So, to sum up, I have something similar to this toy experimental design:
> exp.design
        cluster   condition
cell1   cluster1  WT
cell2   cluster1  WT
cell3   cluster1  mutant
cell4   cluster1  mutant
cell5   cluster2  WT
cell6   cluster2  WT
cell7   cluster2  mutant
cell8   cluster2  mutant
cell9   cluster3  WT
cell10  cluster3  WT
cell11  cluster3  mutant
cell12  cluster3  mutant
Now I would like to use edgeR (since it is highly recommended by the scientific community) to perform a DE analysis that answers the following questions:
- Which genes are differentially expressed for each cluster (cell type) independently of the condition? That is, which genes help differentiate each one of the clusters from the others? That would help me to find genes that characterize the cluster so I can infer the cell type and potentially identify marker genes.
- Within one cluster, which genes are differentially expressed between conditions? That would help to see if one condition affects the gene expression in one particular cell type.
Now that I've explained the context, a few questions have come to mind while designing the steps of the analysis:
1. With the experimental design provided above, each cell will be treated as a replicate. Is this okay? I've read in some blogs that a "pseudobulk approach" is advisable, that is, aggregating the counts from cells grouped in the same cluster. The argument is that cells are not true replicates when they come from the same biological sample (as is the case for cell1 and cell2, for example). Other sites, however, advise against this practice because it doesn't take advantage of the power of single-cell data. What are your opinions? What's the best way to proceed?
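To make sure I understand the pseudobulk idea, here is a minimal sketch in base R of what I think it means for the toy design above. The count matrix is simulated and the object names (`counts`, `group`, `pseudobulk`) are only placeholders. I am also assuming that counts are summed per cluster and condition, since each condition corresponds to one sample in my data.

```r
# Toy data: 4 genes x 12 cells, labelled like the toy experimental design.
# The counts are simulated and purely illustrative.
set.seed(1)
cells <- paste0("cell", 1:12)
exp.design <- data.frame(
  cluster   = rep(c("cluster1", "cluster2", "cluster3"), each = 4),
  condition = rep(rep(c("WT", "mutant"), each = 2), times = 3),
  row.names = cells
)
counts <- matrix(rpois(4 * 12, lambda = 5), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), cells))

# One label per cell combining cluster and condition (= sample here).
group <- paste(exp.design$cluster, exp.design$condition, sep = ".")

# rowsum() sums the rows of a matrix by group, so transpose to put cells in
# rows, aggregate, and transpose back to genes x groups.
pseudobulk <- t(rowsum(t(counts), group = group))

dim(pseudobulk)  # 4 genes x 6 cluster.condition columns
```

If I understand correctly, with only two samples this would leave me with a single pseudobulk column per cluster and condition, so there would be no replicates left within a cluster.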
And regarding the analysis with edgeR specifically:
2. What would be the best way to answer the first question? Is it okay to block on the "condition" factor, as explained in section 3.4.2 of the manual?
3. Also regarding the first question, what would be the best procedure to test one cluster against the others (once we control for the "condition" factor)?
   3.1. Do all the pairwise comparisons between clusters (e.g. cluster1 vs cluster2, and cluster1 vs cluster3) and keep the genes that are DE in all comparisons?
   3.2. Compare the expression of the desired cluster against the average of the rest? With:
> average.model <- makeContrasts(clust1 - (clust2 + clust3) / 2, levels = designmatrix)
> de.average <- glmLRT(fit, contrast = average.model)
4. To address the second question, would it be fine to combine cluster and condition into a single grouping factor and test the conditions against each other within each cluster? For example, clust1.WT vs clust1.mutant. Something like:
> design.matrix <- model.matrix(~0+cluster.condition)
> contrast.c1 <- makeContrasts(cluster1.WT - cluster1.mutant, levels = design.matrix)
> de.cluster1 <- glmLRT(fit, contrast = contrast.c1)
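To show more concretely what I have in mind for both designs, here is a minimal sketch using base R only, built on the 12 toy cells above. The contrasts are written by hand as numeric vectors instead of with makeContrasts, and the column names are ones I chose for illustration. As far as I understand, such a vector could be passed as the `contrast` argument of glmLRT, assuming `fit` was obtained with the matching design matrix.

```r
# Labels for the 12 toy cells (illustrative only).
cluster   <- factor(rep(c("cluster1", "cluster2", "cluster3"), each = 4))
condition <- factor(rep(rep(c("WT", "mutant"), each = 2), times = 3),
                    levels = c("WT", "mutant"))

## First question: clusters compared while blocking on condition.
# Additive design: one column per cluster plus one condition effect.
designmatrix <- model.matrix(~ 0 + cluster + condition)
# model.matrix() prefixes columns with the variable name
# ("clustercluster1", ..., "conditionmutant"), so give them short names.
colnames(designmatrix) <- c("clust1", "clust2", "clust3", "mutant")
# cluster1 versus the average of the other two clusters.
average.model <- c(clust1 = 1, clust2 = -0.5, clust3 = -0.5, mutant = 0)

## Second question: WT versus mutant within one cluster.
# One group per cluster/condition combination.
cluster.condition <- factor(paste(cluster, condition, sep = "."))
design.matrix <- model.matrix(~ 0 + cluster.condition)
# Rename the columns to the bare group labels used in the contrast.
colnames(design.matrix) <- levels(cluster.condition)
# +1 for cluster1.WT, -1 for cluster1.mutant, 0 for every other group.
contrast.c1 <- setNames(numeric(ncol(design.matrix)), colnames(design.matrix))
contrast.c1["cluster1.WT"] <- 1
contrast.c1["cluster1.mutant"] <- -1

average.model
contrast.c1
```

Renaming the columns matters because a contrast written as cluster1.WT - cluster1.mutant has to refer to names that actually exist in the design matrix.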
Sorry for the long questions, let me know if I didn't explain myself well at any point. Thanks in advance!