Question

Differential Expression analysis between distinct scRNA-seq datasets

d.depledge • 0

@ddepledge-11508

I have two scRNA-seq datasets, both derived from a heterogeneous population of cells (comprising similar numbers of cell types and subpopulations). One sample has been treated with a drug, the other has not. While I can separate out the cellular subpopulations in each dataset, what I want is to then identify differentially expressed genes that result from the drug treatment - in each of the individual cellular subpopulations.

In theory I can do this by extracting average transcript counts for each gene and then using a program such as GFOLD to run a differential expression analysis, treating each dataset as a bulk RNA-seq experiment. However, this doesn't seem very elegant and, more importantly, does not take into account the distribution of transcript counts for a given gene in a given subpopulation.
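To make that concern concrete, here is a minimal Python sketch using made-up counts (not real data). The two lists are hypothetical per-cell counts for one gene in one subpopulation; they share the same average, so an average-based comparison would see no difference even though the distributions are clearly not alike.

```python
from statistics import mean, pvariance

# Hypothetical per-cell counts for one gene in one subpopulation (toy values).
untreated = [2, 2, 2, 2, 2, 2, 2, 2]   # expressed at a low level in every cell
treated = [0, 0, 0, 0, 0, 0, 0, 16]    # silent in most cells, high in one

mean_untreated = mean(untreated)
mean_treated = mean(treated)

# The population variance captures what the averages hide.
var_untreated = pvariance(untreated)
var_treated = pvariance(treated)

print("means:", mean_untreated, mean_treated)
print("variances:", var_untreated, var_treated)
```

Collapsing each subpopulation to a single average per gene discards exactly this kind of information.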

Are there any packages out there designed for these types of analyses? Does anyone have any thoughts on how I might approach this otherwise?

Thanks

scrna scrnaseq differential gene expression • 2.4k views

updated 8.3 years ago by davide risso ▴ 980 • written 8.3 years ago by d.depledge • 0

score 1 · Answer 1 · 2017-11-02

I presume that since you are talking about similar numbers of cell types and subpopulations, you have already run some kind of cluster analysis that gave you cluster labels for each of the cells.

If that's the case, I would use a differential expression method (for instance MAST) to compare the two populations conditional on the cell type. You can do this by specifying the right design matrix, possibly with cell-type / treatment interactions.

The edgeR and limma user guides are possibly the best places to start to learn how to specify the design matrix and the right "contrasts" that you need for the test.
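As a rough illustration of what "the right design matrix and contrasts" means here, the following Python sketch builds a group-means design with one column per cell-type / treatment combination, plus a contrast for "treated minus untreated" within each cell type. It only shows the structure; in practice you would build this in R (for example with `model.matrix` and limma's `makeContrasts`). The cluster labels `A` and `B` and the helper names `design_row` and `treatment_contrast` are hypothetical.

```python
from itertools import product

cell_types = ["A", "B"]                  # hypothetical cluster labels
treatments = ["untreated", "treated"]

# One column per (cell type, treatment) group: a "group-means" design.
groups = [f"{c}.{t}" for c, t in product(cell_types, treatments)]

def design_row(cell_type, treatment):
    """Indicator row for one cell: 1 in the column of its group, 0 elsewhere."""
    label = f"{cell_type}.{treatment}"
    return [1 if g == label else 0 for g in groups]

def treatment_contrast(cell_type):
    """Contrast vector for treated minus untreated within one cell type."""
    return [
        1 if g == f"{cell_type}.treated"
        else -1 if g == f"{cell_type}.untreated"
        else 0
        for g in groups
    ]

# Four toy cells, one per group.
cells = [("A", "untreated"), ("A", "treated"),
         ("B", "untreated"), ("B", "treated")]
design = [design_row(c, t) for c, t in cells]

contrast_A = treatment_contrast("A")
contrast_B = treatment_contrast("B")

print(groups)
print(contrast_A)
print(contrast_B)
```

Each contrast answers the question "does the drug change expression in this cell type?", which is equivalent to testing a combination of main and interaction terms in a model parameterised with cell-type / treatment interactions.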

Obviously, there are many different methods for single-cell differential expression. MAST is one of them. Someone else with more direct experience can comment on their relative performance.

However, there are two very important caveats to consider.

1. You are using the data twice: the cluster labels are data-driven, but then included in the model as if they were known. This means that the p-values that you obtain from this analysis are not valid (but you can still use the ranking of the genes for exploratory / hypothesis generating purposes).
2. Most importantly, it seems like you potentially have a completely confounded design! If, as it seems from your description, all the untreated cells were harvested in one batch and all the treated cells in a different batch, you will never be able to tell if the effects that you observe are due to the drug or to batch effects.
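The "using the data twice" problem can be seen in its simplest form with a toy simulation. This is only a sketch under strong assumptions: a single simulated gene, Gaussian noise, and a median split standing in for clustering, which is not the exact cell-type-by-treatment setup in the question. All cells come from one homogeneous population, yet testing between labels derived from the same data gives a very large test statistic.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)

# Null data: one "gene" in 200 cells drawn from a single homogeneous population.
expression = [random.gauss(0.0, 1.0) for _ in range(200)]

# "Cluster" the cells using the same gene: split at the median.
cutoff = sorted(expression)[len(expression) // 2]
low = [x for x in expression if x < cutoff]
high = [x for x in expression if x >= cutoff]

def welch_t(a, b):
    """Welch t-statistic for mean(b) - mean(a)."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(b) - mean(a)) / se

# Test the data-driven labels on the very data that defined them.
t_stat = welch_t(low, high)
print("t-statistic between data-driven clusters:", round(t_stat, 1))
```

There is no real difference between the two groups, so the extreme statistic comes entirely from defining and testing the labels on the same data.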

IMO, the only hope to get meaningful results out of this experiment is to replicate it in multiple batches, so that you can compare the difference between batches of the same treatment to the difference between treatments.
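A quick way to check whether a sample sheet has this problem is to see whether batch and treatment map one-to-one. The sketch below is a minimal illustration; the function name `is_fully_confounded` and the batch labels are hypothetical, and it only detects complete confounding, not partial imbalance.

```python
def is_fully_confounded(samples):
    """True when each treatment was run in exactly one batch of its own.

    `samples` is an iterable of (batch, treatment) pairs. When batch and
    treatment map one-to-one, their effects cannot be separated.
    """
    pairs = set(samples)
    batches = {batch for batch, _ in pairs}
    treatments = {treatment for _, treatment in pairs}
    return len(treatments) > 1 and len(pairs) == len(batches) == len(treatments)

# The design as described: one batch per condition.
original = [("batch1", "untreated"), ("batch2", "treated")]

# A replicated design: each condition observed in more than one batch.
replicated = [
    ("batch1", "untreated"), ("batch2", "treated"),
    ("batch3", "untreated"), ("batch4", "treated"),
]

print(is_fully_confounded(original))
print(is_fully_confounded(replicated))
```

With the replicated layout, the difference between batch1 and batch3 (same treatment) gives a baseline against which the treated versus untreated difference can be judged.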