I have two scRNA-seq datasets, both derived from a heterogeneous population of cells (comprising similar numbers of cell types and subpopulations). One sample has been treated with a drug, the other has not. While I can separate out the cellular subpopulations in each dataset, what I want is to then identify differentially expressed genes that result from the drug treatment - in each of the individual cellular subpopulations.
In theory I can do this by extracting average transcript counts for each gene and then using a program such as GFOLD to run a differential expression analysis, treating each dataset as a bulk RNA-seq experiment. However, this doesn't seem very elegant and, more importantly, does not take into account the distribution of transcript counts for a given gene in a given subpopulation.
Are there any packages out there designed for these types of analyses? Does anyone have any thoughts on how I might approach this otherwise?
I presume that since you are talking about similar numbers of cell types and subpopulations, you have already run some kind of cluster analysis that gave you cluster labels for each of the cells.
If that's the case, I would use a differential expression method (for instance MAST) to compare the two populations conditional on the cell type. You can do this by specifying the right design matrix, possibly with cell-type / treatment interactions.
The edgeR and limma user guides are possibly the best places to start to learn how to specify the design matrix and the right "contrasts" that you need for the test.
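To make the design-matrix idea concrete, here is a minimal sketch in Python with NumPy. It is not MAST (which is an R package fitting a hurdle model), only an ordinary least-squares illustration of how a group-means design and a "treated minus control within a cell type" contrast fit together. The function names, the cell-type labels, and the toy log-expression values are all made up for the example.

```python
import numpy as np

def group_means_design(cell_type, treated):
    """Hypothetical helper: one indicator column per (cell type, treatment) group."""
    cell_type = np.asarray(cell_type)
    treated = np.asarray(treated, dtype=bool)
    columns, names = [], []
    for t in sorted(set(cell_type.tolist())):
        for flag, label in ((False, "control"), (True, "treated")):
            columns.append(((cell_type == t) & (treated == flag)).astype(float))
            names.append(f"{t}:{label}")
    return np.column_stack(columns), names

def treatment_contrast(names, cell_type_name):
    """Hypothetical helper: contrast vector for treated minus control in one cell type."""
    c = np.zeros(len(names))
    c[names.index(f"{cell_type_name}:treated")] = 1.0
    c[names.index(f"{cell_type_name}:control")] = -1.0
    return c

# Toy data (invented): 8 cells, 2 cluster labels, one gene's log-expression.
cell_type = ["A"] * 4 + ["B"] * 4
treated = [0, 0, 1, 1, 0, 0, 1, 1]
log_expr = np.array([1.0, 1.2, 3.0, 3.2, 2.0, 2.2, 2.1, 2.3])

X, names = group_means_design(cell_type, treated)
# Least-squares fit: each coefficient is the mean of its group.
beta, *_ = np.linalg.lstsq(X, log_expr, rcond=None)

effect_A = treatment_contrast(names, "A") @ beta  # drug effect within cell type A
effect_B = treatment_contrast(names, "B") @ beta  # drug effect within cell type B
print(names, effect_A, effect_B)
```

With this parameterization each cell type gets its own treatment effect, which is equivalent to including a cell-type / treatment interaction. A real analysis would replace the least-squares fit with the model of whichever package you choose.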
Obviously, there are many different methods for single-cell differential expression. MAST is one of them. Someone else with more direct experience can comment on their relative performance.
However, there are two very important caveats to consider.
1. You are using the data twice: the cluster labels are data-driven, but then included in the model as if they were known. This means that the p-values that you obtain from this analysis are not valid (but you can still use the ranking of the genes for exploratory / hypothesis generating purposes).
2. Most importantly, it seems like you potentially have a completely confounded design! If, as it seems from your description, all the untreated cells were harvested in one batch and all the treated cells in a different batch, you will never be able to tell if the effects that you observe are due to the drug or to batch effects.
IMO, the only hope to get meaningful results out of this experiment is to replicate it in multiple batches, so that you can compare the difference between batches of the same treatment to the difference between treatments.
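As a rough illustration of why replication matters, the sketch below compares the treatment difference with the plate-to-plate spread among plates of the same treatment. It assumes you have already reduced one gene in one subpopulation to a single mean value per plate; the function name, the "control"/"drug" labels, and the numbers are invented for the example.

```python
import numpy as np

def treatment_vs_plate_spread(plate_means, plate_treatment):
    """Hypothetical helper: return (treatment difference, pooled within-treatment plate SD).

    The SD is None when either treatment has fewer than two plates, because
    plate-to-plate variability cannot then be estimated.
    """
    plate_means = np.asarray(plate_means, dtype=float)
    plate_treatment = np.asarray(plate_treatment)
    control = plate_means[plate_treatment == "control"]
    drug = plate_means[plate_treatment == "drug"]
    difference = float(drug.mean() - control.mean())
    if len(control) < 2 or len(drug) < 2:
        return difference, None
    # Pooled variance of plate means around their own treatment mean.
    ss = ((control - control.mean()) ** 2).sum() + ((drug - drug.mean()) ** 2).sum()
    pooled_sd = float(np.sqrt(ss / (len(control) + len(drug) - 2)))
    return difference, pooled_sd

# One plate per treatment (invented values): a difference, but nothing to compare it to.
single_diff, single_sd = treatment_vs_plate_spread([1.0, 3.0], ["control", "drug"])

# Two plates per treatment (invented values): the difference can be judged against plate spread.
rep_diff, rep_sd = treatment_vs_plate_spread(
    [1.0, 1.4, 3.0, 3.4], ["control", "control", "drug", "drug"]
)
print(single_diff, single_sd, rep_diff, rep_sd)
```

In the unreplicated case the function can only report the difference; whether it reflects the drug or the plate is not something the data can answer.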
Thanks for the comments - I've been looking at MAST and I think this might well suit my requirements.
Your point about cluster labels being data-driven is well taken and I will have to think carefully about interpretation.
As for a confounded design, it is more a case of a poorly written description on my part... The study design is such that a single population of heterogeneous cells was split in two, one treated with a drug and the other kept as a control, and then library prepped and sequenced together. This should (in theory) remove any batch effects.
This all depends on how you define "batch." As I understand your design, statistics won't be able to tell you if any differences you see should be ascribed to "treatment" or "plate". Unless you run some replicates to estimate, or bound, the plate-to-plate variability, these two factors are confounded.