tl;dr A Wilcoxon rank sum test is not particularly appropriate, but it doesn't really matter, because no test will yield a relevant p-value here.
The Wilcoxon test assumes, amongst other things, i.i.d. observations within each group. This is not true of "normalized counts"; a normalized count derived from a large library will be much more precise than the same normalized value derived from a small library. It's hard to predict the effect of this violation, but it's unlikely that you'll get a happy ending here. I'm also guessing that there will be further problems from the excessive number of tied observations at zero.
The "better" approach is to fit a count-based GLM, e.g., with edgeR or friends. You can even do this by subsetting down to the single gene of interest and running that through the pipeline; with a large number of cells, there is no benefit offered by empirical Bayes shrinkage, so nothing is really lost by ignoring the genes you don't care about. (If you do this with edgeR, make sure you compute the library sizes _before_ subsetting; otherwise, your single gene will be used to compute its own normalization factors, thus cancelling itself out. Also, I would suggest using glmLRT() rather than the glmQLFTest() recommended for bulk analyses.)
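To make the order of operations concrete, here is a minimal sketch of the single-gene approach; the counts matrix, group labels and gene name are all hypothetical, and the edgeR calls are wrapped so the snippet runs without the package installed:

```r
# Hypothetical genes-by-cells count matrix and two cell-type labels.
set.seed(1)
counts <- matrix(rpois(2000, lambda = 5), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), NULL))
group <- factor(rep(c("A", "B"), each = 10))

# Compute library sizes from the FULL matrix, before any subsetting,
# so the single gene does not normalize itself away.
lib.sizes <- colSums(counts)

# Now subset to the single gene of interest.
sub <- counts["gene1", , drop = FALSE]

if (requireNamespace("edgeR", quietly = TRUE)) {
    library(edgeR)
    y <- DGEList(sub, lib.size = lib.sizes, group = group)
    design <- model.matrix(~group)
    y <- estimateGLMCommonDisp(y, design)
    fit <- glmFit(y, design)
    res <- glmLRT(fit, coef = 2)  # LRT rather than the QL F-test.
    print(topTags(res))
}
```

The key point is simply that lib.sizes is computed before the subset and then passed explicitly to DGEList().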
However, all of these minutiae are largely irrelevant because cells are not replicates. Experimental samples are replicates; if someone asked you to repeat the experiment, you'd do it with a fresh set of cells from a new donor/mouse/whatever. Any $p$-value that you get from treating cells as replicates will be much lower than it should be. In many single-cell contexts, this fact is swept under the carpet because we don't care about the actual value of the $p$-value; we only care about using the $p$-value to rank genes, e.g., for marker detection. In your case, though, it seems you're trying to use the $p$-value to make a statement about significant differences between cell types. Such an interpretation would only make sense if your repeat experiment involved re-sampling from the exact same pool of cells, which doesn't seem as if it would offer general insights into your biological system of interest.
Further comments are provided here. If you do have multiple samples, then you're in a better position, and you can consider reading this.
Thank you for all of the information and resources provided. I failed to mention that the two cell types were each taken from three donors. The DE examples I have seen examine expression changes in cells of the same type between experimental conditions. However, how would I approach pseudo-bulking when I have no additional experimental conditions like treatment, and am only interested in cell type A vs B?
It's pretty easy. After you run aggregateAcrossCells(), just subset your SummarizedExperiment so that it only contains the two cell types of interest (i.e., 6 columns total, each cell type in each of 3 donors). Then it's just a matter of applying the usual DE analysis machinery with an additive model, i.e., ~donor + cell.type, and testing the last coefficient for a significant difference from zero.

As an aside, and for the sake of completeness: in theory, a more powerful approach is to use mixed models to account for the per-cell information, which is lost in the pseudo-bulk analysis. In practice, it's slow, inconvenient and the results aren't necessarily better (see comments here). Pseudo-bulking is usually a good-enough approach here.
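To make the additive model concrete, here is a minimal base-R sketch of the design matrix for the six pseudo-bulk columns; the donor and cell-type labels are hypothetical:

```r
# Six pseudo-bulk columns: each of 2 cell types in each of 3 donors.
donor <- factor(rep(c("D1", "D2", "D3"), each = 2))
cell.type <- factor(rep(c("A", "B"), times = 3))

# Additive model: donor acts as a blocking factor, cell.type is the
# effect of interest.
design <- model.matrix(~donor + cell.type)
colnames(design)
# The last coefficient ("cell.typeB") represents the cell type B vs A
# difference after adjusting for donor; this is the coefficient tested
# against zero, e.g., with glmLRT(fit, coef = ncol(design)).
```

Because cell.type comes last in the formula, its coefficient is the last column of the design matrix, which is why testing "the last coefficient" gives the A-vs-B comparison.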
Thank you again. I pseudo-bulked raw counts into 6 columns (each of 2 cell types in 3 donors) and went through edgeR using glmLRT() and the additive model as you suggested. I got a very low number of genes within a 5% FDR, which I expected based on the nature of the cell types. However, I'm very confused by the fact that most are ERCC spike-in transcripts. Does this indicate I made a fundamental error while trying to pseudo-bulk or run the edgeR pipeline?

If I had to guess, I would say that this is because the normalization for the ERCCs is not comparable to that of the endogenous transcripts if there are differences in RNA content between your groups; see comments here. In cell types with more RNA, the endogenous RNA will compete with the spike-ins and reduce the coverage of the latter. This effect is not handled by the usual normalization methods, which focus on removing differences in the majority of genes (i.e., the endogenous ones).
Your best bet is to just remove the spike-ins. That said, if you're confident that there are no changes in RNA content between your cell types, you may want to do some more investigative work on whether the data were generated as expected; maybe someone added more spike-in than intended to a subset of the plates, etc.
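Removing the spike-ins is straightforward; a minimal sketch in base R, assuming (as is conventional) that the spike-in rows are named with an "ERCC-" prefix and using a small hypothetical count matrix:

```r
# Hypothetical pseudo-bulk count matrix containing two spike-in rows.
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(c("geneX", "ERCC-00002",
                                   "geneY", "ERCC-00043"),
                                 paste0("sample", 1:3)))

# Identify spike-in rows by their "ERCC-" prefix and drop them.
is.spike <- grepl("^ERCC-", rownames(counts))
endogenous <- counts[!is.spike, , drop = FALSE]
rownames(endogenous)
# Only the endogenous genes remain for the edgeR analysis.
```

If normalization factors were computed from the full matrix, it would be sensible to recompute them (e.g., with calcNormFactors()) after the spike-ins are removed.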