Dear Bioconductor community,
I'm currently analysing a 10X scRNA-seq data set that contains multiple different cell types using the 'standard' Seurat workflow, in which I end up with different clusters that are likely to correspond to those cell types. To confirm this, I would now like to correlate the resulting clusters with bulk RNA-seq data (analysed via edgeR) of those exact cell types, and was wondering what the best practice would be here.
Both data sets are normalised and log-transformed [in the scRNA-seq workflow using Seurat::NormalizeData(object, normalization.method = "LogNormalize", scale.factor = 10000), and in the bulk RNA-seq workflow using edgeR::cpm(y, normalized.lib.sizes = TRUE, log = TRUE) after TMM normalisation], so I am assuming that it is possible to compute the Spearman correlation between the gene expression of each bulk RNA-seq sample and the AverageExpression() of each cluster.
The results do seem reasonable in the sense that each assumed cell type in the sc data correlates most strongly with the bulk data of the matching cell type. However, the rho values are surprisingly close to each other (i.e. for the corresponding cell type they are sometimes no more than 0.02 higher than for the other cell types) and fairly high overall (e.g. ranging between 0.5 and 0.7).
Thus, I have the following questions:
1) Is this a legitimate approach overall to confirm these clusters?
2) Is it correct to use the normalised values, even though the normalisation methods (and pre-filtering steps) are different?
3) Is Spearman correlation the way to go here, or would you suggest something else?
4) If 1-3 are fine, would the generally high correlation values suggest that overall gene expression is similar across the cell types, or how else could this result be interpreted?
Apologies for not providing specific code here, but since this is a more conceptual question, I hope it becomes sufficiently clear from the description.
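To make the description a little more concrete, the correlation step could be sketched roughly as below. This is base R on simulated toy matrices, not my actual data or pipeline: `bulk_logcpm` and `cluster_avg` are hypothetical stand-ins for the log-CPM matrix (genes x bulk samples) and the per-cluster average expression (genes x clusters), and restricting both to the shared genes is an assumption about how the two matrices would be matched.

```r
# Toy stand-ins (simulated, NOT real data): rows are genes, columns are
# bulk samples or single-cell clusters. All object names are hypothetical.
set.seed(1)
genes_bulk <- paste0("gene", 1:200)
genes_sc   <- paste0("gene", 51:250)   # gene sets overlap only partly

bulk_logcpm <- matrix(rnorm(200 * 3), nrow = 200,
                      dimnames = list(genes_bulk, c("bulkA", "bulkB", "bulkC")))
cluster_avg <- matrix(rnorm(200 * 4), nrow = 200,
                      dimnames = list(genes_sc, paste0("cluster", 0:3)))

# Keep only genes present in both data sets, in the same order.
shared <- intersect(rownames(bulk_logcpm), rownames(cluster_avg))

# Spearman correlation of every bulk sample (rows of the result)
# against every cluster average (columns of the result).
rho <- cor(bulk_logcpm[shared, , drop = FALSE],
           cluster_avg[shared, , drop = FALSE],
           method = "spearman")

# For each cluster, the bulk sample with the highest rho.
best_match <- rownames(rho)[apply(rho, 2, which.max)]
names(best_match) <- colnames(rho)

print(round(rho, 2))
print(best_match)
```

With the real objects, the two matrices would need to use the same gene identifiers before the intersection step; the toy values here are random, so the resulting rho values carry no biological meaning.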
I hope everyone is staying healthy during this global pandemic.
Tobias
