Hello:
A similar issue has already been discussed in "Single cells batch effects", but I want to make sure that I got it right, because I also came across Ding et al., who look into how bulk RNA-seq normalization methods, including RUVr, work for single-cell data. Ding et al. conclude that, unless spike-ins are available, RUVr is the best choice among many bulk RNA-seq methods.
At the same time, as Aaron Lun pointed out, if the purpose of the study is subpopulation identification (clustering), then both the "unwanted factors" and the factors that define the clusters are latent (unobserved). RUVr has no way to tell them apart, so any of the latent factors can be eliminated. In that case, there is a gaping hole in the Ding et al. paper, because they should not have considered RUVr at all.
The oddest part is that RUVr worked well in their study: "Qualitatively, RUVr normalization alleviates difference between scRNA-seq protocols and the clustering results are closer to the ground truth than the other methods, i.e. the samples are clustered based on the source (HBR) and the RNA amount (bulk, 100pg or 10pg). However, the UHR samples normalized with RUVr were still clustered according to protocols rather than RNA amount. The other four methods showed worse clustering results than RUVr because ..."
I think that, luckily, the nuisance protocol factor explained so much more variance than the factor of interest (RNA amount) that RUVr removed only the former. However, had the factor of interest been more influential, RUVr would have backfired and removed it instead. Please let me know if I got this right.
Hello Davide:
Thanks for replying.
I assume that control genes or samples are not available and focus on RUVr only.
Ding et al. do not say explicitly which covariates they used in the first pass of RUVr. They probably used the observed factor(s) of interest (RNA amount, etc.) and then looked at whether the normalized points cluster well by those same factors. For clustering in practice this doesn't make sense, because the factor that defines the clusters is not observed.
To me, it looks like for subpopulation identification the only way to remove unwanted variation is when the nuisance factors are known, so that one can regress the response on them and continue to work with the residuals. I think that's what they do in the Seurat pipeline here.
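To make the "regress on known nuisance factors and keep the residuals" idea concrete, here is a minimal NumPy sketch on simulated data. It is only an illustration under my own assumptions (ordinary least squares on a log-scale samples-by-genes matrix, one known batch covariate, one hidden subpopulation); the function name regress_out and all simulated quantities are hypothetical, and this is not the Seurat code.

```python
import numpy as np

def regress_out(Y, nuisance):
    """Return residuals of Y (samples x genes) after OLS on known nuisance covariates."""
    n = Y.shape[0]
    Z = np.column_stack([np.ones(n), nuisance])   # intercept + known nuisance factors
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # one least-squares fit per gene
    return Y - Z @ coef

# Simulated example (all values are made up for illustration).
rng = np.random.default_rng(0)
n_cells, n_genes = 60, 200
batch = np.repeat([0.0, 1.0], n_cells // 2)   # known nuisance factor
subpop = np.tile([0.0, 1.0], n_cells // 2)    # hidden factor we hope to recover later
batch_effect = rng.normal(0, 2, n_genes)
subpop_effect = rng.normal(0, 1, n_genes)
Y = (np.outer(batch, batch_effect)
     + np.outer(subpop, subpop_effect)
     + rng.normal(0, 0.1, (n_cells, n_genes)))

# The residuals carry no linear batch signal but still carry the subpopulation signal.
resid = regress_out(Y, batch[:, None])
```

In this toy design the hidden subpopulation is balanced across batches, which is why regressing out the batch leaves it intact; if the two were confounded, part of the signal of interest would be removed as well.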
As for the zero inflation, do you have an idea why the most popular Seurat pipeline doesn't have a corresponding adjustment step? Is it too hard to implement, or is there still uncertainty about whether it's necessary?
I sent an email to Ding et al. and got a reply. Apparently, they used two "factors of interest" (source UHR/HBR and expressed RNA amount) for the X term in the RUV equation. Then some unwanted variation (the W term) was thrown out, and of course the normalized values started to cluster much better with respect to the factors of interest. So, for subpopulation identification it only makes sense to use RUVg or RUVs.
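The difference between supplying the factors of interest as X and not having them can be illustrated with a rough simulation. This is a simplified stand-in of my own, not the RUVSeq implementation: as I understand it, RUVr works with residuals from a first-pass GLM on counts, whereas this sketch uses ordinary least squares on a log-scale samples-by-genes matrix. The function ruvr_like and all simulated effect sizes are hypothetical.

```python
import numpy as np

def ruvr_like(Y, X, k=1):
    """Residual-based removal of k latent factors (simplified, OLS instead of a GLM)."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta                           # first pass: regress on X
    U, _, _ = np.linalg.svd(resid, full_matrices=False)
    W = U[:, :k]                                   # top-k residual factors play the role of W
    alpha, *_ = np.linalg.lstsq(W, Y, rcond=None)
    return Y - W @ alpha, W                        # normalized values and estimated W

def abs_corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

# Simulated data: the factor of interest is stronger than the protocol effect.
rng = np.random.default_rng(1)
n, genes = 60, 200
interest = np.tile([0.0, 1.0], n // 2)     # e.g. the biological grouping
protocol = np.repeat([0.0, 1.0], n // 2)   # nuisance factor
Y = (np.outer(interest, rng.normal(0, 3, genes))
     + np.outer(protocol, rng.normal(0, 1, genes))
     + rng.normal(0, 0.1, (n, genes)))
ones = np.ones((n, 1))

# X contains the factor of interest: W can only pick up what is left, i.e. the protocol.
_, W_known = ruvr_like(Y, np.column_stack([ones, interest]))
# X is intercept-only (the clustering situation): W picks up the dominant latent factor.
_, W_blind = ruvr_like(Y, ones)

print("W vs protocol, X known:", abs_corr(W_known[:, 0], protocol))
print("W vs interest, X blind:", abs_corr(W_blind[:, 0], interest))
```

In this toy setup, when X is known the estimated W tracks the protocol, so removing it helps; when X is unknown and the factor of interest dominates, W tracks the factor of interest and removing it would erase the very signal one wants to cluster on.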