Does it make sense to consider RUVr normalization for single cell RNA analysis?
Nik Tuzov ▴ 80
United States


A similar issue has already been discussed in "Single cells batch effects", but I want to make sure that I got it right, because I also came across Ding et al., who look into how bulk RNA-seq normalization methods, including RUVr, work for single-cell data. Ding et al. conclude that, unless spike-ins are available, RUVr is the best choice among many bulk RNA-seq methods.

At the same time, as Aaron Lun pointed out, if the purpose of the study is subpopulation identification (clustering), then both the "unwanted factors" and the factors that define the clusters are latent (unobserved). RUVr has no way to tell them apart, so any of the latent factors can be eliminated. In that case, there is a gaping hole in the Ding et al. paper, because they should not have considered RUVr at all.

The strangest part is that RUVr worked well in their study: "Qualitatively, RUVr normalization alleviates difference between scRNA-seq protocols and the clustering results are closer to the ground truth than the other methods, i.e. the samples are clustered based on the source (HBR) and the RNA amount (bulk, 100pg or 10pg). However, the UHR samples normalized with RUVr were still clustered according to protocols rather than RNA amount. The other four methods showed worse clustering results than RUVr because ..."

I think that, luckily, the variance explained by the nuisance protocol factor was so much higher than the variance explained by the factor of interest (RNA amount) that RUVr decided to remove only the former. However, had the factor of interest been more influential, RUVr would have backfired and removed it. Please let me know if I got it right.
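
The scenario described above can be illustrated with a toy simulation. This is only a minimal sketch and not RUVSeq itself: it assumes Gaussian log-scale data instead of counts, uses an intercept-only first pass (no known covariates), and takes the leading singular vector of the residuals as the single unwanted factor W. All names (`leading_factor_correlations`, `batch`, `group`) and effect sizes are hypothetical.

```python
import random

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def leading_left_singular_vector(R, rng, iters=100):
    """Power iteration on R R^T; returns a unit vector over samples."""
    n = len(R)
    gram = [[sum(a * b for a, b in zip(R[i], R[k])) for k in range(n)]
            for i in range(n)]
    w = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        w = [sum(g * x for g, x in zip(row, w)) for row in gram]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    return w

def leading_factor_correlations(batch_effect, group_effect, seed=1):
    """Simulate data, estimate one latent factor W from the residuals, and
    return (|cor(W, batch)|, |cor(W, group)|)."""
    rng = random.Random(seed)
    n, p = 40, 30                                         # samples, genes
    batch = [1 if i % 2 == 0 else -1 for i in range(n)]   # nuisance factor
    group = [1 if i < n // 2 else -1 for i in range(n)]   # factor of interest
    u = [rng.gauss(0, 1) for _ in range(p)]               # gene loadings for batch
    v = [rng.gauss(0, 1) for _ in range(p)]               # gene loadings for group
    Y = [[batch_effect * batch[i] * u[j] + group_effect * group[i] * v[j]
          + rng.gauss(0, 1) for j in range(p)] for i in range(n)]
    # First pass with an intercept only: residuals are the column-centred data.
    means = [sum(col) / n for col in zip(*Y)]
    R = [[y - m for y, m in zip(row, means)] for row in Y]
    W = leading_left_singular_vector(R, rng)
    return abs(corr(W, batch)), abs(corr(W, group))

# Nuisance dominates: W should track the batch factor.
print(leading_factor_correlations(batch_effect=3.0, group_effect=0.5))
# Factor of interest dominates: W should track the group instead.
print(leading_factor_correlations(batch_effect=0.5, group_effect=3.0))
```

In this sketch the estimated factor simply follows whichever latent factor explains more variance, so removing W would remove the factor of interest in the second call.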

davide risso ▴ 930
University of Padova

Hi Nik,

I have to admit that I haven't read the Ding et al. paper. I will look at it and come back with a more sensible reply. In the meantime, there are a couple of considerations that I think are worth mentioning.

  • RUVr is based on residuals, i.e., it first removes the factor of interest and then looks for unwanted variation in what's left. It may be that the authors knew the factor of interest and used RUVr in a way that preserved it, so as not to remove any biologically meaningful signal (if that's the case, it's not a very fair comparison, as most of the time with scRNA-seq you have to infer the signal of interest).
  • I generally agree with Aaron that you need to be careful with RUV in single-cell data. The main assumption of RUV is that there are some negative control genes that are not influenced by the biology, hence providing a way to estimate and remove unwanted variation. This is risky because it won't work if the negative controls are not truly negative or if the factor of interest is correlated with the factors of unwanted variation.
  • That said, we have used RUV for some single-cell data in the scone framework, where we compare various ways to account for unwanted variation in the data (see our scone package). The overall recommendation would be "use with extreme caution", but in some cases, when there is strong unwanted variation that is reasonably distinguishable from (i.e., uncorrelated with) the wanted variation, RUV works well for single-cell data. (We should hopefully have a preprint about this soon!)
  • One peculiar aspect of scRNA-seq is the abundance of zeros; hence a negative binomial model may not always be appropriate, and a zero-inflated model may be useful (see our zinbwave package). In principle, you can use the zinbwave model to infer factors of unwanted variation, although we use it mostly for the opposite (i.e., to extract unknown biological signal).
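
For readers unfamiliar with the notation, the residual-based idea in the first bullet can be summarized schematically. This is a sketch based on the formulation in the RUVSeq paper; the symbols are not defined in this thread, so treat them as assumptions: $Y$ is the samples-by-genes count matrix, $X$ holds the known covariates of interest, $W$ the unobserved factors of unwanted variation, $O$ the offsets, and $k$ the number of unwanted factors.

```latex
\begin{aligned}
&\text{Model:}  && \log \mathbb{E}[Y \mid W, X, O] = W\alpha + X\beta + O \\
&\text{Step 1:} && Z = \text{residuals from a first-pass regression of } Y \text{ on } X \text{ only} \\
&\text{Step 2:} && Z = U \Lambda V^{\top}, \quad \hat{W} \text{ built from the first } k \text{ left singular vectors} \\
&\text{Step 3:} && \text{refit the model with both } X \text{ and } \hat{W}
\end{aligned}
```

Because $\hat{W}$ is estimated from residuals that already exclude $X$, whatever is supplied as $X$ is protected, and any other source of variation is a candidate for removal.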

I'm sorry that I don't have a better answer for now. Please let me know if any of this doesn't make sense and I will try to explain better!


Hello Davide:

Thanks for replying.

I assume that control genes or samples are not available and focus on RUVr only.

Ding et al. do not say explicitly what covariates they used in the first pass of RUVr. They probably used the observed factor(s) of interest (RNA amount, etc.) and then looked at whether the normalized points cluster well by the same factors. In practice this doesn't make sense, because the clustering factor is not observed.

To me it looks like, for subpopulation identification, unwanted variation can only be removed when the nuisance factors are known, so that one can regress the response on them and continue to work with the residuals. I think that's what they do in the Seurat pipeline.
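
The regress-and-keep-residuals idea can be sketched as follows. This is a minimal illustration using ordinary least squares on one gene with a single known nuisance covariate; it is not Seurat's actual implementation, and the names `regress_out`, `n_detected`, and `hidden_cluster` as well as the toy numbers are made up.

```python
def regress_out(y, covariate):
    """Residuals of y after OLS on an intercept plus one known covariate."""
    n = len(y)
    mean_x = sum(covariate) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(covariate, y))
    slope = sxy / sxx
    return [v - mean_y - slope * (x - mean_x) for x, v in zip(covariate, y)]

# Known nuisance covariate per cell (hypothetical technical score).
n_detected = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# Unobserved cluster membership, chosen to be uncorrelated with the nuisance.
hidden_cluster = [1, 0, 0, 1, 1, 0, 0, 1]
# One gene's log-expression: a nuisance trend plus a cluster shift.
expression = [2.0 * x + 3.0 * c for x, c in zip(n_detected, hidden_cluster)]

residuals = regress_out(expression, n_detected)
# Up to rounding, only the centred cluster shift is left:
# +1.5 for cells in cluster 1 and -1.5 for cells in cluster 0.
print(residuals)
```

The cluster signal survives here only because it was constructed to be uncorrelated with the known covariate; if the two were correlated, the regression would absorb part of the cluster signal as well.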

As for zero inflation, do you have an idea why Seurat, the most popular pipeline, doesn't have a corresponding adjustment step? Is it too hard to implement, or is there still uncertainty about whether it's necessary?


I sent an email to Ding et al. and got a reply. Apparently, they used two "factors of interest" (source UHR/HBR and expressed RNA amount) for the X term in the RUV equation. Then some unwanted variation (the W term) was thrown out, and of course the normalized values started to cluster much better with respect to the factors of interest. So, for subpopulation identification it only makes sense to use RUVg or RUVs.


