Question

Does it make sense to consider RUVr normalization for single cell RNA analysis?

Nik Tuzov ▴ 80

@nik-tuzov-8783

Hello:

A similar issue has already been discussed in the thread "Single cells batch effects", but I want to make sure that I got it right, because I also came across Ding et al., who look into how bulk-RNA normalization methods, including RUVr, work for single-cell data. Ding et al. conclude that, unless spike-ins are available, RUVr is the best choice among many bulk-RNA methods.

At the same time, as Aaron Lun pointed out, if the purpose of the study is subpopulation identification (clustering), then both the "unwanted factors" and the factors that define the clusters are latent (unobserved). RUVr has no way to tell them apart, so any of the latent factors can be eliminated. In that case, there is a gaping hole in the Ding et al. paper, because they should not have considered RUVr at all.

The strangest part is that RUVr worked well in their study: "Qualitatively, RUVr normalization alleviates difference between scRNA-seq protocols and the clustering results are closer to the ground truth than the other methods, i.e. the samples are clustered based on the source (HBR) and the RNA amount (bulk, 100pg or 10pg). However, the UHR samples normalized with RUVr were still clustered according to protocols rather than RNA amount. The other four methods showed worse clustering results than RUVr because ..."

I think that, luckily, the variance explained by the nuisance protocol factor was so much higher than the variance explained by the factor of interest (RNA amount) that RUVr decided to remove only the former. However, had the factor of interest been more influential, RUVr would have backfired and removed it. Please let me know if I got it right.
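
To make the conjecture concrete, here is a minimal toy sketch in plain Python. It is not RUVSeq and not the Ding et al. analysis: it assumes Gaussian log-scale data, two orthogonal latent factors (`bio` and `batch`, both hypothetical names), and an intercept-only model, so the "residuals" are just gene-centered values. The first factor estimated from those residuals follows whichever latent factor explains more variance, which is the factor a residual-based correction would then remove.

```python
import math
import random

def simulate(s_bio, s_batch, n=40, p=200, seed=1):
    """Toy log-expression matrix (n samples x p genes) driven by two latent factors."""
    rng = random.Random(seed)
    bio = [1.0 if i < n // 2 else -1.0 for i in range(n)]    # factor of interest
    batch = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]  # nuisance factor, orthogonal to bio
    a = [rng.gauss(0, 1) for _ in range(p)]                  # gene loadings on bio
    c = [rng.gauss(0, 1) for _ in range(p)]                  # gene loadings on batch
    y = [[s_bio * bio[i] * a[g] + s_batch * batch[i] * c[g] + rng.gauss(0, 1)
          for g in range(p)] for i in range(n)]
    return y, bio, batch

def center_genes(y):
    """Residuals of an intercept-only model: subtract each gene's mean."""
    n, p = len(y), len(y[0])
    means = [sum(y[i][g] for i in range(n)) / n for g in range(p)]
    return [[y[i][g] - means[g] for g in range(p)] for i in range(n)]

def top_factor(m, iters=100, seed=2):
    """Leading left singular vector of m (one score per sample), by power iteration on m m^T."""
    n = len(m)
    gram = [[sum(u * v for u, v in zip(m[i], m[j])) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        w = [sum(gram[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w

def abs_corr(u, v):
    """Absolute Pearson correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (z - mv) for x, z in zip(u, v))
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((z - mv) ** 2 for z in v))
    return abs(cov / (su * sv))

def which_factor_is_removed(s_bio, s_batch):
    """Return (|corr with bio|, |corr with batch|) for the first estimated factor."""
    y, bio, batch = simulate(s_bio, s_batch)
    w = top_factor(center_genes(y))
    return abs_corr(w, bio), abs_corr(w, batch)

# Nuisance factor dominates: the estimated factor tracks batch.
print(which_factor_is_removed(s_bio=1.0, s_batch=3.0))
# Factor of interest dominates: the estimated factor tracks bio instead.
print(which_factor_is_removed(s_bio=3.0, s_batch=1.0))
```

In this toy setting nothing except relative effect size decides which factor is picked up, which is the scenario I am worried about.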

RUV ruvseq ruvr ruvnormalize single cell

score 1 · Answer 1 · 2017-08-01

Hi Nik,

I have to admit that I haven't read the Ding et al. paper. I will look at it and come back with a more sensible reply. In the meantime, there are a couple of considerations that I think are worth mentioning.

RUVr is based on residuals, i.e., it first removes the factor of interest and then looks for unwanted variation in what's left. It may be that the authors knew the factor of interest and used RUVr in a way that preserved it, i.e., the biologically meaningful signal was fitted and removed before the unwanted factors were estimated from the residuals (if that's the case, it's not a very fair comparison, as most of the time with scRNA-seq you have to infer the signal of interest).
I generally agree with Aaron that you need to be careful with RUV in single-cell data. The main assumption of RUV is that there are some negative control genes that are not influenced by the biology, hence providing a way to estimate and remove unwanted variation. This is risky because it won't work if the negative controls are not truly negative or if the factor of interest is correlated with the factors of unwanted variation.
That said, we have used RUV for some single-cell data in the scone framework, where we compare various ways to account for unwanted variation in the data (see our scone package). The overall recommendation would be "use with extreme caution" but in some cases, when there is strong unwanted variation reasonably distinguishable (i.e., uncorrelated) from wanted variation, RUV works well for single-cell data. (We should hopefully have a preprint about this soon!)
One peculiar aspect of scRNA-seq is the abundance of zeros; hence a negative binomial model may not always be appropriate, and a zero-inflated model may be useful (see our zinbwave package). In principle, you can use the zinbwave model to infer factors of unwanted variation, although we use it mostly for the opposite (i.e., to extract unknown biological signal).
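
The residual-based idea in the first point can be illustrated with a small toy sketch in plain Python. This is not the RUVSeq implementation (RUVr works on residuals from a count GLM, whereas this uses a Gaussian linear model on simulated log-scale data), and all names here (`simulate`, `residuals`, `bio`, `batch`) are made up for the illustration. It assumes a two-group factor of interest that is stronger than, and orthogonal to, the nuisance factor. When the groups are supplied, the residuals no longer contain the factor of interest and the estimated factor tracks the nuisance one; when they are not supplied, the estimated factor tracks the biology.

```python
import math
import random

def simulate(n=40, p=200, s_bio=3.0, s_batch=1.0, seed=1):
    """Toy log-expression matrix (n samples x p genes); bio is stronger than batch."""
    rng = random.Random(seed)
    bio = [1.0 if i < n // 2 else -1.0 for i in range(n)]    # two-group factor of interest
    batch = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]  # nuisance factor, orthogonal to bio
    a = [rng.gauss(0, 1) for _ in range(p)]
    c = [rng.gauss(0, 1) for _ in range(p)]
    y = [[s_bio * bio[i] * a[g] + s_batch * batch[i] * c[g] + rng.gauss(0, 1)
          for g in range(p)] for i in range(n)]
    return y, bio, batch

def residuals(y, groups=None):
    """Least-squares residuals of each gene on the group factor (intercept-only if groups is None)."""
    n, p = len(y), len(y[0])
    if groups is None:
        groups = [0] * n
    out = [row[:] for row in y]
    for level in set(groups):
        idx = [i for i in range(n) if groups[i] == level]
        for g in range(p):
            mean = sum(y[i][g] for i in idx) / len(idx)
            for i in idx:
                out[i][g] -= mean
    return out

def top_factor(m, iters=100, seed=2):
    """Leading left singular vector of m (one score per sample), by power iteration on m m^T."""
    n = len(m)
    gram = [[sum(u * v for u, v in zip(m[i], m[j])) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        w = [sum(gram[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w

def abs_corr(u, v):
    """Absolute Pearson correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (z - mv) for x, z in zip(u, v))
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((z - mv) ** 2 for z in v))
    return abs(cov / (su * sv))

def factor_correlations(use_known_groups):
    """Return (|corr with bio|, |corr with batch|) for the first factor estimated from residuals."""
    y, bio, batch = simulate()
    res = residuals(y, groups=bio if use_known_groups else None)
    w = top_factor(res)
    return abs_corr(w, bio), abs_corr(w, batch)

# Factor of interest supplied: it is fitted first, so the estimated factor follows batch.
print(factor_correlations(use_known_groups=True))
# Factor of interest not supplied: the strongest latent factor (here, bio) is estimated instead.
print(factor_correlations(use_known_groups=False))
```

The second call corresponds to the clustering scenario, where the groups are exactly what you are trying to discover and so cannot be given to the model.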

I'm sorry that I don't have a better answer for now. Please let me know if any of this doesn't make sense and I will try to explain better!