Question

Number of co-variables k in RUVSeq (RUVr)

David R ▴ 90

@david-rengel-6321

Last seen 4 months ago

European Union

Hi,

I use RUVSeq and find it extremely helpful. I have a question about the
number of covariables (k) to use with RUVr. I've noticed that increasing
the number of covariables makes the groups I want to see on the PCA more
visible and distinct from each other. The number of DE genes also
increases with k.
In one of my projects I have 72 samples and I ran RUVr with up to k=50. The
number of DE genes in each of my comparisons increases exponentially with k
until it reaches a plateau at high k. Likewise, the common dispersion
decreases with increasing k. It looks so good, both in terms of PCA and DE
genes, that I wonder whether such high k values might lead to false
interpretations or a high number of false positives.
I also started asking myself these questions because the example in the
RUVSeq manual uses k=1, and I wondered why that is the case if increasing k
improves the results.

I would be grateful if you could provide me with any feedback on this.

Thanks!

rnaseq ruvseq RUVr edgeR • 3.8k views

ADD COMMENT • link 8.4 years ago David R ▴ 90

score 1 · Answer 1 · 2015-12-04

Hi David,

what set of negative control genes are you using?

What you are observing is not too surprising, if you are using a large set of negative controls. RUVr assumes that the factors of unwanted variation are orthogonal to the factor of interest, so with such a large number of factors you are probably removing all the variation that is not explained by the factor of interest. Hence, you get smaller dispersion parameters, and more DE genes.
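One way to see why k=50 is so aggressive with 72 samples is to count the residual degrees of freedom: each estimated factor that is added to the design matrix uses up one. The following is only a rough sketch of that arithmetic, not RUVSeq code. The function name is made up for illustration, and the number of design coefficients (4) is an assumption, since the design is not stated in this thread.

```python
def residual_df(n_samples: int, n_design_coefs: int, k: int) -> int:
    """Residual degrees of freedom after fitting the design plus k RUV factors."""
    df = n_samples - n_design_coefs - k
    if df <= 0:
        raise ValueError("model is saturated: no residual degrees of freedom left")
    return df


N_SAMPLES = 72      # from the question
N_DESIGN_COEFS = 4  # ASSUMPTION: hypothetical value, the thread does not give the design

for k in (1, 3, 10, 50):
    # Each extra factor leaves less information for estimating the dispersion.
    print(f"k={k:2d}  residual df={residual_df(N_SAMPLES, N_DESIGN_COEFS, k)}")
```

Under these assumed numbers, k=50 leaves only a small fraction of the residual degrees of freedom available at k=1, so the dispersion is estimated from far less information.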

Note that having more DE genes does not mean that the data are well normalized. A large fraction of them are likely to be false positives. A better way of deciding how many factors to use in your dataset is to look at the behavior of positive and negative controls (i.e., genes that you know, or suspect, to be DE and non-DE, respectively) at different values of k. If you see that the fraction of DE positive controls increases while the fraction of DE negative controls doesn't, then you are on the right track. (Note that the negative controls that you use for testing k should be different from the ones you use to estimate the factors of RUV.)
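The bookkeeping for this check can be sketched in a few lines. This is a minimal illustration, not part of RUVSeq: the function name `control_hit_rates`, the gene identifiers, and the DE lists are all made up, and it assumes you have already exported the set of DE gene IDs for each k from your own analysis.

```python
def control_hit_rates(de_by_k, positive_controls, negative_controls):
    """For each k, return (fraction of positive controls called DE,
    fraction of negative controls called DE)."""
    pos, neg = set(positive_controls), set(negative_controls)
    if not pos or not neg:
        raise ValueError("both control sets must be non-empty")
    rates = {}
    for k in sorted(de_by_k):
        de = set(de_by_k[k])
        rates[k] = (len(de & pos) / len(pos), len(de & neg) / len(neg))
    return rates


# Toy data: every identifier below is hypothetical.
positive = {"posA", "posB", "posC", "posD"}   # genes expected to be DE
negative = {"negA", "negB", "negC", "negD"}   # held-out genes expected NOT to be DE
de_by_k = {
    1: {"posA", "posB", "g1"},
    3: {"posA", "posB", "posC", "g1", "g2"},
    50: {"posA", "posB", "posC", "posD", "negA", "negB", "negC", "g1", "g2", "g3"},
}

for k, (pos_rate, neg_rate) in control_hit_rates(de_by_k, positive, negative).items():
    print(f"k={k:2d}  positive controls DE: {pos_rate:.2f}  negative controls DE: {neg_rate:.2f}")
```

In this toy example, going from k=1 to k=3 recovers more positive controls without calling any negative control DE, whereas k=50 also calls most negative controls DE, which is the warning sign described above.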

Finally, although this is largely an empirical observation, a few (2-3) factors are usually enough to capture the unwanted variation. In very noisy datasets you can increase this to maybe 5 or 10, but 50 definitely sounds like too many.

score 0 · Answer 2 · 2015-12-09

Hi Davide,

Thanks a lot for the answer. I thought your answer would be emailed to me, which is why I had not replied.

Actually, I am not working with negative controls, for several reasons. Should I? Aren't negative controls meant to be used with RUVg? It is not easy for me to find robust, non-modulated genes, especially in the 72-sample project. Indeed, that is why I chose RUVr.

Nevertheless, I am verifying some candidate genes that are expected to be modulated. In another project (not the one with 72 samples) some genes are being tested by qPCR as I write. I'll see how those behave.

I would appreciate any help with regard to the negative controls.

Kind regards,

David

ADD COMMENT • link 8.4 years ago David R ▴ 90

Although RUVr and RUVs are more robust to the choice of negative controls, they still formally require you to choose a set of such genes. Since, as I said, RUVr is robust to some negative controls not being really "negative", I would suggest that you try with a "general" set of genes, such as the list of housekeeping genes that you can find here:

http://www.stat.berkeley.edu/~johann/ruv/resources/hk.txt

We have good experience with using housekeeping genes as negative controls, in general.

ADD REPLY • link 8.4 years ago davide risso ▴ 950

Thanks Davide. I'll have a look at housekeeping genes (HKG), though it is not that straightforward: the species I am dealing with has not been so thoroughly studied. And I will certainly reduce the number of factors k!

Best,

ADD REPLY • link 8.4 years ago David R ▴ 90