svaseq: how many and which surrogate variables to pick
nicklesd ▴ 10
Last seen 7.0 years ago
United States

I have a general question concerning surrogate variable analysis.

I have a large RNA-seq data set on a heterogeneous population, and I'd like to identify the major hidden sources of variation so that I can adjust for them when performing differential gene expression analysis. svaseq() from the sva package finds 33 significant surrogate variables. That is a lot, and I don't want to include all of them in my model. Apparently the sva package previously had a function called svaplot() that allowed you to visualize the percent of variation explained by each surrogate variable (I envision something like a scree plot), but that function is no longer included in the package.

So my question is: how do I pick the surrogate variables that explain most of the variation? And how do I determine a good number of variables to include?




sva • 3.8k views

Also wondering the same thing. Did you find an answer, nicklesd?

Jeff Leek ▴ 640
Last seen 17 months ago
United States

You could try the alternative of using method = "be" in the software, which is sometimes a little better if the sample size of your experiment is very large. I removed the svaplot() function because it is hard to judge by eye how many surrogate variables to include, and while the automated ways aren't entirely better, at least they are reproducible.

If you have a measured batch effect, one way some people select the number of surrogate variables is to set it to the number of batches, but again that is a bit of a hack.
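The two suggestions above could be sketched roughly as follows. This is only an illustration on simulated counts, not a recommended pipeline: the data, the object names (counts, group, batch, mod_sv) and the effect sizes are made up, and it assumes the num.sv() and svaseq() arguments (method, n.sv) as they appear in recent sva releases.

```r
library(sva)

set.seed(1)

# Simulated stand-in for a real count matrix (hypothetical data):
# 500 genes x 20 samples, two groups, two batches.
n_genes   <- 500
n_samples <- 20
group <- factor(rep(c("ctrl", "trt"), each = n_samples / 2))
batch <- factor(rep(c("b1", "b2"), times = n_samples / 2))

# Batch b2 doubles the expected count of the first 100 genes.
mu <- matrix(50, nrow = n_genes, ncol = n_samples)
mu[1:100, batch == "b2"] <- 100
counts <- matrix(rnbinom(n_genes * n_samples, mu = as.vector(mu), size = 10),
                 nrow = n_genes, ncol = n_samples)

mod  <- model.matrix(~ group)                         # full model
mod0 <- model.matrix(~ 1, data = data.frame(group))   # null model (intercept only)

# Option 1: estimate the number of surrogate variables with the "be"
# estimator on log-transformed counts, then pass that number to svaseq().
log_counts <- log(counts + 1)
n_sv_be <- num.sv(log_counts, mod, method = "be")
sv_be   <- svaseq(counts, mod, mod0, n.sv = n_sv_be)

# Option 2 (the "hack"): fix the number of surrogate variables
# to the number of measured batches.
sv_fixed <- svaseq(counts, mod, mod0, n.sv = nlevels(batch))

# Append the surrogate variables to the design matrix for the
# downstream differential expression model.
mod_sv <- cbind(mod, sv_fixed$sv)
```

Either way, the number of surrogate variables is set by an explicit, repeatable rule rather than by reading a plot.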

To be honest, how many artifact estimates to include is quite a hard and open problem in the analysis of data from these experiments.



