I have a general question concerning surrogate variable analysis.
I have a large RNA-seq data set from a heterogeneous population, and I'd like to identify the major hidden sources of variation so that I can adjust for them when performing differential gene expression analysis. svaseq() from the sva package finds 33 significant surrogate variables. That is a lot, and I don't want to include all of them in my model. Apparently the sva package used to have a function called svaplot() that let you visualize the percent of variation explained by each surrogate variable (I envision something like a scree plot), but that function is no longer included in the package.
So my question is: how do I pick the surrogate variables that explain most of the variation? And how do I determine a good number of variables to include?
Thanks,
Doro
Also wondering the same thing. Did you find an answer, nicklesd?