I am trying to estimate sources of heterogeneity in methylation data in addition to some known sources (i.e., I have batch and age but would also like to correct for smoking, unmeasured technical artifacts, and cellular heterogeneity). When I use num.sv and the default "be" method, I get 12 SVs; when I specify the "leek" method, I get 0 SVs. Is there a reason why the two methods might behave so differently?
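For anyone wanting to reproduce the comparison, here is a minimal sketch of the two calls. The data are simulated placeholders, not real methylation data: `mvals` (probes x samples) and `pheno` (with `batch` and `age` columns) are assumed names. Only `num.sv()` and its `method` and `seed` arguments come from the sva package. According to the sva help page, "be" is a permutation procedure (Buja and Eyuboglu) and "leek" is an asymptotic approach (Leek 2011), so the two estimates need not agree.

```r
library(sva)

# Simulated placeholder data (assumption): 1000 probes x 20 samples
set.seed(1)
n_samples <- 20
pheno <- data.frame(
  batch = factor(rep(c("A", "B"), each = n_samples / 2)),
  age   = round(runif(n_samples, 30, 70))
)
mvals <- matrix(rnorm(1000 * n_samples), nrow = 1000)

# Model matrix with the known covariates
mod <- model.matrix(~ batch + age, data = pheno)

# Estimate the number of surrogate variables with each method
n_be   <- num.sv(mvals, mod, method = "be", seed = 1)
n_leek <- num.sv(mvals, mod, method = "leek")

c(be = n_be, leek = n_leek)
```

On real data, `mvals` would be the M-value matrix and `mod` would normally also include the variable of interest.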
I am confused about whether one method is generally recommended over the other, as the SVA vignette shows an example with "leek": https://www.bioconductor.org/packages/devel/bioc/vignettes/sva/inst/doc/sva.pdf
...while the documentation for the sva function says it defaults to "be" if the number of SVs is not specified, and cautions that the "numSVmethod" parameter "... should not be adapted by the user unless they are an expert": https://www.rdocumentation.org/packages/sva/versions/3.20.0/topics/sva
My question is partially answered in the thread "svaseq: how many and which surrogate variables to pick", and maybe there is not a "best" way to estimate the number of SVs to include. Still, I would like to better understand the differences between the two methods.
Thanks,
Brooke
Hi Brooke,
Did you get your answer? I have the same question. I am working on TCGA breast cancer DNA methylation data. I downloaded the beta values and converted them to M-values. When I applied the "be" method, I got 94 surrogate variables, while with "leek" I got 3. I am not sure which one to use for further analysis.
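The beta-to-M-value conversion mentioned here is usually the logit transform M = log2(beta / (1 - beta)). A minimal base R sketch follows. The function name `beta_to_m` and the `offset` used to keep values away from 0 and 1 are assumptions for illustration, not part of any package.

```r
# Convert beta values to M-values: M = log2(beta / (1 - beta))
beta_to_m <- function(beta, offset = 1e-6) {
  # Clamp betas away from 0 and 1 so the result stays finite
  # (the offset value is an arbitrary assumption)
  beta <- pmin(pmax(beta, offset), 1 - offset)
  log2(beta / (1 - beta))
}

# Betas of 0.2, 0.5 and 0.8 map to roughly -2, 0 and 2
beta_to_m(c(0.2, 0.5, 0.8))
```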
Thanks
Srikant
Hi Srikant,
Sorry I didn't see your message earlier! I found this paper to be helpful in deciding what to do:
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0808-5
As the article points out, a large number of surrogate variables may be capturing more than you intend. Plus, including a huge number of covariates in a regression model is not ideal. In your case, I would definitely go for the method that resulted in 3 SVs rather than 94!
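If you settle on a fixed number, it can be passed to `sva()` through `n.sv` instead of letting the function estimate it. The sketch below uses simulated placeholder data. `group` is a hypothetical variable of interest, `hidden` is a made-up unmeasured factor, and 3 is simply the "leek" estimate reported above.

```r
library(sva)

# Simulated placeholder data (assumption): 1000 probes x 24 samples
set.seed(1)
n_samples <- 24
pheno <- data.frame(
  group = factor(rep(c("case", "control"), times = n_samples / 2)),
  batch = factor(rep(c("A", "B"), each = n_samples / 2)),
  age   = round(runif(n_samples, 30, 70))
)
hidden <- rnorm(n_samples)  # unmeasured factor, one value per sample
mvals  <- matrix(rnorm(1000 * n_samples), nrow = 1000)
# Add the hidden factor to the first 200 probes (hidden[j] goes to column j)
mvals[1:200, ] <- mvals[1:200, ] + rep(hidden, each = 200)

# Full model (with the variable of interest) and null model (without it)
mod  <- model.matrix(~ group + batch + age, data = pheno)
mod0 <- model.matrix(~ batch + age, data = pheno)

# Fix the number of surrogate variables at 3 rather than estimating it
svobj <- sva(mvals, mod, mod0, n.sv = 3)

# Append the surrogate variables to the design used downstream
mod_sv <- cbind(mod, svobj$sv)
dim(mod_sv)
```

The augmented `mod_sv` can then be used as the design matrix in the downstream regression.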
-Brooke