Question

Difference between "be" and "leek" methods when deciding number of surrogate variables to estimate with SVA?


brhead ▴ 40

@brhead-11927

Last seen 6.4 years ago

I am trying to estimate sources of heterogeneity in methylation data in addition to some known sources (i.e., I have batch and age but would also like to correct for smoking, unmeasured technical artifacts, and cellular heterogeneity). When I use num.sv and the default "be" method, I get 12 SVs; when I specify the "leek" method, I get 0 SVs. Is there a reason why the two methods might behave so differently?

I am confused about whether one method is generally recommended over the other, as the SVA vignette shows an example with "leek": https://www.bioconductor.org/packages/devel/bioc/vignettes/sva/inst/doc/sva.pdf

...while the documentation for the sva function shows that it defaults to "be" when the number of SVs is not specified, and cautions that the "numSVmethod" parameter "... should not be adapted by the user unless they are an expert": https://www.rdocumentation.org/packages/sva/versions/3.20.0/topics/sva

My question is partially answered in the thread "svaseq: how many and which surrogate variables to pick", and maybe there is no single "best" way to estimate the number of SVs to include. Still, I would like to better understand the differences between the two methods.
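For concreteness, the comparison described above can be sketched as follows. This is only an illustration on simulated data, not the real methylation data set: the objects `mval`, `pheno`, and `hidden` and the simulated effect sizes are made up, and the counts it prints will not match the 12 vs. 0 reported above. It assumes the Bioconductor package sva is installed.

```r
# Requires the Bioconductor package 'sva'.
library(sva)

set.seed(1)

# Simulated stand-in for an M-value matrix: probes in rows, samples in columns.
n_probes  <- 1000
n_samples <- 40
mval <- matrix(rnorm(n_probes * n_samples), nrow = n_probes)

# Hypothetical known covariates (batch and age), one row per sample.
pheno <- data.frame(
  batch = factor(rep(c("A", "B"), each = n_samples / 2)),
  age   = round(runif(n_samples, 30, 70))
)

# One hidden (unmeasured) factor that shifts the first 300 probes.
hidden <- rnorm(n_samples)
mval[1:300, ] <- mval[1:300, ] + outer(rnorm(300), hidden)

# Model matrix containing only the known covariates.
mod <- model.matrix(~ batch + age, data = pheno)

# "be" is permutation based, so fix the RNG seed right before calling it.
set.seed(1)
n_be <- num.sv(mval, mod, method = "be", B = 20)

# "leek" is the alternative estimator selectable through the same argument.
n_leek <- num.sv(mval, mod, method = "leek")

cat("be:", n_be, " leek:", n_leek, "\n")
```

Since "be" relies on random permutations, its estimate may change with the seed and with `B`, so it can be worth rerunning with a few different values before comparing it to "leek".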

Thanks,

Brooke

sva num.sv • 3.4k views

ADD COMMENT • link 6.9 years ago brhead ▴ 40


Hi Brooke,

Did you get an answer? I have the same question. I am working on TCGA breast cancer DNA methylation data. I downloaded the beta values and converted them to M-values. When I applied the "be" method, I got 94 surrogate variables, while with "leek" I got 3. I am not sure which one to choose for further analysis.

Thanks

Srikant

ADD REPLY • link 6.8 years ago vermasrikant • 0


Hi Srikant,

Sorry I didn't see your message earlier! I found this paper to be helpful in deciding what to do:

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0808-5

As the article points out, a large number of surrogate variables may be capturing more than you intend. Plus, including a huge number of covariates in a regression model is not ideal. In your case, I would definitely go for the method that resulted in 3 SVs rather than 94!
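Whichever estimate you settle on, you can pass it to `sva()` yourself through the `n.sv` argument instead of letting it be estimated. Here is a minimal sketch on simulated data, assuming the sva package is installed; `mval`, `group`, and `hidden` are made-up stand-ins, not your TCGA data, and 3 is just the number from your "leek" run:

```r
# Requires the Bioconductor package 'sva'.
library(sva)

set.seed(1)

# Simulated stand-in for an M-value matrix: 1000 probes x 40 samples.
n_probes  <- 1000
n_samples <- 40
mval <- matrix(rnorm(n_probes * n_samples), nrow = n_probes)

# Hypothetical variable of interest.
group <- factor(rep(c("case", "control"), each = n_samples / 2))

# Add a group effect to some probes and a hidden factor to others.
mval[1:100, ]   <- mval[1:100, ] + outer(rep(1, 100), as.numeric(group == "case"))
hidden          <- rnorm(n_samples)
mval[101:400, ] <- mval[101:400, ] + outer(rnorm(300), hidden)

# Full model (with the variable of interest) and null model (intercept only).
mod  <- model.matrix(~ group)
mod0 <- model.matrix(~ 1, data = data.frame(group))

# Fix the number of surrogate variables explicitly rather than estimating it.
svobj <- sva(mval, mod, mod0, n.sv = 3)

# Append the surrogate variables to the design for downstream modelling.
mod_sv <- cbind(mod, svobj$sv)
dim(mod_sv)
```

The resulting `mod_sv` can then be used as the design matrix in whatever differential methylation model you fit afterwards.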

-Brooke

ADD REPLY • link 6.6 years ago brhead ▴ 40