Removing batch effects with sva and combat using the sva package

6


Jeff Leek ▴ 650

@jeff-leek-5015

Last seen 3.7 years ago

United States

The latest version of the sva package is now available at Bioconductor:

http://bioconductor.org/packages/release/bioc/html/sva.html

This version includes support for both surrogate variable analysis, as described in the papers:

http://www.biostat.jhsph.edu/~jleek/papers/sva.pdf
http://www.biostat.jhsph.edu/~jleek/papers/framework.pdf

and for ComBat, an approach for removing batch effects when the source of batch is known, as described in the paper:

http://biostatistics.oxfordjournals.org/content/early/2006/04/21/biostatistics.kxj037.full.pdf

A full description of how to use the methods, including how to use the sva package with limma, removing batch effects with linear models, biological versus technical batch effects, direct adjustment versus surrogate variable adjustment, and batch effects for prediction, is available here:

http://bioconductor.org/packages/2.9/bioc/vignettes/sva/inst/doc/sva.pdf

Several recent questions have focused on removing batch effects from gene expression or other high-throughput data as a cleaning step prior to performing other analyses. An important point about batch effect correction (whether with sva, ComBat, or any other currently published approach) is that a regression analysis is performed and variation is removed from the data, so subsequent analyses using a "cleaned" version of the data should be performed with caution. In particular, methods used to infer networks or to illustrate patterns (MDS/PCA) should be used with caution after regressing out batch effects. All currently published batch effect removal methods focus on adjusting batch effects for differential expression.

That being said, the sva package can be used to "clean" a data set as follows:

(1) Use the sva() function as described in the vignette to run sva and store the sva object.
(2) Input the data set into the fsva() function, along with the model matrix used to define the sva object, and the surrogate variable object.

The db element of the object returned by fsva() will be a "clean" version of the original data set.
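The two steps described in the post can be sketched in R roughly as follows. This is a minimal illustration on simulated data, not the vignette's own example: the data, the variable names (`edata`, `group`, `hidden`) and the fixed `n.sv = 2` are assumptions made for the sketch (the package also provides `num.sv()` to estimate the number of surrogate variables from the data). It also assumes that fsva() can be called with the original data alone, leaving `newdat` at its default, as the post describes.

```r
library(sva)  # Bioconductor package, assumed to be installed

set.seed(1)
n <- 20    # number of samples
p <- 500   # number of features (e.g. genes)

# Hypothetical simulated data: a known phenotype plus an unmeasured batch
group  <- factor(rep(c("A", "B"), each = n / 2))
hidden <- rep(c(0, 1), times = n / 2)   # hidden batch, balanced across groups

edata <- matrix(rnorm(p * n), nrow = p, ncol = n,
                dimnames = list(paste0("g", 1:p), paste0("s", 1:n)))
# The hidden batch shifts the first 100 features
edata[1:100, ] <- edata[1:100, ] + outer(rep(2, 100), hidden)
# The phenotype shifts features 101 to 150
edata[101:150, ] <- edata[101:150, ] +
  outer(rep(1.5, 50), as.numeric(group == "B"))

mod  <- model.matrix(~ group)                        # full model (phenotype)
mod0 <- model.matrix(~ 1, data = data.frame(group))  # null model (intercept only)

# Step (1): estimate surrogate variables and keep the sva object
svobj <- sva(edata, mod, mod0, n.sv = 2)  # n.sv fixed here for illustration

# Step (2): pass the data, the model matrix and the sva object to fsva()
fsvaobj <- fsva(dbdat = edata, mod = mod, sv = svobj)

# The "clean" data set is the db element of the returned list
clean <- fsvaobj$db
dim(clean)
```

As the post cautions, `clean` has had variation regressed out of it, so downstream clustering, PCA or network inference on it should be interpreted carefully.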

Regression limma sva • 7.4k views

updated 12.9 years ago by Andrew Teschendorff ▴ 60 • written 12.9 years ago by Jeff Leek ▴ 650

0


What is the output data format after applying ComBat? More specifically, when applying linear models to remove batch effects one usually ends up with the residuals of the original data (expression data in my case). I would like to know what type of processing, normalization or transformations are done to the data when applying ComBat.

7.3 years ago DataFanatic ▴ 10

2


Andrew Teschendorff ▴ 60

@andrew-teschendorff-4903

Last seen 5.7 years ago

Dear Bioconductors,

In addition to Jeff's detailed explanation, I think it is also important to make the community aware of some of the potential pitfalls associated with SVA methodologies. In particular, SVA assumes a model (e.g. a linear model) between the phenotype of interest and the actual data. If this model is not an accurate reflection of the true (unknown) model, then one can be faced with the scenario where biologically interesting variation (i.e. variation associated with your phenotype of interest) is still present in the surrogate variable subspace, so subsequent adjustment for these specific surrogate variables could then result in an unreasonably weak biological signal. Examples which demonstrate this "breakdown" scenario are described in

http://www.ncbi.nlm.nih.gov/pubmed/21471010
http://bioinformatics.oxfordjournals.org/content/27/11/1496.long

So, it is important to check a posteriori that the inferred surrogate variables are not correlating strongly with your phenotype of interest. If they are, then it may be dangerous to include them in your subsequent supervised regression analysis. Incorporation of a surrogate variable selection step may therefore be necessary. How to perform this surrogate variable selection step in the case where confounders are known is described in the above paper.

Kind regards,
A.

Andrew E Teschendorff PhD
Heller Research Fellow, Statistical Cancer Genomics
Paul O'Gorman Building, UCL Cancer Institute, University College London
72 Huntley Street, London WC1E 6BT, UK
http://www.ucl.ac.uk/cancer/research-groups/statistical_cancer_genomics/index.htm
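The a posteriori check suggested in this answer (whether the inferred surrogate variables correlate strongly with the phenotype of interest) could be sketched in R as below. The simulated data, the variable names and the fixed `n.sv = 2` are assumptions made for illustration, and the R-squared summary is just one simple way to quantify the association; it is not the selection procedure from the cited paper.

```r
library(sva)  # Bioconductor package, assumed to be installed

set.seed(1)
n <- 20
p <- 500

# Hypothetical simulated data: known phenotype plus a hidden batch
group  <- factor(rep(c("A", "B"), each = n / 2))
hidden <- rep(c(0, 1), times = n / 2)

edata <- matrix(rnorm(p * n), nrow = p, ncol = n)
edata[1:100, ] <- edata[1:100, ] + outer(rep(2, 100), hidden)
edata[101:150, ] <- edata[101:150, ] +
  outer(rep(1.5, 50), as.numeric(group == "B"))

mod  <- model.matrix(~ group)
mod0 <- model.matrix(~ 1, data = data.frame(group))

svobj <- sva(edata, mod, mod0, n.sv = 2)
sv <- as.matrix(svobj$sv)   # one column per surrogate variable

# Fraction of each surrogate variable's variance explained by the phenotype
r2 <- apply(sv, 2, function(s) summary(lm(s ~ group))$r.squared)
print(round(r2, 3))
```

A value of `r2` close to 1 for some surrogate variable would flag it as strongly associated with the phenotype, which is the situation the answer warns about.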

12.9 years ago Andrew Teschendorff ▴ 60

0


Thanks to Andrew for pointing out some key issues with any batch correction approach:

(1) If you do not want to bias your significance analysis, you must remove both the surrogate variables that are uncorrelated with the phenotype you care about and the surrogate variables that are correlated with it. This is the reason that directly using PCA without supervision is often dangerous for batch removal, as are algorithms that require you to pick the surrogate variables you want to remove by eye.

(2) As described in the sva vignette on the Bioconductor download site, removing surrogate variables may remove biological variation. This is a byproduct of any unsupervised approach to removing batch effects (PCA, SVD, ICA, etc.). It is good to keep in mind that some batch effects may be biological, but if you care about biological variation within phenotype groups, direct removal of batch effects with ComBat or a linear model may be most appropriate.
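For the direct adjustment mentioned in this reply, where the batch is known rather than inferred, a minimal ComBat sketch might look like the following. The simulated data and variable names are assumptions for illustration. The phenotype is passed through `mod` so that ComBat tries to preserve it while adjusting for `batch`.

```r
library(sva)  # Bioconductor package, assumed to be installed

set.seed(1)
n <- 20
p <- 500

# Hypothetical simulated data with a known batch, balanced across groups
group <- factor(rep(c("A", "B"), each = n / 2))
batch <- factor(rep(c(1, 2), times = n / 2))

edata <- matrix(rnorm(p * n), nrow = p, ncol = n,
                dimnames = list(paste0("g", 1:p), paste0("s", 1:n)))
# Batch 2 shifts the first 100 features
edata[1:100, ] <- edata[1:100, ] + outer(rep(2, 100), as.numeric(batch == 2))
# The phenotype shifts features 101 to 150
edata[101:150, ] <- edata[101:150, ] +
  outer(rep(1.5, 50), as.numeric(group == "B"))

mod <- model.matrix(~ group)   # variables of interest to preserve

# Adjust for the known batch; returns a features x samples matrix
adj <- ComBat(dat = edata, batch = batch, mod = mod)

# Mean absolute between-batch difference for the affected features
gap <- function(x) {
  mean(abs(rowMeans(x[1:100, batch == 2]) - rowMeans(x[1:100, batch == 1])))
}
c(before = gap(edata), after = gap(adj))
```

The adjusted matrix has the same dimensions as the input, so it can be passed to downstream tools in place of `edata`, subject to the caution about "cleaned" data raised in the original post.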

12.9 years ago Jeff Leek ▴ 650
