The latest version of the sva package is now available at
Bioconductor:
http://bioconductor.org/packages/release/bioc/html/sva.html
This version includes support for both surrogate variable analysis, as described in the papers:
http://www.biostat.jhsph.edu/~jleek/papers/sva.pdf
http://www.biostat.jhsph.edu/~jleek/papers/framework.pdf
and for ComBat, an approach for removing batch effects when the source of batch is known, as described in the paper:
http://biostatistics.oxfordjournals.org/content/early/2006/04/21/biostatistics.kxj037.full.pdf
A full description of how to use the methods, including how to use the sva package with limma, removing batch effects with linear models, biological versus technical batch effects, direct adjustment versus surrogate variable adjustment, and batch effects for prediction is available here:
http://bioconductor.org/packages/2.9/bioc/vignettes/sva/inst/doc/sva.pdf
Several recent questions have focused on removing batch effects from gene expression or other high-throughput data as a cleaning step prior to performing other analyses. An important point about batch effect correction (whether with sva, ComBat, or any other currently published approach) is that a regression analysis is performed and variation is removed from the data. So subsequent analyses using a "cleaned" version of the data should be performed with caution. In particular, methods used to infer networks or to illustrate patterns (MDS/PCA) should be used with caution after regressing out batch effects. All currently published batch effect removal methods focus on adjusting batch effects for differential expression.
That being said, the sva package can be used to "clean" a data set as follows:
(1) Use the sva() function as described in the vignette to run sva and store the sva object.
(2) Input the data set into the fsva() function, along with the model matrix used to define the sva object, and the surrogate variable object.
The db variable that is returned from this command will be a "clean" version of the original data set.
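As a rough illustration of what "a regression analysis is performed and variation is removed from the data" means, the NumPy sketch below fits each feature on a phenotype design plus a batch column and subtracts only the fitted batch term. This is a sketch of the general idea under stated assumptions, not the sva, fsva, or ComBat implementation; the toy data and the helper name remove_batch_by_regression are made up for the example.

```python
import numpy as np

def remove_batch_by_regression(expr, pheno_design, batch_design):
    """Subtract the fitted batch component from each feature.

    expr:         (n_features, n_samples) expression matrix
    pheno_design: (n_samples, p) phenotype design, including an intercept
    batch_design: (n_samples, b) batch or surrogate-variable columns
    """
    full = np.hstack([pheno_design, batch_design])
    # Least-squares fit of every feature on the full design at once.
    coef, *_ = np.linalg.lstsq(full, expr.T, rcond=None)
    batch_coef = coef[pheno_design.shape[1]:, :]
    # Remove only the batch part; phenotype effect and residuals stay.
    return expr - (batch_design @ batch_coef).T

# Toy data (assumption): 50 features, 12 samples, balanced design.
rng = np.random.default_rng(0)
n_features, n_samples = 50, 12
pheno = np.tile([0.0, 1.0], n_samples // 2)      # alternating control/case
batch = np.repeat([0.0, 1.0], n_samples // 2)    # first half / second half
pheno_design = np.column_stack([np.ones(n_samples), pheno])
batch_design = batch[:, None]

expr = rng.normal(size=(n_features, n_samples))
expr += 2.0 * pheno    # phenotype effect shared by all features
expr += 3.0 * batch    # batch shift shared by all features

cleaned = remove_batch_by_regression(expr, pheno_design, batch_design)
```

In this sketch the cleaned matrix has the same shape and scale as the input, and the phenotype term is kept, which is different from taking the residuals of a batch-only model. Because the fitted batch term is subtracted from the data itself, downstream methods that look at correlation structure (networks, MDS/PCA) see data that has already been through one regression, which is the reason for the caution above.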
What is the output data format after applying ComBat? More specifically, when applying linear models to remove batch effects, one usually ends up with the residuals of the original data (expression data in my case). I would like to know what processing, normalization, or transformations are applied to the data when using ComBat.
Dear Bioconductors,
In addition to Jeff's detailed explanation, I think it is also important to make the community aware of some of the potential pitfalls associated with SVA methodologies. In particular, SVA assumes a model (e.g. a linear model) between the phenotype of interest and the actual data. If this model is not an accurate reflection of the true (unknown) model, then one can be faced with the scenario where biologically interesting variation (i.e. variation associated with your phenotype of interest) is still present in the surrogate variable subspace, so subsequent adjustment for these specific surrogate variables could then result in an unreasonably weak biological signal. Examples which demonstrate this "breakdown" scenario are described in
http://www.ncbi.nlm.nih.gov/pubmed/21471010
http://bioinformatics.oxfordjournals.org/content/27/11/1496.long
So, it is important to check a posteriori that the inferred surrogate variables do not correlate strongly with your phenotype of interest. If they do, then it may be dangerous to include them in your subsequent supervised regression analysis. Incorporation of a surrogate variable selection step may therefore be necessary. How to perform this surrogate variable selection step in the case where confounders are known is described in the above paper.
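One simple way to run the a posteriori check described above is to correlate each estimated surrogate variable with the phenotype. The NumPy sketch below does this on made-up numbers; the matrix sv, the helper sv_phenotype_correlation, and the 0.5 cutoff are illustrative assumptions, not values taken from the paper or from the sva package.

```python
import numpy as np

def sv_phenotype_correlation(sv, pheno):
    """Pearson correlation of each surrogate variable (column of sv) with pheno."""
    sv_centered = sv - sv.mean(axis=0)
    ph_centered = pheno - pheno.mean()
    num = sv_centered.T @ ph_centered
    den = np.linalg.norm(sv_centered, axis=0) * np.linalg.norm(ph_centered)
    return num / den

# Toy inputs (assumption): 8 samples, two surrogate variables.
pheno = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
sv = np.column_stack([
    [0.1, -0.2, 0.0, 0.1, 0.9, 1.1, 1.0, 0.8],      # tracks the phenotype closely
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],   # unrelated to the phenotype
])

r = sv_phenotype_correlation(sv, pheno)
flagged = np.flatnonzero(np.abs(r) > 0.5)   # 0.5 is an arbitrary cutoff
print(r, flagged)
```

A flagged surrogate variable is only a prompt to look more closely; the cutoff and what to do with a flagged variable remain judgment calls.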
kind regards
A.
Andrew E Teschendorff PhD
Heller Research Fellow
Statistical Cancer Genomics
Paul O'Gorman Building
UCL Cancer Institute
University College London
72 Huntley Street
London WC1E 6BT, UK.
Mob: +44 07876 561263
Email: a.teschendorff at ucl.ac.uk
http://www.ucl.ac.uk/cancer/research-groups/statistical_cancer_genomics/index.htm
________________________________________
From: bioconductor-bounces@r-project.org [bioconductor-bounces@r-project.org] On Behalf Of Jeff Leek [jtleek@gmail.com]
Sent: 15 December 2011 17:16
To: bioconductor at r-project.org
Subject: [BioC] Removing batch effects with sva and combat using the sva package
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
Thanks to Andrew for pointing out some key issues with any batch correction approach:
(1) If you do not want to bias your significance analysis, you must remove both surrogate variables that are uncorrelated with the phenotype you care about and surrogate variables that are correlated with the phenotype you care about. This is the reason that directly using PCA without supervision is often dangerous for batch removal, as are algorithms that require you to pick the surrogate variables you want to remove by eye.
(2) As described in the sva vignette on the Bioconductor download site, removing surrogate variables may remove biological variation. This is a byproduct of any unsupervised approach to removing batch effects (PCA, SVD, ICA, etc.). It is good to keep in mind that some batch effects may be biological, but if you care about biological variation within phenotype groups, direct removal of batch effects with ComBat or a linear model may be most appropriate.
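Point (1) can be illustrated with a small noise-free regression: when a surrogate variable that is correlated with the phenotype is left out of the model, the phenotype coefficient absorbs part of its effect. The NumPy sketch below is a toy example under stated assumptions (the vectors pheno and sv, the effect sizes 1.0 and 2.0, and the helper phenotype_coefficient are all made up), not a reproduction of how sva performs its significance analysis.

```python
import numpy as np

pheno = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
sv = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)   # correlated with pheno

# Noise-free toy feature: true phenotype effect 1.0, true SV effect 2.0.
y = 1.0 * pheno + 2.0 * sv

def phenotype_coefficient(y, *covariates):
    """Least-squares coefficient of the first covariate, with an intercept."""
    design = np.column_stack([np.ones_like(y), *covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

unadjusted = phenotype_coefficient(y, pheno)       # SV left out of the model
adjusted = phenotype_coefficient(y, pheno, sv)     # SV included as a covariate
print(unadjusted, adjusted)
```

Only the adjusted fit returns the phenotype effect that was used to build y; the unadjusted fit mixes in part of the surrogate variable's effect.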
On Thu, Dec 15, 2011 at 1:00 PM, Andrew Teschendorff <a.teschendorff@ucl.ac.uk> wrote: