Hollis,

There's no need to "estimate" the batch inputs for ComBat after running SVA - you can just adjust for the surrogate variables themselves in any downstream analyses. For something like WGCNA, or any other clustering algorithm, you can regress the surrogate variables out of the expression data (see code below). In order to minimize the loss of biological signal, be sure to properly specify your model matrix ('mod') prior to running SVA - any variable you put in the model will be "protected" from being treated as the unexplained heterogeneity that SVA estimates. Specifying the model matrix might be a little tricky given the potential longitudinal nature of the data, but you can be fairly flexible with the linear model that you input into SVA. You could also manually request a smaller number of surrogate variables via the "n.sv" argument to sva() if you want to be conservative about removing biology, at the expense of possibly retaining some unexplained heterogeneity.
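For example, a minimal sketch of that setup (the data frame 'pheno' and the covariates 'group' and 'timepoint' are hypothetical placeholders - substitute your own design):

library(sva)

# pheno: hypothetical data.frame of sample covariates, one row per sample
mod  = model.matrix(~ group + timepoint, data = pheno) # biology to protect
mod0 = model.matrix(~ 1, data = pheno)                 # null model

# estimate how many surrogate variables the data support...
n.sv = num.sv(y, mod, method = "be")

# ...but you can pass a smaller number here if you want to be
# conservative about removing biology
svaobj = sva(y, mod, mod0, n.sv = n.sv)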
When you regress the surrogate variables out of the expression data, be sure to fit the entire model including your covariates of interest (and not just the surrogate variables) - otherwise, the cleaned data will not look as good. You can use the following function:
cleaningY = function(y, mod, svaobj) {
  # y: expression matrix (genes x samples)
  # mod: model matrix of protected covariates (samples x P)
  # svaobj: output of sva(), surrogate variables in svaobj$sv
  X = cbind(mod, svaobj$sv)
  Hat = solve(t(X) %*% X) %*% t(X)  # (X'X)^-1 X'
  beta = Hat %*% t(y)               # fitted coefficients, one column per gene
  P = ncol(mod)
  # subtract only the surrogate-variable component; the protected
  # covariate effects stay in the data (drop = FALSE guards the
  # single-surrogate-variable case)
  cleany = y - t(X[, -c(1:P), drop = FALSE] %*% beta[-c(1:P), , drop = FALSE])
  return(cleany)
}
# and implement it like this:
mod = model.matrix(~[whatever your model is]) # specify the model
svaobj = sva(y, mod) # y is your expression matrix
cleany = cleaningY(y, mod, svaobj)
WGCNA(cleany, ...) # or whatever the format is...
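If you already use limma, its removeBatchEffect() performs essentially the same regression - a one-line sketch, assuming 'y', 'mod', and 'svaobj' from above:

library(limma)
# regress out the surrogate variables while protecting the columns of mod
cleany = removeBatchEffect(y, covariates = svaobj$sv, design = mod)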
Hope that helps,
Andrew
Date: Mon, 6 Aug 2012 14:50:03 -0700
From: Hollis Wright <wrighth@ohsu.edu>
To: bioconductor@r-project.org
Subject: [BioC] Back-estimating batch variables from SVA for ComBat?
Hi, all; we're working with some gene expression data and we suspect that there may be some irregularities in the arrays; unfortunately, these arrays were run some time ago and at this point we're not actually sure what the batches (if any) were; complicating matters, this is also a timepoint analysis, so there's most likely some technical variability there as well. Long-run, we'd like to run WGCNA and some similar methods on this data, and I'd had good luck with ComBat adjusting for batches in a previous case where we did have that information. Is there any way to use sva estimates to estimate what the batches "should" be for our data? sva does estimate and find three significant latent variables, but I'm not sure what (if anything) can be done from there in terms of adjusting the expression levels to compensate for the latent variables; obviously we'd also be concerned about losing the biological variability if we make any adjustments. Is this possible, or am I heading in the wrong direction here?
Hollis Wright, PhD
Ojeda Lab, Division of Neuroscience
Oregon Health and Science University