Question

Back-estimating batch variables from SVA for ComBat?

0

Entering edit mode

Andrew Jaffe ▴ 120

@andrew-jaffe-4820

Last seen 10.2 years ago

Hollis, There's no need to "estimate" the batch inputs for ComBat after running SVA - you can just adjust for the surrogate variables themselves in any downstream analyses. For something like WGCNA, or any other clustering algorithm, you can regress the surrogate variables out of the expression data (see code below). In order to minimize the loss of biological signal, be sure to properly specify your model matrix ('mod') prior to running SVA - any variable you put in the model will be "protected" from being treated as unexplained heterogeneity that SVA estimates. Specifying the model matrix might be a little tricky given the potential longitudinal nature of the data, but you can be fairly flexible with the linear model that you input into SVA. You could also try manually inputting the estimation of fewer surrogate variables if you want to be conservative about removing biology, at the expense of possibly preserving some unexplained heterogeneity using the "n.sv" argument in SVA When you regress the surrogate variables out of the expression data, be sure to fit the entire model including your covariates of interest (and not just the surrogate variables) - otherwise, the cleaned data will not look as good. You can use the following function: cleaningY = function(y, mod, svaobj) { X=cbind(mod,svaobj$sv) Hat=solve(t(X)%*%X)%*%t(X) beta=(Hat%*%t(y)) P=ncol(mod) cleany=y-t(as.matrix(X[,-c(1:P)])%*%beta[-c(1:P),]) return(cleany) } # and implement it like this: mod = model.matrix(~[whatever your model is]) # specify the model svaobj = sva(y, mod) # y is your expression matrix cleany = cleaningY(y,mod,svaobj) WGCNA(cleany,...) # or whatever the format is... Hope that helps, Andrew Message: 9 Date: Mon, 6 Aug 2012 14:50:03 -0700 From: Hollis Wright <wrighth@ohsu.edu> To: "bioconductor@r-project.org" <bioconductor@r-project.org> Subject: [BioC] Back-estimating batch variables from SVA for ComBat? Message-ID: <cfc8814f553202438e9a18dc1e512e2806f93ed0a0@ex-mb04.ohsu.edu> Content-Type: text/plain; charset="us-ascii" Hi, all; we're working with some gene expression data and we suspect that there may be some irregularities in arrays; unfortunately, these arrays were run some time ago and we're not actually sure what the batches (if any) they were run in were at this point; complicating matters, this is also a timepoint analysis so there's some technical variability there most likely. Long-run, we'd like to run WGCNA and some similar methods on this data and I'd had good luck with ComBat adjusting for batches in a previous case where we did have that information. Is there any way to use sva estimates to estimate what the batches "should" be for our data? sva does estimate and find three significant latent variables but I'm not sure what (if anything) can be done from there in terms of adjusting the expression levels to compensate for the latent variables; obviously we'd also be concerned about losing the biological variability if we make any adjustments. Is this possible, or am I heading i! n the wrong direction here? Hollis Wright, PhD Ojeda Lab, Division of Neuroscience Oregon Health and Science University [[alternative HTML version deleted]]

sva sva • 2.8k views

ADD COMMENT • link updated 12.3 years ago by wrighth ▴ 260 • written 12.3 years ago by Andrew Jaffe ▴ 120

score 0 · Answer 1 · 2012-08-07

Thanks, Andrew! We'll give this a shot; just wasn't sure if there was some way to apply ComBat directly or if you had to do something more along these lines. Hollis Wright, PhD Ojeda Lab, Division of Neuroscience Oregon Health and Science University ________________________________________ From: Andrew Jaffe [ajaffe@jhsph.edu] Sent: Tuesday, August 07, 2012 7:15 AM To: bioconductor at r-project.org Cc: Hollis Wright Subject: [BioC] Back-estimating batch variables from SVA for ComBat? Hollis, There's no need to "estimate" the batch inputs for ComBat after running SVA - you can just adjust for the surrogate variables themselves in any downstream analyses. For something like WGCNA, or any other clustering algorithm, you can regress the surrogate variables out of the expression data (see code below). In order to minimize the loss of biological signal, be sure to properly specify your model matrix ('mod') prior to running SVA - any variable you put in the model will be "protected" from being treated as unexplained heterogeneity that SVA estimates. Specifying the model matrix might be a little tricky given the potential longitudinal nature of the data, but you can be fairly flexible with the linear model that you input into SVA. You could also try manually inputting the estimation of fewer surrogate variables if you want to be conservative about removing biology, at the expense of possibly preserving some unexplained heterogeneity using the "n.sv<http: n.sv="">" argument in SVA When you regress the surrogate variables out of the expression data, be sure to fit the entire model including your covariates of interest (and not just the surrogate variables) - otherwise, the cleaned data will not look as good. You can use the following function: cleaningY = function(y, mod, svaobj) { X=cbind(mod,svaobj$sv) Hat=solve(t(X)%*%X)%*%t(X) beta=(Hat%*%t(y)) P=ncol(mod) cleany=y-t(as.matrix(X[,-c(1:P)])%*%beta[-c(1:P),]) return(cleany) } # and implement it like this: mod = model.matrix(~[whatever your model is]) # specify the model svaobj = sva(y, mod) # y is your expression matrix cleany = cleaningY(y,mod,svaobj) WGCNA(cleany,...) # or whatever the format is... Hope that helps, Andrew Message: 9 Date: Mon, 6 Aug 2012 14:50:03 -0700 From: Hollis Wright <wrighth@ohsu.edu<mailto:wrighth@ohsu.edu>> To: "bioconductor at r-project.org<mailto:bioconductor at="" r-project.org="">" <bioconductor at="" r-project.org<mailto:bioconductor="" at="" r-project.org="">> Subject: [BioC] Back-estimating batch variables from SVA for ComBat? Message-ID: <cfc8814f553202438e9a18dc1e512e2806f93ed0a0 at="" ex-="" mb04.ohsu.edu<mailto:cfc8814f553202438e9a18dc1e512e2806f93ed0a0="" at="" ex-="" mb04.ohsu.edu="">> Content-Type: text/plain; charset="us-ascii" Hi, all; we're working with some gene expression data and we suspect that there may be some irregularities in arrays; unfortunately, these arrays were run some time ago and we're not actually sure what the batches (if any) they were run in were at this point; complicating matters, this is also a timepoint analysis so there's some technical variability there most likely. Long-run, we'd like to run WGCNA and some similar methods on this data and I'd had good luck with ComBat adjusting for batches in a previous case where we did have that information. Is there any way to use sva estimates to estimate what the batches "should" be for our data? sva does estimate and find three significant latent variables but I'm not sure what (if anything) can be done from there in terms of adjusting the expression levels to compensate for the latent variables; obviously we'd also be concerned about losing the biological variability if we make any adjustments. Is this possible, or am I heading i! n the wrong direction here? Hollis Wright, PhD Ojeda Lab, Division of Neuroscience Oregon Health and Science University