Question: surrogate variables from sva independent of model variables?
1
pixie10 wrote:

Hi,

I'm trying to get understand the full vs null model specifications in the sva package.
My understanding is the following:

full model = Variable-of-interest + all confounders
null model = all confounders

The resulting SVs should then capture noise that is not captured by my confounders. Is this correct?
What I observe though is that when I run sva, the resulting SVs relate stronger to the confounders, when these were included in the model, vs a scenario, where I didn't include the confounders in the models.

Below a reproducible example using the bladderbatch data. "batch" is in this case a measured confounder (not my variable-of-interest).

library(sva)
library(pamr)
library(limma)

mod = model.matrix(~as.factor(cancer), data=pheno)
mod0 = model.matrix(~1,data=pheno)
n.sv = num.sv(edata,mod,method="leek")
n.sv
 2

svobj = sva(edata,mod,mod0,n.sv=n.sv)

Number of significant surrogate variables is:  2
Iteration (out of 5 ):1  2  3  4  5  

Now, let's see how the derived SVs relate to the batch variable (a covariate not included in the model).

summary(aov(svobj$sv ~ as.factor(pheno$batch)))

Response 1 :
Df  Sum Sq  Mean Sq F value    Pr(>F)
as.factor(pheno$batch) 4 0.34913 0.087284 6.9734 0.0001426 *** Residuals 52 0.65087 0.012517 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Response 2 : Df Sum Sq Mean Sq F value Pr(>F) as.factor(pheno$batch)  4 0.29016 0.072541  5.3141 0.001153 **
Residuals              52 0.70984 0.013651
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Ok, let's include "batch" as a covariate and see whether the derived SVs are independent of that covariate.

mod = model.matrix(~as.factor(cancer) + as.factor(batch), data=pheno)
mod0 = model.matrix(~as.factor(batch),data=pheno)
svobj = sva(edata,mod,mod0,n.sv=n.sv)

Number of significant surrogate variables is:  2
Iteration (out of 5 ):1  2  3  4  5

summary(aov(svobj$sv ~ as.factor(pheno$batch)))

Response 1 :
Df  Sum Sq  Mean Sq F value    Pr(>F)
as.factor(pheno$batch) 4 0.36353 0.090882 7.425 8.269e-05 *** Residuals 52 0.63647 0.012240 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Response 2 : Df Sum Sq Mean Sq F value Pr(>F) as.factor(pheno$batch)  4 0.42935 0.107339  9.7812 5.631e-06 ***
Residuals              52 0.57065 0.010974
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I don't get why the association between the SVs and batch is stronger in the second example, in which batch was included in the models. I thought batch should have been "protected". (note: in my own data, the covariate is age; I just used batch here for reproducibility)

Thanks, Esther

I have the same problem. Did you manage to find an answer Esther?