Meritxell Oliva
Dear Bioconductor list & Jeff Leek,
I am using sva to estimate potential surrogate variables in a
microarray-derived expression dataset, as a preliminary step before
differential gene expression analysis. The aim of my work is to study
how one multi-level factor (inversion genotype, three
categories: STD, HET, INV) is associated with the gene expression
profile of a set of human individuals. However, there are some other
variables (e.g. population, gender) with a partial effect, that is,
they account for variation in the expression of only a subset of genes.
I don't know how to deal with these variables. Which of the following
options is the most appropriate one (if any)?
A) "Protect" them by including them in both the null and the full
model
mod0 = model.matrix(~as.factor(Gender)+as.factor(Population),
data=pheno)
mod = model.matrix(~as.factor(inversion_genotype)+as.factor(Gender)+
as.factor(Population), data=pheno)
svobj = sva(edata,mod,mod0)
B) Include them only in the full model
mod0 = model.matrix(~1, data=pheno)
mod = model.matrix(~as.factor(inversion_genotype)+as.factor(Gender)+
as.factor(Population), data=pheno)
svobj = sva(edata,mod,mod0)
C) Do not include them at all (and expect some surrogate variables
to correlate strongly with these variables, if they really affect
gene expression)
mod0 = model.matrix(~1, data=pheno)
mod = model.matrix(~as.factor(inversion_genotype), data=pheno)
svobj = sva(edata,mod,mod0)
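To make the difference between the three options concrete, here is a minimal base-R sketch. It assumes a made-up `pheno` table (the values are invented) and a hypothetical helper `interest_cols`, which is not part of sva. As far as the sva documentation describes it, `mod` is the full model and `mod0` the null model, so the terms present in `mod` but absent from `mod0` are the ones treated as variables of interest.

```r
# Toy phenotype table; values are invented purely for illustration.
pheno <- data.frame(
  inversion_genotype = rep(c("STD", "HET", "INV"), times = 4),
  Gender             = rep(c("F", "M"), each = 6),
  Population         = rep(c("P1", "P2", "P3"), each = 2, times = 2)
)

# The two full models and two null models used across options A, B and C.
full_adj  <- model.matrix(~as.factor(inversion_genotype) + as.factor(Gender) +
                            as.factor(Population), data = pheno)
full_geno <- model.matrix(~as.factor(inversion_genotype), data = pheno)
null_adj  <- model.matrix(~as.factor(Gender) + as.factor(Population),
                          data = pheno)
null_int  <- model.matrix(~1, data = pheno)

# Hypothetical helper (not an sva function): columns present in the full
# model but absent from the null model, i.e. the terms being tested.
interest_cols <- function(mod, mod0) setdiff(colnames(mod), colnames(mod0))

interest_A <- interest_cols(full_adj, null_adj)   # option A
interest_B <- interest_cols(full_adj, null_int)   # option B
interest_C <- interest_cols(full_geno, null_int)  # option C

print(interest_A)
print(interest_B)
print(interest_C)
```

Under these assumptions, options A and C test only the genotype columns, while option B also places the Gender and Population columns among the tested terms. Option C differs from A in that the two covariates are left for the surrogate variables to absorb.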
To summarize: how should adjustment variables with a global effect be
treated? How should adjustment variables with a partial effect (only in
a subset of genes) be treated?
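The expectation in option C, that some surrogate variables would track Gender or Population, can be checked after the fact. The sketch below is only an illustration: `sv` is a simulated stand-in for the `svobj$sv` matrix returned by sva, the phenotype values are made up, and `sv_r2` is a hypothetical helper rather than an sva function.

```r
# Simulated stand-ins: in a real analysis `sv` would be svobj$sv and the
# covariates would come from pheno. All values here are made up.
set.seed(42)
n <- 40
pheno_toy <- data.frame(
  Gender     = factor(rep(c("F", "M"), each = n / 2)),
  Population = factor(rep(c("P1", "P2"), times = n / 2))
)
sv <- cbind(
  sv1 = as.numeric(pheno_toy$Gender == "M") + rnorm(n, sd = 0.2),  # tracks Gender
  sv2 = rnorm(n)                                                   # unrelated noise
)

# Hypothetical helper: R^2 from regressing each surrogate variable on each
# known covariate. Rows are surrogate variables, columns are covariates.
sv_r2 <- function(sv, covariates) {
  sapply(covariates, function(x) {
    apply(sv, 2, function(s) summary(lm(s ~ x))$r.squared)
  })
}

r2 <- sv_r2(sv, pheno_toy)
print(round(r2, 2))
```

A large R^2 for a given pair suggests that the surrogate variable is largely capturing that known covariate. A small one suggests the covariate is not what that surrogate variable reflects.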
I would really appreciate any piece of advice.
Meri
-- output of sessionInfo():
R version 2.15.2 (2012-10-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
--
Meritxell Oliva
PhD student
IBB (Biotechnology and Biomedicine Institute)
Comparative and Functional Genomics group
Campus Universitari - 08193 Bellaterra Cerdanyola del Vallès -
Barcelona
Dear Olivia, dear Jeff,
Though this thread is a bit dated, I am facing the same problem here, and maybe Olivia found an answer long ago. I certainly hope so for you :)
I am wondering how you determined which factors influence your data in the first place. How can we determine the influence of surrogate factors on our data before letting SVA correct for them?
I would love to hear your answers!
Best,
Sebastian