sva: how to incorporate adjusting variables


Meritxell Oliva ▴ 120

@meritxell-oliva-6129

Last seen 9.9 years ago

Dear Bioconductor list & Jeff Leek,

I am using sva to estimate surrogate variables for a microarray-derived expression dataset, as a preliminary step before differential gene expression analysis. The aim of my work is to study how one multi-category variable (inversion genotype, three categories: STD, HET, INV) is associated with the gene expression profile of a set of human individuals. However, there are other variables (e.g. population, gender) with a partial effect; that is, they account for variation in the expression of a subset of genes. I don't know how to deal with these variables. Which of the following options is the most appropriate (if any)?

A) "Protect" them by including them in both the null and the full model

mod0 = model.matrix(~as.factor(Gender)+as.factor(Population), data=pheno)
mod = model.matrix(~as.factor(inversion_genotype)+as.factor(Gender)+as.factor(Population), data=pheno)
svobj = sva(edata,mod,mod0)

B) Include them only in the full model

mod0 = model.matrix(~1, data=pheno)
mod = model.matrix(~as.factor(inversion_genotype)+as.factor(Gender)+as.factor(Population), data=pheno)
svobj = sva(edata,mod,mod0)

C) Not include them at all (and expect to get some surrogate variables strongly correlated with these variables, in case they really affect gene expression)

mod0 = model.matrix(~1, data=pheno)
mod = model.matrix(~as.factor(inversion_genotype), data=pheno)
svobj = sva(edata,mod,mod0)

To summarize: how should adjustment variables with a global effect be treated? How should adjustment variables with a partial effect (only in a subset of genes) be treated?

I would really appreciate any piece of advice.

Meri

-- output of sessionInfo():

R version 2.15.2 (2012-10-26)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

--
Meritxell Oliva
PhD student
IBB (Biotechnology and Biomedicine Institute)
Comparative and Functional Genomics group
Campus Universitari - 08193 Bellaterra
Cerdanyola del Vallès - Barcelona

inveRsion sva • 5.1k views

updated 10.9 years ago by Jeff Leek ▴ 650 • written 10.9 years ago by Meritxell Oliva ▴ 120


Jeff Leek ▴ 650

@jeff-leek-5015

Last seen 3.4 years ago

United States

Hi Meritxell,

The appropriate approach with sva is (A) since the known variables Population and Genotype will be used in the ultimate linear model you intend to fit to test for the effect of your variable of interest.

Best,

Jeff

On Mon, Sep 2, 2013 at 12:43 PM, Meritxell Oliva wrote:
> [...]
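To make the "ultimate linear model" in this answer concrete, here is a minimal base-R sketch for a single gene under option A. Everything in it is simulated for illustration: the values in `pheno`, the expression vector `y`, and the matrix `sv`, which is only a hypothetical stand-in for the `svobj$sv` that `sva(edata, mod, mod0)` would return. The idea it tries to show is that the surrogate variables are appended to both the null and the full model, so the nested-model test isolates the genotype terms.

```r
set.seed(1)
n <- 60

# Simulated phenotype table; column names follow the thread, values are made up
pheno <- data.frame(
  inversion_genotype = rep(c("STD", "HET", "INV"), length.out = n),
  Gender             = rep(c("F", "M"), each = n / 2),
  Population         = rep(c("P1", "P2", "P3"), each = n / 3)
)

# Option A: known adjustment variables appear in both models
mod0 <- model.matrix(~ as.factor(Gender) + as.factor(Population), data = pheno)
mod  <- model.matrix(~ as.factor(inversion_genotype) + as.factor(Gender) +
                       as.factor(Population), data = pheno)

# ASSUMPTION: simulated placeholder for svobj$sv (two surrogate variables)
sv <- matrix(rnorm(n * 2), nrow = n, ncol = 2)

# One simulated gene with a genotype effect and a hidden-factor effect
y <- 0.8 * (pheno$inversion_genotype == "INV") + 0.5 * sv[, 1] + rnorm(n)

# Append the surrogate variables to both design matrices
modSv  <- cbind(mod, sv)
mod0Sv <- cbind(mod0, sv)

# Nested-model F-test: the only difference between the fits is genotype
fit0 <- lm(y ~ 0 + mod0Sv)
fit1 <- lm(y ~ 0 + modSv)
res  <- anova(fit0, fit1)
p_genotype <- res[["Pr(>F)"]][2]
print(res)
```

In a real analysis the same comparison would be run for every gene, for example with sva's `f.pvalue(edata, modSv, mod0Sv)` or by passing `modSv` as the design matrix to a limma fit.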

10.9 years ago • Jeff Leek ▴ 650


Hi Jeff,

First of all, thank you for your quick answer. I still have some doubts, though. I understand that option A) is the most appropriate choice if one wants to explicitly include the variables Population and Genotype in the final linear model. However, this is not my ultimate goal: I do not want to globally test the variance in gene expression associated with these variables by including them in my linear model. I just want, in general terms, to remove any source of expression variance not associated with my primary variable of interest (InvGeno), coming from both known and unknown sources, whether variables of biological origin (Population, Gender) or of non-biological origin (batch effect). With that purpose in mind, it's not clear to me why option A) is a better strategy than option C). Could you provide further insight on that?

Best,
Meri

On Sep 2, 2013, at 8:18 PM, Jeff Leek wrote:
> [...]
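One way to probe the A-versus-C question empirically is to check how much of each estimated surrogate variable is explained by the known covariates. The base-R sketch below does that with per-SV regressions. It is only an illustration: `pheno` is simulated, and `sv` is a hypothetical stand-in for `svobj$sv` from an option C run, deliberately built so that its first column tracks Gender and its second is pure noise.

```r
set.seed(2)
n <- 60

# Simulated known covariates; names follow the thread, values are made up
pheno <- data.frame(
  Gender     = rep(c("F", "M"), each = n / 2),
  Population = rep(c("P1", "P2", "P3"), times = n / 3)
)

# ASSUMPTION: placeholder for svobj$sv; column 1 follows Gender, column 2 is noise
sv <- cbind(
  as.numeric(pheno$Gender == "M") + rnorm(n, sd = 0.3),
  rnorm(n)
)

# R-squared of each surrogate variable regressed on the known covariates
r2 <- sapply(seq_len(ncol(sv)), function(k) {
  fit <- lm(sv[, k] ~ as.factor(Gender) + as.factor(Population), data = pheno)
  summary(fit)$r.squared
})
names(r2) <- paste0("SV", seq_len(ncol(sv)))
print(round(r2, 2))
```

A surrogate variable with a high R-squared is largely re-estimating something already recorded in `pheno`; a low value suggests it reflects some other source of variation.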

10.9 years ago • Meritxell Oliva ▴ 120


Dear Olivia, dear Jeff,

Though this thread is a bit dated, I am facing the same problem here, and maybe Olivia found an answer a long time ago (I certainly hope so for you :)

I am wondering how you determined which factors have an influence on your data in the first place. How can we determine the influence of surrogate factors on our data before letting SVA correct for them?

I would love to hear your answers!
Best,
Sebastian

5.7 years ago • Sebastian Hesse ▴ 70
