Question

ComBat: 3 adjustment variables & continuous adjustment variables

0


Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hi! I'm writing with a few questions about applying ComBat (sva package) to a set of ~180 samples run on the Illumina Infinium HumanMethylation450 BeadChip array (~450,000 DNA methylation data points). There is a large amount of variation in my data due to the plate the samples were run on (3 different plates), the chip they were run on (24 different chips), and the position they occupied on the chip, specifically the row (6 different rows). The chips are set up in a 6 row * 2 column format like this:

sample 01  sample 02
sample 03  sample 04
sample 05  sample 06
sample 07  sample 08
sample 09  sample 10
sample 11  sample 12

I read Dr. Evan Johnson's suggestions to someone else with this "multiple-batch-effect-variable" problem in the ComBat Google group (https://groups.google.com/forum/#!topic/combat-user-forum/PcTxNlaUmAI). He had 2 suggestions:

- Combine the two batch variables into one, if 3-4 reps are left in each batch
- Use ComBat multiple times, adjusting for the first batch using the other batch variables as covariates, and then adjust for the second batch, and so on

I cannot go with the first suggestion because combining the batch variables would create too many categories, and I would not have enough replicates per batch category.

I am seeking advice on the following points:

- The Google group post is now a few years old. Is the step-wise correction still thought to be a valid approach?
- The Google group post was also asking about adjusting for 2, not 3, batch variables. Does it concern anyone more if I apply ComBat 3 times?
- Row would be better treated as a continuous adjustment variable than a factor. In the version of sva that I am using (3.0.2) I believe that only factor adjustment variables are supported. I have seen mention in a few forums that there might be an update to ComBat to adjust for a numeric batch variable. Is one available?

Thank you in advance for your help!
Magda Price, UBC

-- output of sessionInfo():

R version 2.14.0 (2011-10-31)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] sva_3.0.2 mgcv_1.7-22 corpcor_1.6.4 wateRmelon_1.2.2
[5] IlluminaHumanMethylation450k.db_1.4.6 org.Hs.eg.db_2.6.4 RSQLite_0.11.2 DBI_0.2-5
[9] AnnotationDbi_1.16.19 matrixStats_0.6.2 ROC_1.30.0 limma_3.10.3
[13] RColorBrewer_1.0-5 gplots_2.11.0 MASS_7.3-16 KernSmooth_2.23-6
[17] caTools_1.14 gdata_2.12.0 gtools_2.7.1 compare_0.2-3
[21] lattice_0.20-10 lumi_2.6.0 nleqslv_2.0 methylumi_2.0.13
[25] Biobase_2.14.0

loaded via a namespace (and not attached):
[1] affy_1.32.1 affyio_1.22.0 annotate_1.32.3 BiocInstaller_1.2.1 bitops_1.0-5 hdrcde_2.15 IRanges_1.12.6 Matrix_1.0-5
[9] nlme_3.1-108 preprocessCore_1.16.0 R.methodsS3_1.4.2 tools_2.14.0 xtable_1.7-1 zlibbioc_1.0.1

-- Sent via the guest posting facility at bioconductor.org.
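To make the "too many categories" concern concrete, here is a minimal base-R sketch that counts how many samples would fall into each combined batch category. The layout is hypothetical: the post gives only the totals (3 plates, 24 chips, 6 rows, 2 columns), so the even split of 8 chips per plate and the random choice of 180 occupied positions are assumptions made for illustration.

```r
# Hypothetical layout: 24 chips x 6 rows x 2 columns = 288 positions.
# The split of 8 chips per plate is an assumption; the post only gives totals.
layout <- expand.grid(column = 1:2, row = 1:6, chip = 1:24)
layout$plate <- (layout$chip - 1) %/% 8 + 1

# Keep 180 of the 288 positions to mimic the subset analysed in the post.
set.seed(1)
des <- layout[sort(sample(nrow(layout), 180)), ]

# One combined batch label per sample (unused combinations dropped).
combined <- interaction(des$plate, des$chip, des$row, drop = TRUE)

n_categories     <- nlevels(combined)
max_per_category <- max(table(combined))

cat("combined categories:  ", n_categories, "\n")
cat("largest category size:", max_per_category, "\n")
```

Because each chip row holds only two samples, no combined category can contain more than two replicates under this layout, which is below the 3-4 replicates per batch mentioned in the first suggestion.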

GO Category sva • 4.8k views

ADD COMMENT • link updated 10.1 years ago by Steve Lianoglou ★ 13k • written 10.1 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2014-03-18

0


Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 13 months ago

United States

Hi Magda,

You are using a version of R (2.14) that is horribly out of date, and as a result your Bioconductor packages are frozen to versions that are quite old.

Please update to the latest version of R (3.0.3) and reinstall your Bioconductor packages using biocLite to ensure that you are running the latest versions of them.

The package you are using (sva v3.0.2) is now at version 3.8.0.

One question you asked:

> - Row would be better treated as a continuous adjustment variable than a factor. In the version of sva that I am using (3.0.2) I believe that only factor adjustment variables are supported. I have seen mention in a few forums that there might be an update to ComBat to adjust for a numeric batch variable, is one available?

is readily answered by reading through the vignette for the current version of the package:

http://bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf

Specifically in Section 7 (Applying the ComBat function to adjust for known batches), where it states:

"""
By default, all adjustment variables will be treated as factor variables by the ComBat function. If you would like to include continuous adjustment variables, also create a vector containing the column numbers of the continuous covariates in the model matrix. This vector must then be input into ComBat via the numCovs option.
"""

HTH,

-steve

--
Steve Lianoglou
Computational Biologist
Genentech
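The vignette passage quoted above can be sketched as follows. This is only an illustration: the design table, the `group` factor and the `age` covariate are hypothetical, and the ComBat call is left as a comment because the `numCovs` argument belongs to the sva versions discussed in this thread and may not exist in other releases (check `?ComBat` for the installed version).

```r
# Hypothetical design: 'group' is a factor, 'age' is a continuous covariate.
des <- data.frame(
  group = factor(rep(c("control", "disease"), each = 4)),
  age   = c(31, 45, 52, 38, 41, 60, 29, 47)
)

# Model matrix for the outcome of interest and other covariates (no batch).
mod <- model.matrix(~ group + age, data = des)

# Column numbers of the continuous covariates in the model matrix.
num_covs <- which(colnames(mod) %in% c("age"))
print(colnames(mod))
print(num_covs)

# With an sva version that has the numCovs argument, the call would look
# like this (not run here; 'dat' and 'batch' are placeholders):
# adjusted <- ComBat(dat, batch, mod, numCovs = num_covs)
```

Note that the vector indexes columns of the model matrix, not columns of the design table, so the intercept and any dummy-coded factor columns shift the position.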

ADD COMMENT • link 10.1 years ago Steve Lianoglou ★ 13k

0


Hi Steve,

Thanks for your advice. I do know that I'm using an old version of R (one of the packages I'm using requires it). However, the options you mention from sva are in fact available in the older version as well; it just wasn't clear to me how to use them.

I've copied the usage and argument information for the ComBat function below; maybe you can help clarify:

ComBat(dat, batch, mod, numCovs=NULL, par.prior=TRUE, prior.plots=FALSE)

dat: Genomic measure matrix (dimensions probe x sample) - for example, expression matrix
batch: Batch covariate (multiple batches allowed)
mod: Model matrix for outcome of interest and other covariates besides batch
numCovs: (Optional) Vector containing the column numbers of the continuous covariates in the model matrix, or NULL if no continuous covariates are used
par.prior: (Optional) TRUE indicates parametric adjustments will be used, FALSE indicates non-parametric adjustments will be used
prior.plots: (Optional) TRUE give prior plots with black as a kernel estimate of the empirical batch effect density and red as the parametric estimate

The model matrix is supposed to contain the outcome of interest and other covariates *besides batch*, but batch is what I need to be a continuous variable. numCovs seems to allow me to specify *covariates* that should be continuous, but not *adjustment variables*. What am I missing?

Thanks again!

--
E. Magda Price
PhD Candidate, Robinson Lab
University of British Columbia
CFRI Room 2071
950 West 28th Ave.
Vancouver BC., V5Z 4H4
(604)-875-3015

ADD REPLY • link 10.1 years ago Magda Price ▴ 60

0


Hi Magda,

I'm curious. How can one specify a batch using a continuous variable? In other words, isn't a particular sample in a batch or not?

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

ADD REPLY • link 10.1 years ago James W. MacDonald 65k

0


Hi Jim,

I have several different "batch" variables. One, for example, is the chip that each sample was run on (there are 24 of these), and I think chip batch should definitely be treated as a factor. Another "batch" variable I would like to adjust for is the position at which the sample was run on the chip (there are 6 different rows). If I use row as a factor, then the effect of being in row 1 vs 2 is treated the same as the effect of 1 vs 6, but the bias I see changes step-wise from row 1, 2, 3, 4, 5, 6. Thus I thought that treating row as a numeric or integer variable would better model the "batch" effect. In other words, row batches have meaning relative to each other whereas chip batches do not.

I guess this would be another reason why using the numCovs option (continuous not integer) might not work in my case?!

Hope that explains things a bit better! Happy to provide any more info & I really appreciate the input.

Magda
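The factor-versus-numeric question raised here can be explored with a small simulation. The sketch below uses made-up data for a single probe (a linear drift of 0.05 per row plus noise, both assumptions) and compares ordinary `lm` fits; it is not ComBat itself. It suggests that coding row as a factor gives each row its own mean, so a step-wise drift is still captured, at the cost of five parameters instead of one.

```r
set.seed(42)
row_pos <- rep(1:6, each = 30)                            # chip row per simulated sample
y <- 0.05 * row_pos + rnorm(length(row_pos), sd = 0.02)   # step-wise drift plus noise

fit_factor  <- lm(y ~ factor(row_pos))   # one mean per row (5 extra parameters)
fit_numeric <- lm(y ~ row_pos)           # a single slope (1 extra parameter)

# Per-row fitted means from the factor model follow the simulated drift.
row_means <- tapply(fitted(fit_factor), row_pos, mean)
print(round(row_means, 3))

# The factor model nests the linear one, so its residual sum of squares
# cannot be larger than that of the numeric model.
rss_factor  <- sum(resid(fit_factor)^2)
rss_numeric <- sum(resid(fit_numeric)^2)
print(c(factor = rss_factor, numeric = rss_numeric))
```

The numeric coding is the more economical choice only if the drift really is linear across rows; the factor coding makes no such assumption.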

ADD REPLY • link 10.1 years ago Magda Price ▴ 60

0


Hi Magda,

The numCovs argument won't work because that is simply used to specify columns in the model matrix (of non-batch things you want to fit in your linear model) that are continuous covariates rather than factors. It has nothing to do with correcting for the batch effect.

And I think you might be thinking about batch effects in the wrong way. If you fit a 'row' effect, then what you are saying is that on average, the measures you get from one row differ from the measures you get from another row. So as an example, row 1 might tend to have higher values because those arrays don't get washed as well, whereas rows 3 and 4 might be dimmer because they get washed more. You then want to estimate how much brighter, on average, the row 1 chips are (and how much dimmer the row 3 and 4 chips are), and adjust the observed data to account for this.

But you do the estimation of these averages using factors, rather than continuous measures (because a chip either is or is not in row 1).

You might just be over-thinking this. I don't see how 3 plates of 24 chips gets you to 180 samples, but regardless it seems like you have enough replication to estimate the batch effects, and still have enough degrees of freedom left over for your comparisons, unless you have some huge number of phenotypic combinations that you are trying to compare (do you?).

Best,

Jim
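The "estimate the average per row, then adjust" idea described above can be sketched for a single probe in base R. This is a deliberately simplified location-only adjustment on simulated values with made-up row offsets; ComBat additionally adjusts batch-specific variances and shrinks the estimates across probes with an empirical Bayes step, none of which is shown here.

```r
set.seed(7)
row_batch <- factor(rep(1:6, each = 20))
offsets   <- c(0.10, 0.04, -0.06, -0.06, 0.00, -0.02)   # made-up row offsets
y <- 0.5 + offsets[as.integer(row_batch)] +
  rnorm(length(row_batch), sd = 0.01)

grand_mean <- mean(y)
batch_mean <- ave(y, row_batch)          # mean of each sample's own row
y_adj <- y - batch_mean + grand_mean     # remove the estimated row offset

# Row means before and after the adjustment.
print(round(tapply(y, row_batch, mean), 3))
print(round(tapply(y_adj, row_batch, mean), 3))
```

Each row is treated as its own category here, which is the factor-based estimation the reply refers to: nothing in the adjustment assumes an ordering of the rows.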

ADD REPLY • link 10.1 years ago James W. MacDonald 65k

0


Hi Jim,

Re numCovs - what you've stated was how I interpreted the use as well, which is why I didn't think it would be helpful.

As usual with these types of human disease datasets, the study design is not ideal, and more complicated than I initially let on! The 180 samples are a combination of 3 phenotype groups (1 control + 2 diseased) and 5 different tissues. Other samples, unrelated to this project, were also run on these chips, which is why I'm working with fewer samples than the total that were run (which was 288).

Here's a simplified version of what my ComBat code looks like:

    #1 - correct for plate effect, keeping row and chip as covariates
    mod.1 <- model.matrix(~tissue+group+row+chip, data=des)
    bat.1 <- ComBat(data, des$plate, mod.1)

    #2 - correct for row effect on the plate-adjusted data
    mod.2 <- model.matrix(~tissue+group+chip, data=des)
    bat.2 <- ComBat(dat=bat.1, des$row, mod.2)

    #3 - correct for chip effect on the row-adjusted data
    mod.3 <- model.matrix(~tissue+group, data=des)
    bat.3 <- ComBat(dat=bat.2, des$chip, mod.3)

We know from some pilot studies that the effect size (i.e. differential methylation between disease vs control samples in a given tissue) is small, so I am concerned about being thorough in the batch correction. I'm new to batch correction and you've correctly understood my concern about the row effect; so it sounds to me that how I have modeled the effect in the code above (i.e. each batch variable as a factor) was correct. Any corrections/suggestions for what I've done above?

Thanks!
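One thing that may be worth checking in a step-wise scheme like the one above is whether the batch variable can be separated from the covariates in the same step. The base-R sketch below assumes that chips are nested within plates (8 chips per plate, which the thread does not state) and uses a full hypothetical 24 x 6 layout. It builds the plate indicators alongside row and chip covariates, as in step #1, and compares the rank of the combined matrix with its number of columns.

```r
# Hypothetical layout: 24 chips, 8 per plate (nesting assumed, not stated
# in the thread), 6 rows per chip.
des <- data.frame(chip = factor(rep(1:24, each = 6)))
des$plate <- factor((as.integer(des$chip) - 1) %/% 8 + 1)
des$row   <- factor(rep(1:6, times = 24))

batch_cols <- model.matrix(~ 0 + plate, data = des)          # batch indicators
covar_cols <- model.matrix(~ row + chip, data = des)[, -1]   # covariates, no intercept

design      <- cbind(batch_cols, covar_cols)
rank_design <- qr(design)$rank

cat("columns:", ncol(design), " rank:", rank_design, "\n")
```

Under this assumed nesting the rank is smaller than the number of columns, because each plate indicator is a sum of chip indicators. In that situation the plate effect cannot be estimated separately from the chip terms, and adjusting for chip alone would already absorb plate-level differences. Whether this applies depends on how the real chips were distributed across plates.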

ADD REPLY • link 10.1 years ago Magda Price ▴ 60

0


Hi Magda,

I'm not sure you need to do things sequentially like that. From what I can tell, you should just be able to do

    mod <- model.matrix(~tissue, des)
    bat <- ComBat(data, des[,c("plate","row","chip")], mod)

And go from there.

Best,

Jim

On 3/18/2014 6:04 PM, Magda Price wrote:
> Hi Jim,
>
> Re numCovs - what you've stated was how I interpreted the use as well, which is why I didn't think it would be helpful.
>
> As usual with these types of human disease datasets, the study design is not ideal, and more complicated than I initially let on! The 180 samples are a combination of 3 phenotype groups (1 control + 2 diseased) and 5 different tissues. Other samples, unrelated to this project, were also run on these chips, which is why I'm working with fewer samples than the total that were run (which was 288).
>
> Here's a simplified version of what my ComBat code looks like:
>
>     #1 - correct for plate effect
>     mod.1 <- model.matrix(~tissue+group+row+chip, data=des)
>     bat.1 <- ComBat(data, des$plate, mod.1)
>
>     #2 - correct for row effect
>     mod.2 <- model.matrix(~tissue+group+chip, data=des)
>     bat.2 <- ComBat(dat=bat.1, des$row, mod.2)
>
>     #3 - correct for chip
>     mod.3 <- model.matrix(~tissue+group, data=des)
>     bat.3 <- ComBat(dat=bat.2, des$chip, mod.3)
>
> We know from some pilot studies that the effect size (i.e. differential methylation between disease vs control samples in a given tissue) is small, so I am concerned about being thorough in the batch correction. I'm new to batch correction and you've correctly understood my concern about the row effect; so it sounds to me that how I have modeled the effect in the code above (i.e. each batch variable as a factor) was correct. Any corrections/suggestions for what I've done above?
>
> Thanks!
>
> On Tue, Mar 18, 2014 at 2:27 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
>> Hi Magda,
>>
>> The numCovs argument won't work because that is simply used to specify columns in the model matrix (of non-batch things you want to fit in your linear model) that are continuous covariates rather than fixed effects. It has nothing to do with correcting for the batch effect.
>>
>> And I think you might be thinking about batch effects in the wrong way. If you fit a 'row' effect, then what you are saying is that on average, the measures you get from one row differ from the measures you get from another row. So as an example, row 1 might tend to have higher values because those arrays don't get washed as well, whereas rows 3 and 4 might be dimmer because they get washed more. You then want to estimate how much brighter, on average, the row 1 chips are (and how much dimmer the row 3 and 4 chips are), and adjust the observed data to account for this.
>>
>> But you do the estimation of these averages using factors, rather than continuous measures (because a chip either is or is not in row 1).
>>
>> You might just be over-thinking this. I don't see how 3 plates of 24 chips gets you to 180 samples, but regardless it seems like you have enough replication to estimate the batch effects, and still have enough degrees of freedom left over for your comparisons, unless you have some huge number of phenotypic combinations that you are trying to compare (do you?).
>>
>> Best,
>>
>> Jim
>>
>> On Tuesday, March 18, 2014 2:13:11 PM, Magda Price wrote:
>>> Hi Jim,
>>>
>>> I have several different "batch" variables - one for example is the chip that each sample was run on (there are 24 of these) and I think chip batch should definitely be treated as a factor. Another "batch" variable I would like to adjust for is the position the sample was run on the chip (there are 6 different rows). If I use row as a factor, then the effect of being in row 1 vs 2 is treated the same as the effect of 1 vs 6, but the bias I see changes step-wise from row 1, 2, 3, 4, 5, 6, thus I thought that treating row as a numeric or integer variable would better model the "batch" effect. In other words, row batches have meaning relative to each other whereas chip batches do not.
>>>
>>> I guess this would be another reason why using the numCovs option (continuous not integer) might not work in my case?!
>>>
>>> Hope that explains things a bit better! Happy to provide any more info & I really appreciate the input.
>>>
>>> Magda
>>>
>>> On Tue, Mar 18, 2014 at 10:51 AM, James W. MacDonald <jmacdon at uw.edu> wrote:
>>>> Hi Magda,
>>>>
>>>> I'm curious. How can one specify a batch using a continuous variable? In other words, isn't a particular sample in a batch or not?
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>> On 3/18/2014 1:44 PM, Magda Price wrote:
>>>>> Hi Steve,
>>>>>
>>>>> Thanks for your advice. I do know that I'm using an old version of R (one of the packages I'm using requires it); however, the options you mention from sva are in fact available in the older version as well, but it wasn't clear to me how to use them.
>>>>>
>>>>> I've copied the usage and argument information for the ComBat function below, maybe you can help clarify:
>>>>>
>>>>>     ComBat(dat, batch, mod, numCovs=NULL, par.prior=TRUE, prior.plots=FALSE)
>>>>>
>>>>>     dat          Genomic measure matrix (dimensions probe x sample) - for example, expression matrix
>>>>>     batch        Batch covariate (multiple batches allowed)
>>>>>     mod          Model matrix for outcome of interest and other covariates besides batch
>>>>>     numCovs      (Optional) Vector containing the column numbers of the continuous covariates in the model matrix, or NULL if no continuous covariates are used
>>>>>     par.prior    (Optional) TRUE indicates parametric adjustments will be used, FALSE indicates non-parametric adjustments will be used
>>>>>     prior.plots  (Optional) TRUE give prior plots with black as a kernel estimate of the empirical batch effect density and red as the parametric estimate
>>>>>
>>>>> The model matrix is supposed to contain the outcome of interest and other covariates besides batch, but batch is what I need to be a continuous variable. numCovs seems to allow me to specify covariates that should be continuous, but not adjustment variables. What am I missing?
>>>>>
>>>>> Thanks again!
>>>>>
>>>>> On Tue, Mar 18, 2014 at 9:48 AM, Steve Lianoglou <lianoglou.steve at gene.com> wrote:
>>>>>> Hi Magda,
>>>>>>
>>>>>> You are using a version of R (2.14) that is horribly out of date, and as a result your bioconductor packages are frozen to versions that are quite old.
>>>>>>
>>>>>> Please update to the latest version of R (3.0.3) and reinstall your bioconductor packages using biocLite to ensure that you are running the latest version of them.
>>>>>>
>>>>>> The package you are using (sva v3.0.2) is now at version 3.8.0.
>>>>>>
>>>>>> One question you asked:
>>>>>>
>>>>>>> - Row would be better treated as a continuous adjustment variable than a factor. In the version of sva that I am using (3.0.2) I believe that only factor adjustment variables are supported. I have seen mention in a few forums that there might be an update to ComBat to adjust for a numeric batch variable, is one available?
>>>>>>
>>>>>> Is readily answered by reading through the vignette for the current version of the package:
>>>>>>
>>>>>> http://bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf
>>>>>>
>>>>>> Specifically in Section 7 (Applying the ComBat function to adjust for known batches), where it states:
>>>>>>
>>>>>> """
>>>>>> By default, all adjustment variables will be treated as factor variables by the ComBat function. If you would like to include continuous adjustment variables, also create a vector containing the column numbers of the continuous covariates in the model matrix. This vector must then be input into ComBat via the numCovs option.
>>>>>> """
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> -steve
>>>>>>
>>>>>> --
>>>>>> Steve Lianoglou
>>>>>> Computational Biologist
>>>>>> Genentech

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
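
The factor-versus-numeric question about the row variable raised in this thread can be illustrated with base R alone. The sketch below is not ComBat's internal code; it only compares how a linear model encodes row position when it is a factor and when it is a number. The vector `row_pos`, the simulated response `y`, and the size of the drift are all made up for illustration.

```r
# Hypothetical example: 24 samples, 4 in each of the 6 chip rows.
row_pos <- rep(1:6, times = 4)

# As a factor: an intercept plus one indicator column per non-reference row.
X_factor <- model.matrix(~ factor(row_pos))
# As a number: an intercept plus a single slope column.
X_numeric <- model.matrix(~ row_pos)

ncol(X_factor)   # 6 columns: each row gets its own mean
ncol(X_numeric)  # 2 columns: rows are forced onto a straight line

# Simulated step-wise drift across rows (made-up effect size and noise).
set.seed(1)
y <- 0.1 * row_pos + rnorm(length(row_pos), sd = 0.02)

fit_factor  <- lm(y ~ factor(row_pos))
fit_numeric <- lm(y ~ row_pos)

# The numeric model is nested in the factor model, so the factor fit can
# follow a step-wise trend as well; it just spends 4 more degrees of freedom.
anova(fit_numeric, fit_factor)
```

Because the straight-line model is a special case of the one-mean-per-row model, treating row as a factor does not assume that row 1 vs 2 differs by the same amount as row 1 vs 6; each row simply receives its own estimate.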

ADD REPLY • link 10.1 years ago James W. MacDonald 65k

Unfortunately that doesn't work. Only one batch variable is allowed. Thanks for your suggestion though.

--
E. Magda Price
PhD Candidate, Robinson Lab
University of British Columbia
CFRI Room 2071
950 West 28th Ave.
Vancouver BC., V5Z 4H4
(604)-875-3015
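
Given that only a single batch variable is accepted, the alternative of merging the batch variables into one factor can be checked by tabulating replicate counts. The sketch below is an illustration under stated assumptions, not the actual study design: it assumes 3 plates of 8 chips each and a 6 row by 2 column chip, which gives the 288 wells mentioned in the thread. The real `des` table is not shown here, so `layout`, `des` and `combined` are hypothetical names.

```r
# Hypothetical layout: 24 chips x 6 rows x 2 columns = 288 wells,
# with chips assumed to be split evenly over 3 plates (8 chips per plate).
layout <- expand.grid(col = 1:2, row = 1:6, chip = 1:24)
layout$plate <- (layout$chip - 1) %/% 8 + 1

des <- data.frame(
  plate = factor(layout$plate),
  chip  = factor(layout$chip),
  row   = factor(layout$row)
)

# Collapse chip and row into one label per sample; under the assumed
# layout chip already determines plate, so plate adds no further levels.
combined <- interaction(des$chip, des$row, drop = TRUE)

nlevels(combined)        # number of merged batch categories
range(table(combined))   # samples per merged category
range(table(des$plate))  # samples per plate
range(table(des$chip))   # samples per chip
range(table(des$row))    # samples per row
```

Under these assumptions the merged factor has 144 categories with only 2 samples each, even before dropping the wells that belong to other projects, whereas each variable on its own keeps at least 12 samples per level. That is the replicate problem that rules out the "combine the batch variables" suggestion for this design.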

ADD REPLY • link 10.1 years ago Magda Price ▴ 60