Confirming a batch effect has been removed

0

Entering edit mode

Hoyles, Lesley ▴ 50

@hoyles-lesley-5445

Last seen 10.6 years ago

Hi I am working with single-channel Agilent data. I have two batches of microarrays for biopsies from different patients (batch 1, 25 arrays; batch 2, 34 arrays). There are no replicate samples for any of the patients in either of the batches. The patients are classified according to disease score. Using MDS (and confirmed with PCA) I have found there is a pronounced batch effect. Initially, I tried removing the batch effect using METHOD 1. Using this method I found I was getting 1000s rather than the expected 10s or 100s of genes being differentially expressed. However, I was able to demonstrate with MDS that I had eliminated the batch effect as I was able to plot yave. Using METHOD 2, I got the number of genes I expected to see differentially expressed (based on what I'd seen on analyses with the batches treated as separate entities). My main questions are: How can I show I have removed the batch effect from the dataset using METHOD 2, as I have not made any changes to the expression matrix and have incorporated the batch effect into the model? What value of fit2$df.prior is seen as demonstrating more significant differential expression? I have seen on a list answer that 'A larger value will result in more significant differential expression, and a small value will result in less differential expression'. I'd also be grateful of an explanation as to why I'm seeing such low adjusted P values using METHOD 1. Thanks in advance for your assistance. Best wishes Lesley METHOD 1 ######## condition.f <- factor(targets$score_score, levels = unique(targets$score_score)) batch.f <- factor(targets$Batch, levels = unique(targets$Batch)) design <- model.matrix(~0+condition.f) y0.batch <- removeBatchEffect(y0, batch.f, design=design) colnames(design) <- levels(condition.f) yave <- avereps(y0.batch, ID=y0$genes[,"GeneName"]) fit <- lmFit(yave, design) contrast.matrix <- makeContrasts(score_1-score_0, score_2-score_0, score_3-score_0, score_2-score_1, score_3-score_1, score_3-score_2, score_5-score_0, score_5-score_1, score_5-score_2, score_5-score_3, levels=design) contrast.matrix <- makeContrasts(score_1-score_0, score_2-score_0, score_3-score_0, score_2-score_1, score_3-score_1, score_3-score_2, levels=design) fit2 <- contrasts.fit(fit, contrast.matrix) fit2 <- eBayes(fit2) fit2$df.prior fit2$s2.prior Summary results from normalized data with controls included score_1 - score_0 score_2 - score_0 score_3 - score_0 score_2 - score_1 -1 2556 1736 726 0 0 26331 27969 29127 30574 1 1687 869 721 0 score_3 - score_1 score_3 - score_2 score_5 - score_0 score_5 - score_1 -1 404 0 879 12 0 29323 30572 29401 30492 1 847 2 294 70 score_5 - score_2 score_5 - score_3 -1 29 123 0 30463 30365 1 82 86 METHOD 2 ######## condition.f <- factor(targets$score_score, levels = unique(targets$score_score)) batch.f <- factor(targets$Batch, levels = unique(targets$Batch)) design <- model.matrix(~0+condition.f+batch.f) colnames(design) <- c("score_3", "score_2", "score_1", "score_0", "score_5", "Batch") y.ave <- avereps(y, ID=y$genes[,"GeneName"]) fit <- lmFit(y.ave, design) write.table(fit, file = "QC/Normalized_controls_included/Faulty_remove d_batch/Faulty_removed_batch_fit.txt", sep = "\t") contrast.matrix <- makeContrasts(a = score_1-score_0, b = score_2-score_0, c = score_3-score_0, d = score_2-score_1, e = score_3-score_1, f = score_3-score_2, g = score_5-score_0, h = score_5-score_1, i = score_5-score_2, j = score_5-score_3, levels=design) fit2 <- contrasts.fit(fit, contrast.matrix) fit2 <- eBayes(fit2) fit2$df.prior fit2$s2.prior Summary results from normalized data with controls included a b c d e f g h i j -1 0 0 23 0 1 0 5 6 3 6 0 30574 30574 30493 30574 30564 30574 30512 30499 30503 30506 1 0 0 58 0 9 0 57 69 68 62 ########################################################## R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.12.1 [[alternative HTML version deleted]]

• 1.2k views

ADD COMMENT • link updated 12.6 years ago by James W. MacDonald 68k • written 12.6 years ago by Hoyles, Lesley ▴ 50

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

Hi Lesley, On 8/9/2012 10:35 AM, Hoyles, Lesley wrote: > Hi > > I am working with single-channel Agilent data. I have two batches of microarrays for biopsies from different patients (batch 1, 25 arrays; batch 2, 34 arrays). There are no replicate samples for any of the patients in either of the batches. The patients are classified according to disease score. Using MDS (and confirmed with PCA) I have found there is a pronounced batch effect. > > Initially, I tried removing the batch effect using METHOD 1. Using this method I found I was getting 1000s rather than the expected 10s or 100s of genes being differentially expressed. However, I was able to demonstrate with MDS that I had eliminated the batch effect as I was able to plot yave. Using METHOD 2, I got the number of genes I expected to see differentially expressed (based on what I'd seen on analyses with the batches treated as separate entities). > > My main questions are: > How can I show I have removed the batch effect from the dataset using METHOD 2, as I have not made any changes to the expression matrix and have incorporated the batch effect into the model? I'm not sure you need to show anything. It is not controversial to use a blocking factor to account for batches. The only issue would arise if say score_1 samples are only found in batch 1 and score_0 samples are only found in batch 2. If you have that sort of situation, then the biological and technical differences are confounded, and you have real problems, as it is not possible to say if any differences are biological and not technical. > > What value of fit2$df.prior is seen as demonstrating more significant differential expression? I have seen on a list answer that 'A larger value will result in more significant differential expression, and a small value will result in less differential expression'. > > I'd also be grateful of an explanation as to why I'm seeing such low adjusted P values using METHOD 1. Well, you shouldn't be using METHOD 1 in the first place. From the help page for removeBatchEffects(): This function is useful for removing batch effects, associated with hybridization time or other technical variables, prior to clustering or unsupervised analysis such as PCA or heatmaps. It is not intended to use with linear modelling. For linear modelling, it is better to include the batch factors in the linear model. Best, Jim > > Thanks in advance for your assistance. > > Best wishes > Lesley > > > METHOD 1 > ######## > condition.f<- factor(targets$score_score, levels = unique(targets$score_score)) > batch.f<- factor(targets$Batch, levels = unique(targets$Batch)) > design<- model.matrix(~0+condition.f) > y0.batch<- removeBatchEffect(y0, batch.f, design=design) > colnames(design)<- levels(condition.f) > yave<- avereps(y0.batch, ID=y0$genes[,"GeneName"]) > fit<- lmFit(yave, design) > contrast.matrix<- makeContrasts(score_1-score_0, score_2-score_0, score_3-score_0, score_2-score_1, score_3-score_1, score_3-score_2, score_5-score_0, score_5-score_1, score_5-score_2, score_5-score_3, levels=design) > contrast.matrix<- makeContrasts(score_1-score_0, score_2-score_0, score_3-score_0, score_2-score_1, score_3-score_1, score_3-score_2, levels=design) > fit2<- contrasts.fit(fit, contrast.matrix) > fit2<- eBayes(fit2) > fit2$df.prior > fit2$s2.prior > > > Summary results from normalized data with controls included > score_1 - score_0 score_2 - score_0 score_3 - score_0 score_2 - score_1 > -1 2556 1736 726 0 > 0 26331 27969 29127 30574 > 1 1687 869 721 0 > score_3 - score_1 score_3 - score_2 score_5 - score_0 score_5 - score_1 > -1 404 0 879 12 > 0 29323 30572 29401 30492 > 1 847 2 294 70 > score_5 - score_2 score_5 - score_3 > -1 29 123 > 0 30463 30365 > 1 82 86 > > > METHOD 2 > ######## > condition.f<- factor(targets$score_score, levels = unique(targets$score_score)) > batch.f<- factor(targets$Batch, levels = unique(targets$Batch)) > design<- model.matrix(~0+condition.f+batch.f) > colnames(design)<- c("score_3", "score_2", "score_1", "score_0", "score_5", "Batch") > y.ave<- avereps(y, ID=y$genes[,"GeneName"]) > fit<- lmFit(y.ave, design) > write.table(fit, file = "QC/Normalized_controls_included/Faulty_remo ved_batch/Faulty_removed_batch_fit.txt", sep = "\t") > contrast.matrix<- makeContrasts(a = score_1-score_0, b = score_2-score_0, c = score_3-score_0, d = score_2-score_1, e = score_3-score_1, f = score_3-score_2, g = score_5-score_0, h = score_5-score_1, i = score_5-score_2, j = score_5-score_3, levels=design) > fit2<- contrasts.fit(fit, contrast.matrix) > fit2<- eBayes(fit2) > fit2$df.prior > fit2$s2.prior > > Summary results from normalized data with controls included > a b c d e f g h i j > -1 0 0 23 0 1 0 5 6 3 6 > 0 30574 30574 30493 30574 30564 30574 30512 30499 30503 30506 > 1 0 0 58 0 9 0 57 69 68 62 > > ########################################################## > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C > [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 > [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] limma_3.12.1 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 12.6 years ago James W. MacDonald 68k

0

Entering edit mode

Hi Jim Thanks so much for the feedback. It's been very helpful. >The only issue would arise if say score_1 samples are only found in >batch 1 and score_0 samples are only found in batch 2. Thankfully, I don't have that problem with my data. Best wishes Lesley ________________________________________ From: James W. MacDonald [jmacdon@uw.edu] Sent: 09 August 2012 16:04 To: Hoyles, Lesley Cc: bioconductor at r-project.org Subject: Re: [BioC] Confirming a batch effect has been removed Hi Lesley, On 8/9/2012 10:35 AM, Hoyles, Lesley wrote: > Hi > > I am working with single-channel Agilent data. I have two batches of microarrays for biopsies from different patients (batch 1, 25 arrays; batch 2, 34 arrays). There are no replicate samples for any of the patients in either of the batches. The patients are classified according to disease score. Using MDS (and confirmed with PCA) I have found there is a pronounced batch effect. > > Initially, I tried removing the batch effect using METHOD 1. Using this method I found I was getting 1000s rather than the expected 10s or 100s of genes being differentially expressed. However, I was able to demonstrate with MDS that I had eliminated the batch effect as I was able to plot yave. Using METHOD 2, I got the number of genes I expected to see differentially expressed (based on what I'd seen on analyses with the batches treated as separate entities). > > My main questions are: > How can I show I have removed the batch effect from the dataset using METHOD 2, as I have not made any changes to the expression matrix and have incorporated the batch effect into the model? I'm not sure you need to show anything. It is not controversial to use a blocking factor to account for batches. The only issue would arise if say score_1 samples are only found in batch 1 and score_0 samples are only found in batch 2. If you have that sort of situation, then the biological and technical differences are confounded, and you have real problems, as it is not possible to say if any differences are biological and not technical. > > What value of fit2$df.prior is seen as demonstrating more significant differential expression? I have seen on a list answer that 'A larger value will result in more significant differential expression, and a small value will result in less differential expression'. > > I'd also be grateful of an explanation as to why I'm seeing such low adjusted P values using METHOD 1. Well, you shouldn't be using METHOD 1 in the first place. From the help page for removeBatchEffects(): This function is useful for removing batch effects, associated with hybridization time or other technical variables, prior to clustering or unsupervised analysis such as PCA or heatmaps. It is not intended to use with linear modelling. For linear modelling, it is better to include the batch factors in the linear model. Best, Jim > > Thanks in advance for your assistance. > > Best wishes > Lesley > > > METHOD 1 > ######## > condition.f<- factor(targets$score_score, levels = unique(targets$score_score)) > batch.f<- factor(targets$Batch, levels = unique(targets$Batch)) > design<- model.matrix(~0+condition.f) > y0.batch<- removeBatchEffect(y0, batch.f, design=design) > colnames(design)<- levels(condition.f) > yave<- avereps(y0.batch, ID=y0$genes[,"GeneName"]) > fit<- lmFit(yave, design) > contrast.matrix<- makeContrasts(score_1-score_0, score_2-score_0, score_3-score_0, score_2-score_1, score_3-score_1, score_3-score_2, score_5-score_0, score_5-score_1, score_5-score_2, score_5-score_3, levels=design) > contrast.matrix<- makeContrasts(score_1-score_0, score_2-score_0, score_3-score_0, score_2-score_1, score_3-score_1, score_3-score_2, levels=design) > fit2<- contrasts.fit(fit, contrast.matrix) > fit2<- eBayes(fit2) > fit2$df.prior > fit2$s2.prior > > > Summary results from normalized data with controls included > score_1 - score_0 score_2 - score_0 score_3 - score_0 score_2 - score_1 > -1 2556 1736 726 0 > 0 26331 27969 29127 30574 > 1 1687 869 721 0 > score_3 - score_1 score_3 - score_2 score_5 - score_0 score_5 - score_1 > -1 404 0 879 12 > 0 29323 30572 29401 30492 > 1 847 2 294 70 > score_5 - score_2 score_5 - score_3 > -1 29 123 > 0 30463 30365 > 1 82 86 > > > METHOD 2 > ######## > condition.f<- factor(targets$score_score, levels = unique(targets$score_score)) > batch.f<- factor(targets$Batch, levels = unique(targets$Batch)) > design<- model.matrix(~0+condition.f+batch.f) > colnames(design)<- c("score_3", "score_2", "score_1", "score_0", "score_5", "Batch") > y.ave<- avereps(y, ID=y$genes[,"GeneName"]) > fit<- lmFit(y.ave, design) > write.table(fit, file = "QC/Normalized_controls_included/Faulty_remo ved_batch/Faulty_removed_batch_fit.txt", sep = "\t") > contrast.matrix<- makeContrasts(a = score_1-score_0, b = score_2-score_0, c = score_3-score_0, d = score_2-score_1, e = score_3-score_1, f = score_3-score_2, g = score_5-score_0, h = score_5-score_1, i = score_5-score_2, j = score_5-score_3, levels=design) > fit2<- contrasts.fit(fit, contrast.matrix) > fit2<- eBayes(fit2) > fit2$df.prior > fit2$s2.prior > > Summary results from normalized data with controls included > a b c d e f g h i j > -1 0 0 23 0 1 0 5 6 3 6 > 0 30574 30574 30493 30574 30564 30574 30512 30499 30503 30506 > 1 0 0 58 0 9 0 57 69 68 62 > > ########################################################## > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C > [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 > [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] limma_3.12.1 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 12.6 years ago Hoyles, Lesley ▴ 50

Login before adding your answer.