Question: limma design and contrast matrix for paired experiment
0
6.4 years ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:
Dear David, No, your design matrix is incorrect because it ignores the pairing by sample in your experimental design. You can treat Sample as either a fixed or a random factor. The fixed approach is closely analogous to a classical paired t-test. For an experiment like yours, the fixed approach is explained in Section 3.5 of the edgeR User's Guide: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst /doc/edgeRUsersGuide.pdf You can follow the same construction of the design matrix even though you are using limma. The random approach is a bit more aggressive. For an experiment like yours, the random approach is explained in Section 8.7 of the limma User's Guide: http://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst /doc/usersguide.pdf I would probably recommend the first approach for your data. However the second approach is necessary if you want to test for differences between the two cell lines. Best wishes Gordon > Date: Fri, 11 Jan 2013 18:31:23 +0100 > From: David Westergaard <david at="" harsk.dk=""> > To: bioconductor at r-project.org > Subject: [BioC] limma design and contrast matrix for paired > experiment > > Hello, > > I am analysing microarray data performed on two cell cultures, in > which the gene expression were measured before (C) and after treatment > (T), so that the targets look like this: > > Cell-line Treatment Sample > Cell 1 C 1 > Cell 1 T 1 > Cell 2 C 2 > Cell 2 T 2 > Cell 1 C 3 > Cell 1 T 3 > Cell 2 C 4 > Cell 2 T 4 > Cell 1 C 5 > Cell 1 T 5 > Cell 2 C 6 > Cell 2 T 6 > > All experiments were performed on single-channel Agilent arrays, with > 4 samples pr. slide. I am interested in determining the differentially > expressed genes between Cell1 before and after treatment, as well as > Cell2 before and after treatment. This is the preliminary code: > > # Load and normalize data > RG <- read.maimages(targets$FileName,source="agilent.median",green.o nly=TRUE) > # Assume there is a col called FileName in the targets section > RG <- backgroundCorrect(RG, method="normexp", offset=16) > RGNorm <- normalizeBetweenArrays(RG, method="quantile") > RGNorm.ave <- avereps(RGNorm, ID=RGNorm$genes$ProbeName) > > # Create design > Pairing <- paste(rep(c('C1-','C1-','C2-','C2-'),3),c(1,1,2,2,1,1,2,2 ,1,1,2,2),rep(c('C','T'),6),sep='') > pair <- factor(Pairing,levels=unique(Pairing)) > design <- model.matrix( ~ 0 + pair ) > colnames(design) <- c('C1.C','C1.T','C2.C','C2.T') > > # Fit data > fit <- lmFit(RGNorm.ave, design=design) > > cont.matrix <- makeContrasts(C1 = C1.T-C1.C, C2=C2.T-C2.C, levels=design) > fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2) > > > For the experiment, is the design/contrast matrix a proper choice to > answer the questions of 'Which genes are differentially expressed in > Cell 1' and 'Which genes are differentially expressed in Cell 2'? > Further, should I do any technical correction, such as > duplicateCorrelation or similar? The reason I am asking is that even > at p<=0.01 I am getting a very high number of differentially expressed > probes (4500ish for Cell 1, and 7500ish for Cell 2, respectively), and > I want to make sure this is biological significance, and not some > technical aspect I have missed. > > Thanks in advance. > > Best, > David > > >> sessionInfo() > R version 2.14.1 (2011-12-22) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] statmod_1.4.16 limma_3.10.3 > > loaded via a namespace (and not attached): > [1] tcltk_2.14.1 tools_2.14.1 > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} microarray limma edger • 4.1k views ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by Gordon Smyth37k Answer: limma design and contrast matrix for paired experiment 0 6.4 years ago by David Westergaard280 wrote: Dear Gordon, Thank you for your reply. Following the example from the edgeR users guide, I revised the model: CellLine <- factor(rep(c('C1','C1','C2','C2'),3),levels=c('C1','C2')) Sample <- factor(c(1,1,2,2,3,3,4,4,5,5,6,6),levels=c(1,2,3,4,5,6)) Treatment <- factor(rep(c('C','T'),6),levels=c('C','T')) design <- model.matrix( ~ CellLine + CellLine:Sample + CellLine:Treatment ) I am however, not, a bit unsure of which coefficients I am interested in. It seems logical that I should be interested in 'CellLineC1:TreatmentT' and 'CellLineC2:TreatmentT' to answer the question of 'Which genes are differentially expressed in CellLine 1 due to treatment X' and 'Which genes are differentially expressed in CellLine 2 due to treatment X'. Is that correct? Best, David 2013/1/13 Gordon K Smyth <smyth at="" wehi.edu.au="">: > Dear David, > > No, your design matrix is incorrect because it ignores the pairing by sample > in your experimental design. > > You can treat Sample as either a fixed or a random factor. The fixed > approach is closely analogous to a classical paired t-test. For an > experiment like yours, the fixed approach is explained in Section 3.5 of the > edgeR User's Guide: > > http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/in st/doc/edgeRUsersGuide.pdf > > You can follow the same construction of the design matrix even though you > are using limma. > > The random approach is a bit more aggressive. For an experiment like yours, > the random approach is explained in Section 8.7 of the limma User's Guide: > > http://www.bioconductor.org/packages/release/bioc/vignettes/limma/in st/doc/usersguide.pdf > > I would probably recommend the first approach for your data. However the > second approach is necessary if you want to test for differences between the > two cell lines. > > Best wishes > Gordon > >> Date: Fri, 11 Jan 2013 18:31:23 +0100 >> From: David Westergaard <david at="" harsk.dk=""> >> To: bioconductor at r-project.org >> Subject: [BioC] limma design and contrast matrix for paired >> experiment >> >> Hello, >> >> I am analysing microarray data performed on two cell cultures, in >> which the gene expression were measured before (C) and after treatment >> (T), so that the targets look like this: >> >> Cell-line Treatment Sample >> Cell 1 C 1 >> Cell 1 T 1 >> Cell 2 C 2 >> Cell 2 T 2 >> Cell 1 C 3 >> Cell 1 T 3 >> Cell 2 C 4 >> Cell 2 T 4 >> Cell 1 C 5 >> Cell 1 T 5 >> Cell 2 C 6 >> Cell 2 T 6 >> >> All experiments were performed on single-channel Agilent arrays, with >> 4 samples pr. slide. I am interested in determining the differentially >> expressed genes between Cell1 before and after treatment, as well as >> Cell2 before and after treatment. This is the preliminary code: >> >> # Load and normalize data >> RG <- >> read.maimages(targets$FileName,source="agilent.median",green.only=T RUE) >> # Assume there is a col called FileName in the targets section >> RG <- backgroundCorrect(RG, method="normexp", offset=16) >> RGNorm <- normalizeBetweenArrays(RG, method="quantile") >> RGNorm.ave <- avereps(RGNorm, ID=RGNorm$genes$ProbeName) >> >> # Create design >> Pairing <- >> paste(rep(c('C1-','C1-','C2-','C2-'),3),c(1,1,2,2,1,1,2,2,1,1,2,2), rep(c('C','T'),6),sep='') >> pair <- factor(Pairing,levels=unique(Pairing)) >> design <- model.matrix( ~ 0 + pair ) >> colnames(design) <- c('C1.C','C1.T','C2.C','C2.T') >> >> # Fit data >> fit <- lmFit(RGNorm.ave, design=design) >> >> cont.matrix <- makeContrasts(C1 = C1.T-C1.C, C2=C2.T-C2.C, levels=design) >> fit2 <- contrasts.fit(fit, cont.matrix) >> fit2 <- eBayes(fit2) >> >> >> For the experiment, is the design/contrast matrix a proper choice to >> answer the questions of 'Which genes are differentially expressed in >> Cell 1' and 'Which genes are differentially expressed in Cell 2'? >> Further, should I do any technical correction, such as >> duplicateCorrelation or similar? The reason I am asking is that even >> at p<=0.01 I am getting a very high number of differentially expressed >> probes (4500ish for Cell 1, and 7500ish for Cell 2, respectively), and >> I want to make sure this is biological significance, and not some >> technical aspect I have missed. >> >> Thanks in advance. >> >> Best, >> David >> >> >>> sessionInfo() >> >> R version 2.14.1 (2011-12-22) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] statmod_1.4.16 limma_3.10.3 >> >> loaded via a namespace (and not attached): >> [1] tcltk_2.14.1 tools_2.14.1 >> > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:6}}
Yes. Gordon On Mon, 14 Jan 2013, David Westergaard wrote: > Dear Gordon, > > Thank you for your reply. Following the example from the edgeR users > guide, I revised the model: > > CellLine <- factor(rep(c('C1','C1','C2','C2'),3),levels=c('C1','C2')) > Sample <- factor(c(1,1,2,2,3,3,4,4,5,5,6,6),levels=c(1,2,3,4,5,6)) > Treatment <- factor(rep(c('C','T'),6),levels=c('C','T')) > > design <- model.matrix( ~ CellLine + CellLine:Sample + CellLine:Treatment ) > > I am however, not, a bit unsure of which coefficients I am interested > in. It seems logical that I should be interested in > 'CellLineC1:TreatmentT' and 'CellLineC2:TreatmentT' to answer the > question of 'Which genes are differentially expressed in CellLine 1 > due to treatment X' and 'Which genes are differentially expressed in > CellLine 2 due to treatment X'. Is that correct? > > Best, > David > > 2013/1/13 Gordon K Smyth <smyth at="" wehi.edu.au="">: >> Dear David, >> >> No, your design matrix is incorrect because it ignores the pairing by sample >> in your experimental design. >> >> You can treat Sample as either a fixed or a random factor. The fixed >> approach is closely analogous to a classical paired t-test. For an >> experiment like yours, the fixed approach is explained in Section 3.5 of the >> edgeR User's Guide: >> >> http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/i nst/doc/edgeRUsersGuide.pdf >> >> You can follow the same construction of the design matrix even though you >> are using limma. >> >> The random approach is a bit more aggressive. For an experiment like yours, >> the random approach is explained in Section 8.7 of the limma User's Guide: >> >> http://www.bioconductor.org/packages/release/bioc/vignettes/limma/i nst/doc/usersguide.pdf >> >> I would probably recommend the first approach for your data. However the >> second approach is necessary if you want to test for differences between the >> two cell lines. >> >> Best wishes >> Gordon >> >>> Date: Fri, 11 Jan 2013 18:31:23 +0100 >>> From: David Westergaard <david at="" harsk.dk=""> >>> To: bioconductor at r-project.org >>> Subject: [BioC] limma design and contrast matrix for paired >>> experiment >>> >>> Hello, >>> >>> I am analysing microarray data performed on two cell cultures, in >>> which the gene expression were measured before (C) and after treatment >>> (T), so that the targets look like this: >>> >>> Cell-line Treatment Sample >>> Cell 1 C 1 >>> Cell 1 T 1 >>> Cell 2 C 2 >>> Cell 2 T 2 >>> Cell 1 C 3 >>> Cell 1 T 3 >>> Cell 2 C 4 >>> Cell 2 T 4 >>> Cell 1 C 5 >>> Cell 1 T 5 >>> Cell 2 C 6 >>> Cell 2 T 6 >>> >>> All experiments were performed on single-channel Agilent arrays, with >>> 4 samples pr. slide. I am interested in determining the differentially >>> expressed genes between Cell1 before and after treatment, as well as >>> Cell2 before and after treatment. This is the preliminary code: >>> >>> # Load and normalize data >>> RG <- >>> read.maimages(targets$FileName,source="agilent.median",green.only= TRUE) >>> # Assume there is a col called FileName in the targets section >>> RG <- backgroundCorrect(RG, method="normexp", offset=16) >>> RGNorm <- normalizeBetweenArrays(RG, method="quantile") >>> RGNorm.ave <- avereps(RGNorm, ID=RGNorm$genes$ProbeName) >>> >>> # Create design >>> Pairing <- >>> paste(rep(c('C1-','C1-','C2-','C2-'),3),c(1,1,2,2,1,1,2,2,1,1,2,2) ,rep(c('C','T'),6),sep='') >>> pair <- factor(Pairing,levels=unique(Pairing)) >>> design <- model.matrix( ~ 0 + pair ) >>> colnames(design) <- c('C1.C','C1.T','C2.C','C2.T') >>> >>> # Fit data >>> fit <- lmFit(RGNorm.ave, design=design) >>> >>> cont.matrix <- makeContrasts(C1 = C1.T-C1.C, C2=C2.T-C2.C, levels=design) >>> fit2 <- contrasts.fit(fit, cont.matrix) >>> fit2 <- eBayes(fit2) >>> >>> >>> For the experiment, is the design/contrast matrix a proper choice to >>> answer the questions of 'Which genes are differentially expressed in >>> Cell 1' and 'Which genes are differentially expressed in Cell 2'? >>> Further, should I do any technical correction, such as >>> duplicateCorrelation or similar? The reason I am asking is that even >>> at p<=0.01 I am getting a very high number of differentially expressed >>> probes (4500ish for Cell 1, and 7500ish for Cell 2, respectively), and >>> I want to make sure this is biological significance, and not some >>> technical aspect I have missed. >>> >>> Thanks in advance. >>> >>> Best, >>> David >>> >>> >>>> sessionInfo() >>> >>> R version 2.14.1 (2011-12-22) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] statmod_1.4.16 limma_3.10.3 >>> >>> loaded via a namespace (and not attached): >>> [1] tcltk_2.14.1 tools_2.14.1 >>> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________________________________________________ > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} ADD REPLYlink written 6.4 years ago by Gordon Smyth37k Answer: limma design and contrast matrix for paired experiment 0 6.4 years ago by Gordon Smyth37k Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia Gordon Smyth37k wrote: Dear David, You have not followed the advice in the edgeR user's guide that "The design matrix will be easier to construct in R if we re-number the patients within each disease group:" limma will still give you correct results, but you will have extra non-estimable coefficients in your linear model. Best wishes Gordon > Date: Mon, 14 Jan 2013 00:14:18 +0100 > From: David Westergaard <david at="" harsk.dk=""> > To: Gordon K Smyth <smyth at="" wehi.edu.au=""> > Cc: Bioconductor mailing list <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] limma design and contrast matrix for paired > experiment > > Dear Gordon, > > Thank you for your reply. Following the example from the edgeR users > guide, I revised the model: > > CellLine <- factor(rep(c('C1','C1','C2','C2'),3),levels=c('C1','C2')) > Sample <- factor(c(1,1,2,2,3,3,4,4,5,5,6,6),levels=c(1,2,3,4,5,6)) > Treatment <- factor(rep(c('C','T'),6),levels=c('C','T')) > > design <- model.matrix( ~ CellLine + CellLine:Sample + CellLine:Treatment ) > > I am however, not, a bit unsure of which coefficients I am interested > in. It seems logical that I should be interested in > 'CellLineC1:TreatmentT' and 'CellLineC2:TreatmentT' to answer the > question of 'Which genes are differentially expressed in CellLine 1 > due to treatment X' and 'Which genes are differentially expressed in > CellLine 2 due to treatment X'. Is that correct? > > Best, > David > > 2013/1/13 Gordon K Smyth <smyth at="" wehi.edu.au="">: >> Dear David, >> >> No, your design matrix is incorrect because it ignores the pairing by sample >> in your experimental design. >> >> You can treat Sample as either a fixed or a random factor. The fixed >> approach is closely analogous to a classical paired t-test. For an >> experiment like yours, the fixed approach is explained in Section 3.5 of the >> edgeR User's Guide: >> >> http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/i nst/doc/edgeRUsersGuide.pdf >> >> You can follow the same construction of the design matrix even though you >> are using limma. >> >> The random approach is a bit more aggressive. For an experiment like yours, >> the random approach is explained in Section 8.7 of the limma User's Guide: >> >> http://www.bioconductor.org/packages/release/bioc/vignettes/limma/i nst/doc/usersguide.pdf >> >> I would probably recommend the first approach for your data. However the >> second approach is necessary if you want to test for differences between the >> two cell lines. >> >> Best wishes >> Gordon >> >>> Date: Fri, 11 Jan 2013 18:31:23 +0100 >>> From: David Westergaard <david at="" harsk.dk=""> >>> To: bioconductor at r-project.org >>> Subject: [BioC] limma design and contrast matrix for paired >>> experiment >>> >>> Hello, >>> >>> I am analysing microarray data performed on two cell cultures, in >>> which the gene expression were measured before (C) and after treatment >>> (T), so that the targets look like this: >>> >>> Cell-line Treatment Sample >>> Cell 1 C 1 >>> Cell 1 T 1 >>> Cell 2 C 2 >>> Cell 2 T 2 >>> Cell 1 C 3 >>> Cell 1 T 3 >>> Cell 2 C 4 >>> Cell 2 T 4 >>> Cell 1 C 5 >>> Cell 1 T 5 >>> Cell 2 C 6 >>> Cell 2 T 6 >>> >>> All experiments were performed on single-channel Agilent arrays, with >>> 4 samples pr. slide. I am interested in determining the differentially >>> expressed genes between Cell1 before and after treatment, as well as >>> Cell2 before and after treatment. This is the preliminary code: >>> >>> # Load and normalize data >>> RG <- >>> read.maimages(targets$FileName,source="agilent.median",green.only= TRUE) >>> # Assume there is a col called FileName in the targets section >>> RG <- backgroundCorrect(RG, method="normexp", offset=16) >>> RGNorm <- normalizeBetweenArrays(RG, method="quantile") >>> RGNorm.ave <- avereps(RGNorm, ID=RGNorm$genes$ProbeName) >>> >>> # Create design >>> Pairing <- >>> paste(rep(c('C1-','C1-','C2-','C2-'),3),c(1,1,2,2,1,1,2,2,1,1,2,2) ,rep(c('C','T'),6),sep='') >>> pair <- factor(Pairing,levels=unique(Pairing)) >>> design <- model.matrix( ~ 0 + pair ) >>> colnames(design) <- c('C1.C','C1.T','C2.C','C2.T') >>> >>> # Fit data >>> fit <- lmFit(RGNorm.ave, design=design) >>> >>> cont.matrix <- makeContrasts(C1 = C1.T-C1.C, C2=C2.T-C2.C, levels=design) >>> fit2 <- contrasts.fit(fit, cont.matrix) >>> fit2 <- eBayes(fit2) >>> >>> >>> For the experiment, is the design/contrast matrix a proper choice to >>> answer the questions of 'Which genes are differentially expressed in >>> Cell 1' and 'Which genes are differentially expressed in Cell 2'? >>> Further, should I do any technical correction, such as >>> duplicateCorrelation or similar? The reason I am asking is that even >>> at p<=0.01 I am getting a very high number of differentially expressed >>> probes (4500ish for Cell 1, and 7500ish for Cell 2, respectively), and >>> I want to make sure this is biological significance, and not some >>> technical aspect I have missed. >>> >>> Thanks in advance. >>> >>> Best, >>> David >>> >>> >>>> sessionInfo() >>> >>> R version 2.14.1 (2011-12-22) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] statmod_1.4.16 limma_3.10.3 >>> >>> loaded via a namespace (and not attached): >>> [1] tcltk_2.14.1 tools_2.14.1 >>> >> ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}