Question

Limma design : What is the "best" method to infer biological replicates during differentially expression analysis

0

Entering edit mode

Santy Marques-Ladeira ▴ 20

@santy-marques-ladeira-6408

Last seen 11.4 years ago

Dear Limma users, I'm currently working on Affymetrix microarray (Hu-gene 2.1) with only one channel. I got 8 experimental stimuli conducted on 7 donors. During the experimental process each stimulus was conducted at the same time for all the donors so I assume there are no technical replicates, but I need to take in account the "donor effect" (or biological replicates). In the "Limma" documentation there are 2 ways of including this "effect" but it is still not clear to me what is the best choice, and the 2 methods gave me 86% of correlation for a pairwise comparison test. *1/The first method I used was including the donors in my design model :* # I build a matrix with 2 columns (Stimulus + Donor) # Each row is an array sample = matrix(0, nrow=55, ncol=2) rownames(sample) <- colnames("normalised data I use to perform the analysis") colnames(sample) <- c("Stimulus", "Donor") # In the column Stimulus there is directly the name of the stimulus tested sample[1:7,1] = "Stim1" sample[8:14,1] = "Stim2" sample[15:20,1] = "Stim3" sample[21:27,1] = "Stim4" sample[28:34,1] = "Stim5" sample[35:41,1] = "Stim6" sample[42:48,1] = "Stim7" sample[49:55,1] = "Stim8" # In the column Donor there is the number of the donor (from 2 to 8) sample[,2] = c(rep(c(2,3,4,5,6,7,8),2),3,4,5,6,7,8,rep(c(2,3,4,5,6,7,8),5)) targets = as.data.frame(sample) Stimulus = factor(targets$Stimulus, levels = c("Stim1", "Stim2", "Stim3", "Stim4", "Stim5", "Stim6", "Stim7", "Stim8")) Donor = factor(targets$Donor, levels = c("2", "3", "4", "5", "6", "7", "8")) # I use the model.matrix() function to infer the design based on my matrix design <- model.matrix(~-1+Stimulus+Donor) rownames(design) <- colnames("normalised data I use to perform the analysis") colnames(design)[1:8] <- c("Stim1", "Stim2", "Stim3", "Stim4", "Stim5", "Stim6", "Stim7", "Stim8") #This gives me : # Stim1 Stim2 Stim3 ... ... ... Donor3 Donor4 ... ... Donor8 #experience1 1 0 0 0 0 0 #experience2 1 0 0 1 0 0 #experience3 1 0 0 0 1 0 #... #... #... First thing is that because I remove for a stimulus the donor2, this donor is removed from the design model (maybe because of dimension for regression ?) Then I do the usual commands : fit <- lmFit("normalised data I use to perform the analysis", design) cont.matrix<-makeContrasts(Stim2vsStim4=Stim2-Stim4, levels=design) fit2<-contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) anadiff.limma.Stim2<-topTable(fit2, number=1000, adjust="BH") *2/The second method I used is by treating donors as replicates separately:* I used the exact same "targets" data.frame, except there is no more the Donor column but only the Stimulus column [...] design <- model.matrix(~-1+Stimulus) [...] # List of donors saved as a variable for blocks biolrep = c(rep(c(2,3,4,5,6,7,8),2),3,4,5,6,7,8,rep(c(2,3,4,5,6,7,8),5)) # I used the duplicateCorrelation() function to take in account the link between donors corfit<-duplicateCorrelation(c031, design, ndups=1, block=biolrep) Then I do the usual commands (incorporating the donor information) : fit<-lmFit(c031, design, block=biolrep, cor=corfit$consensus) cont.matrix<-makeContrasts(Stim2vsStim4=Stim2-Stim4, levels=design) fit2<-contrasts.fit(fit, cont.matrix) fit2<-eBayes(fit2) anadiff.limma.Stim2<-topTable(fit2, number=1000, adjust="BH") When I check the results and compare the list of the 1000 genes given by each method, there are 857 genes in common. The fact that I found the use of these two methods on this site doesn't helps me much about choosing one. Does someone got an idea about what method I should use for biological replicates ? (And by the way if someone got an answer more precise about why Donor2 is excluded, that should be nice) Best regards, Santy [[alternative HTML version deleted]]

Microarray limma PROcess Microarray limma PROcess • 2.6k views

ADD COMMENT • link updated 12.0 years ago by Gordon Smyth 53k • written 12.0 years ago by Santy Marques-Ladeira ▴ 20

score 0 · Answer 1 · 2014-02-20

Dear Santy, The first method is twoway anova, a generalization of a paired analysis. The second method is a random effects approach in which the intra- donor correlation is incorporated into the covariance matrix instead of the linear predictor. Both are good methods. The twoway anova approach makes fewer assumptions but the random effects approach is statistically more powerful, particularly for unbalanced designs. For a balanced design in which all donors receive all stimuli, the twoway anovao approach is virtually as powerful as the random effects approach and hence is preferable. For an unbalanced design in which each donor receives only a subset of the stimula, the random effects approach is more powerful. Your experiment is almost completely balanced -- there is just one missing stimuli for one donor. Hence I would use the twoway anova approach. Best wishes Gordon > Date: Tue, 18 Feb 2014 12:05:41 +0100 > From: Santy Marques-Ladeira <santy.marquesladeira at="" gmail.com=""> > To: bioconductor at r-project.org > Subject: [BioC] Limma design : What is the "best" method to infer > biological replicates during differentially expression analysis > > Dear Limma users, > > > > > I'm currently working on Affymetrix microarray (Hu-gene 2.1) with only one > channel. I got 8 experimental stimuli conducted on 7 donors. During the > experimental process each stimulus was conducted at the same time for all > the donors so I assume there are no technical replicates, but I need to > take in account the "donor effect" (or biological replicates). > > In the "Limma" documentation there are 2 ways of including this "effect" > but it is still not clear to me what is the best choice, and the 2 methods > gave me 86% of correlation for a pairwise comparison test. > > > > *1/The first method I used was including the donors in my design model :* > > # I build a matrix with 2 columns (Stimulus + Donor) > # Each row is an array > sample = matrix(0, nrow=55, ncol=2) > rownames(sample) <- colnames("normalised data I use to perform the > analysis") > colnames(sample) <- c("Stimulus", "Donor") > > # In the column Stimulus there is directly the name of the stimulus tested > sample[1:7,1] = "Stim1" > sample[8:14,1] = "Stim2" > sample[15:20,1] = "Stim3" > sample[21:27,1] = "Stim4" > sample[28:34,1] = "Stim5" > sample[35:41,1] = "Stim6" > sample[42:48,1] = "Stim7" > sample[49:55,1] = "Stim8" > # In the column Donor there is the number of the donor (from 2 to 8) > sample[,2] = c(rep(c(2,3,4,5,6,7,8),2),3,4,5,6,7,8,rep(c(2,3,4,5,6,7,8),5)) > targets = as.data.frame(sample) > > Stimulus = factor(targets$Stimulus, levels = c("Stim1", "Stim2", "Stim3", > "Stim4", "Stim5", "Stim6", "Stim7", "Stim8")) > Donor = factor(targets$Donor, levels = c("2", "3", "4", "5", "6", "7", "8")) > > # I use the model.matrix() function to infer the design based on my matrix > design <- model.matrix(~-1+Stimulus+Donor) > rownames(design) <- colnames("normalised data I use to perform the > analysis") > colnames(design)[1:8] <- c("Stim1", "Stim2", "Stim3", "Stim4", "Stim5", > "Stim6", "Stim7", "Stim8") > > #This gives me : > # Stim1 Stim2 Stim3 ... ... ... Donor3 > Donor4 ... ... Donor8 > #experience1 1 0 0 > 0 0 0 > #experience2 1 0 0 > 1 0 0 > #experience3 1 0 0 > 0 1 0 > #... > #... > #... > > First thing is that because I remove for a stimulus the donor2, this donor > is removed from the design model (maybe because of dimension for regression > ?) > > Then I do the usual commands : > fit <- lmFit("normalised data I use to perform the analysis", design) > cont.matrix<-makeContrasts(Stim2vsStim4=Stim2-Stim4, levels=design) > fit2<-contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2) > anadiff.limma.Stim2<-topTable(fit2, number=1000, adjust="BH") > > > *2/The second method I used is by treating donors as replicates separately:* > > I used the exact same "targets" data.frame, except there is no more the > Donor column but only the Stimulus column > [...] > design <- model.matrix(~-1+Stimulus) > [...] > > # List of donors saved as a variable for blocks > biolrep = c(rep(c(2,3,4,5,6,7,8),2),3,4,5,6,7,8,rep(c(2,3,4,5,6,7,8),5)) > > # I used the duplicateCorrelation() function to take in account the link > between donors > corfit<-duplicateCorrelation(c031, design, ndups=1, block=biolrep) > > Then I do the usual commands (incorporating the donor information) : > fit<-lmFit(c031, design, block=biolrep, cor=corfit$consensus) > cont.matrix<-makeContrasts(Stim2vsStim4=Stim2-Stim4, levels=design) > fit2<-contrasts.fit(fit, cont.matrix) > fit2<-eBayes(fit2) > anadiff.limma.Stim2<-topTable(fit2, number=1000, adjust="BH") > > > > When I check the results and compare the list of the 1000 genes given by > each method, there are 857 genes in common. The fact that I found the use > of these two methods on this site doesn't helps me much about choosing one. > Does someone got an idea about what method I should use for biological > replicates ? (And by the way if someone got an answer more precise about > why Donor2 is excluded, that should be nice) > > > > > Best regards, > Santy ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

score 0 · Answer 2 · 2014-02-20

> Date: Tue, 18 Feb 2014 12:05:41 +0100 > From: Santy Marques-Ladeira <santy.marquesladeira at="" gmail.com=""> > To: bioconductor at r-project.org > Subject: [BioC] Limma design : What is the "best" method to infer > biological replicates during differentially expression analysis > > Dear Limma users, > > I'm currently working on Affymetrix microarray (Hu-gene 2.1) with only one > channel. I got 8 experimental stimuli conducted on 7 donors. During the > experimental process each stimulus was conducted at the same time for all > the donors so I assume there are no technical replicates, but I need to > take in account the "donor effect" (or biological replicates). > > In the "Limma" documentation there are 2 ways of including this "effect" > but it is still not clear to me what is the best choice, and the 2 methods > gave me 86% of correlation for a pairwise comparison test. > > > > *1/The first method I used was including the donors in my design model :* > > # I build a matrix with 2 columns (Stimulus + Donor) > # Each row is an array > sample = matrix(0, nrow=55, ncol=2) > rownames(sample) <- colnames("normalised data I use to perform the > analysis") > colnames(sample) <- c("Stimulus", "Donor") > > # In the column Stimulus there is directly the name of the stimulus tested > sample[1:7,1] = "Stim1" > sample[8:14,1] = "Stim2" > sample[15:20,1] = "Stim3" > sample[21:27,1] = "Stim4" > sample[28:34,1] = "Stim5" > sample[35:41,1] = "Stim6" > sample[42:48,1] = "Stim7" > sample[49:55,1] = "Stim8" > # In the column Donor there is the number of the donor (from 2 to 8) > sample[,2] = c(rep(c(2,3,4,5,6,7,8),2),3,4,5,6,7,8,rep(c(2,3,4,5,6,7,8),5)) > targets = as.data.frame(sample) > > Stimulus = factor(targets$Stimulus, levels = c("Stim1", "Stim2", "Stim3", > "Stim4", "Stim5", "Stim6", "Stim7", "Stim8")) > Donor = factor(targets$Donor, levels = c("2", "3", "4", "5", "6", "7", "8")) > > # I use the model.matrix() function to infer the design based on my matrix > design <- model.matrix(~-1+Stimulus+Donor) > rownames(design) <- colnames("normalised data I use to perform the > analysis") > colnames(design)[1:8] <- c("Stim1", "Stim2", "Stim3", "Stim4", "Stim5", > "Stim6", "Stim7", "Stim8") > > #This gives me : > # Stim1 Stim2 Stim3 ... ... ... Donor3 > Donor4 ... ... Donor8 > #experience1 1 0 0 > 0 0 0 > #experience2 1 0 0 > 1 0 0 > #experience3 1 0 0 > 0 1 0 > #... > #... > #... > > First thing is that because I remove for a stimulus the donor2, this donor > is removed from the design model (maybe because of dimension for regression > ?) The design matrix can't have more columns than 1 + nlevels(Stimulus)-1 + nlevels(Donor)-1 You have 7 donors, but you can only add 6 new columns to the design matirx when you add Donor to the linear model. By default, R removes the first level of the factor and makes the coefficients for the other levels relative to the first level. When you add an new factor to the model, you are estimating differences between the levels of the factor, but you can't estimate the overall mean more than once. Gordon ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}