Search
Question: limma multiple groups comparison produces different pvalue comparing with two group comparsion
0
6.8 years ago by
Shao, Chunxuan20 wrote:
Hi everyone: I am trying limma for multiple group comparison, and found that the pvalue of pairwise comparison is different from the two group comparsion. A simple example: ## multiple groups sd <- 0.3*sqrt(4/rchisq(100,df=4)) y2 <- matrix(rnorm(100*9,sd=sd),100,9) rownames(y2) <- paste("Gene",1:100) y2[1:2,4:6] <- y2[1:2,4:6] + 2 y2[1:2,7:9] <- y2[1:2,7:9] + 2 f <- factor(c(rep("A",3),rep("B",3),rep("C",3)),levels=c("A","B","C")) design <- model.matrix(~0+f) colnames(design) <- c("A","B","C") fit2 <- lmFit(y2,design) contrast.matrix <- makeContrasts("B-A", "C-A", "C-B", levels=design) fit2 <- contrasts.fit(fit2, contrast.matrix) fit2 <- eBayes(fit2) topTable(fit2, adjust="BH",coef=1) ID logFC t P.Value adj.P.Val B 1 Gene 1 1.8961975 8.697056 1.038672e-05 0.001038672 3.9011567 2 Gene 2 2.3140816 6.104506 1.690945e-04 0.008454727 0.9859443 96 Gene 96 -0.8959120 -4.039216 2.855951e-03 0.095198377 -1.9787088 49 Gene 49 -0.4983110 -2.776108 2.128499e-02 0.462011867 -4.0378662 86 Gene 86 0.7288518 2.726366 2.310059e-02 0.462011867 -4.1198456 74 Gene 74 -0.8456774 -2.367919 4.171492e-02 0.695248750 -4.7045985 50 Gene 50 -1.5597167 -2.107329 6.396187e-02 0.783183079 -5.1177738 46 Gene 46 -0.6530683 -2.028684 7.269084e-02 0.783183079 -5.2394326 8 Gene 8 -0.3348446 -1.986208 7.786841e-02 0.783183079 -5.3044289 89 Gene 89 -0.4574447 -1.942438 8.356956e-02 0.783183079 -5.3708387 ## two groups y <- y2[,1:6] design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1)) options(digit=3) fit <- lmFit(y,design) fit <- eBayes(fit) topTable(fit, adjust="BH",coef= "Grp2vs1") ID logFC t P.Value adj.P.Val B 1 Gene 1 1.8961975 7.962269 0.0001316933 0.01316933 1.6748248 2 Gene 2 2.3140816 6.190071 0.0005783671 0.02891836 0.1114107 96 Gene 96 -0.8959120 -4.124112 0.0050946000 0.16982000 -2.2231052 64 Gene 64 -0.6611682 -3.824833 0.0073401109 0.18350277 -2.6140362 86 Gene 86 0.7288518 3.503703 0.0110265957 0.22053191 -3.0478852 49 Gene 49 -0.4983110 -2.922543 0.0239288895 0.39881483 -3.8654279 74 Gene 74 -0.8456774 -2.411705 0.0489411131 0.69915876 -4.6045844 89 Gene 89 -0.4574447 -2.282469 0.0588626872 0.73578359 -4.7916531 50 Gene 50 -1.5597167 -2.064728 0.0804679751 0.79599004 -5.1040241 46 Gene 46 -0.6530683 -2.020227 0.0857867706 0.79599004 -5.1671760 In this case, in the multiple group p_value is much smaller than the two group. Any suggestion to this ? Which is the appropriate way to do if I want to do the pairwise comparison among three groups? Thanks in advance. -- Chunxuan
modified 6.8 years ago by James W. MacDonald48k • written 6.8 years ago by Shao, Chunxuan20
0
6.8 years ago by
United States
James W. MacDonald48k wrote:
Hi Chunxuan, On 3/13/2012 4:53 AM, Shao, Chunxuan wrote: > Hi everyone: > > I am trying limma for multiple group comparison, and found that the pvalue of pairwise comparison is different from the two group comparsion. > > A simple example: > > ## multiple groups > sd<- 0.3*sqrt(4/rchisq(100,df=4)) > y2<- matrix(rnorm(100*9,sd=sd),100,9) > rownames(y2)<- paste("Gene",1:100) > y2[1:2,4:6]<- y2[1:2,4:6] + 2 > y2[1:2,7:9]<- y2[1:2,7:9] + 2 > f<- factor(c(rep("A",3),rep("B",3),rep("C",3)),levels=c("A","B","C")) > design<- model.matrix(~0+f) > colnames(design)<- c("A","B","C") > fit2<- lmFit(y2,design) > contrast.matrix<- makeContrasts("B-A", "C-A", "C-B", levels=design) > fit2<- contrasts.fit(fit2, contrast.matrix) > fit2<- eBayes(fit2) > topTable(fit2, adjust="BH",coef=1) > > ID logFC t P.Value adj.P.Val B > 1 Gene 1 1.8961975 8.697056 1.038672e-05 0.001038672 3.9011567 > 2 Gene 2 2.3140816 6.104506 1.690945e-04 0.008454727 0.9859443 > 96 Gene 96 -0.8959120 -4.039216 2.855951e-03 0.095198377 -1.9787088 > 49 Gene 49 -0.4983110 -2.776108 2.128499e-02 0.462011867 -4.0378662 > 86 Gene 86 0.7288518 2.726366 2.310059e-02 0.462011867 -4.1198456 > 74 Gene 74 -0.8456774 -2.367919 4.171492e-02 0.695248750 -4.7045985 > 50 Gene 50 -1.5597167 -2.107329 6.396187e-02 0.783183079 -5.1177738 > 46 Gene 46 -0.6530683 -2.028684 7.269084e-02 0.783183079 -5.2394326 > 8 Gene 8 -0.3348446 -1.986208 7.786841e-02 0.783183079 -5.3044289 > 89 Gene 89 -0.4574447 -1.942438 8.356956e-02 0.783183079 -5.3708387 > > ## two groups > y<- y2[,1:6] > design<- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1)) > options(digit=3) > fit<- lmFit(y,design) > fit<- eBayes(fit) > topTable(fit, adjust="BH",coef= "Grp2vs1") > > ID logFC t P.Value adj.P.Val B > 1 Gene 1 1.8961975 7.962269 0.0001316933 0.01316933 1.6748248 > 2 Gene 2 2.3140816 6.190071 0.0005783671 0.02891836 0.1114107 > 96 Gene 96 -0.8959120 -4.124112 0.0050946000 0.16982000 -2.2231052 > 64 Gene 64 -0.6611682 -3.824833 0.0073401109 0.18350277 -2.6140362 > 86 Gene 86 0.7288518 3.503703 0.0110265957 0.22053191 -3.0478852 > 49 Gene 49 -0.4983110 -2.922543 0.0239288895 0.39881483 -3.8654279 > 74 Gene 74 -0.8456774 -2.411705 0.0489411131 0.69915876 -4.6045844 > 89 Gene 89 -0.4574447 -2.282469 0.0588626872 0.73578359 -4.7916531 > 50 Gene 50 -1.5597167 -2.064728 0.0804679751 0.79599004 -5.1040241 > 46 Gene 46 -0.6530683 -2.020227 0.0857867706 0.79599004 -5.1671760 > > > In this case, in the multiple group p_value is much smaller than the two group. > Any suggestion to this ? Which is the appropriate way to do if I want to do the pairwise comparison among three groups? Technically either way is appropriate, but the first case will be more powerful. In the second case you are doing a conventional t-test, where the denominator is the standard error of the mean (SEM), computed from the two groups under consideration. In the first case you are fitting a contrast in the context of a larger linear model, in which case the denominator is the sums of squares of error (SSE), which is an estimate of the intra-group variability over all groups. Both the SEM and SSE are measuring the same thing, which is the within group variability. Measures of variability tend to be inefficient statistics, which means that the accuracy of the statistic is highly dependent on the number of observations (and they can be quite inaccurate with few observations). This is the underlying premise of the eBayes step in limma, which uses the variability estimate over all genes (which will be pretty accurate) to adjust the variability estimate of an individual gene (which may not be accurate, especially with few samples). In your case, the SSE computed over three groups will tend to be a more accurate (and generally smaller) measure of the within group variability than the SEM computed over two groups. Since the numerator of your statistic is the same in both cases, computing a smaller denominator will increase the t-statistic, which will result in smaller p-values, and hence more power to detect differences. Best, Jim > > Thanks in advance. > > -- > Chunxuan > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099