limma multiple groups comparison produces different pvalue comparing with two group comparsion

0

Entering edit mode

Shao, Chunxuan ▴ 20

@shao-chunxuan-5160

Last seen 17 months ago

Germany

Hi everyone: I am trying limma for multiple group comparison, and found that the pvalue of pairwise comparison is different from the two group comparsion. A simple example: ## multiple groups sd <- 0.3*sqrt(4/rchisq(100,df=4)) y2 <- matrix(rnorm(100*9,sd=sd),100,9) rownames(y2) <- paste("Gene",1:100) y2[1:2,4:6] <- y2[1:2,4:6] + 2 y2[1:2,7:9] <- y2[1:2,7:9] + 2 f <- factor(c(rep("A",3),rep("B",3),rep("C",3)),levels=c("A","B","C")) design <- model.matrix(~0+f) colnames(design) <- c("A","B","C") fit2 <- lmFit(y2,design) contrast.matrix <- makeContrasts("B-A", "C-A", "C-B", levels=design) fit2 <- contrasts.fit(fit2, contrast.matrix) fit2 <- eBayes(fit2) topTable(fit2, adjust="BH",coef=1) ID logFC t P.Value adj.P.Val B 1 Gene 1 1.8961975 8.697056 1.038672e-05 0.001038672 3.9011567 2 Gene 2 2.3140816 6.104506 1.690945e-04 0.008454727 0.9859443 96 Gene 96 -0.8959120 -4.039216 2.855951e-03 0.095198377 -1.9787088 49 Gene 49 -0.4983110 -2.776108 2.128499e-02 0.462011867 -4.0378662 86 Gene 86 0.7288518 2.726366 2.310059e-02 0.462011867 -4.1198456 74 Gene 74 -0.8456774 -2.367919 4.171492e-02 0.695248750 -4.7045985 50 Gene 50 -1.5597167 -2.107329 6.396187e-02 0.783183079 -5.1177738 46 Gene 46 -0.6530683 -2.028684 7.269084e-02 0.783183079 -5.2394326 8 Gene 8 -0.3348446 -1.986208 7.786841e-02 0.783183079 -5.3044289 89 Gene 89 -0.4574447 -1.942438 8.356956e-02 0.783183079 -5.3708387 ## two groups y <- y2[,1:6] design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1)) options(digit=3) fit <- lmFit(y,design) fit <- eBayes(fit) topTable(fit, adjust="BH",coef= "Grp2vs1") ID logFC t P.Value adj.P.Val B 1 Gene 1 1.8961975 7.962269 0.0001316933 0.01316933 1.6748248 2 Gene 2 2.3140816 6.190071 0.0005783671 0.02891836 0.1114107 96 Gene 96 -0.8959120 -4.124112 0.0050946000 0.16982000 -2.2231052 64 Gene 64 -0.6611682 -3.824833 0.0073401109 0.18350277 -2.6140362 86 Gene 86 0.7288518 3.503703 0.0110265957 0.22053191 -3.0478852 49 Gene 49 -0.4983110 -2.922543 0.0239288895 0.39881483 -3.8654279 74 Gene 74 -0.8456774 -2.411705 0.0489411131 0.69915876 -4.6045844 89 Gene 89 -0.4574447 -2.282469 0.0588626872 0.73578359 -4.7916531 50 Gene 50 -1.5597167 -2.064728 0.0804679751 0.79599004 -5.1040241 46 Gene 46 -0.6530683 -2.020227 0.0857867706 0.79599004 -5.1671760 In this case, in the multiple group p_value is much smaller than the two group. Any suggestion to this ? Which is the appropriate way to do if I want to do the pairwise comparison among three groups? Thanks in advance. -- Chunxuan

limma limma • 5.3k views

ADD COMMENT • link updated 12.1 years ago by James W. MacDonald 65k • written 12.1 years ago by Shao, Chunxuan ▴ 20

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 minute ago

United States

Hi Chunxuan, On 3/13/2012 4:53 AM, Shao, Chunxuan wrote: > Hi everyone: > > I am trying limma for multiple group comparison, and found that the pvalue of pairwise comparison is different from the two group comparsion. > > A simple example: > > ## multiple groups > sd<- 0.3*sqrt(4/rchisq(100,df=4)) > y2<- matrix(rnorm(100*9,sd=sd),100,9) > rownames(y2)<- paste("Gene",1:100) > y2[1:2,4:6]<- y2[1:2,4:6] + 2 > y2[1:2,7:9]<- y2[1:2,7:9] + 2 > f<- factor(c(rep("A",3),rep("B",3),rep("C",3)),levels=c("A","B","C")) > design<- model.matrix(~0+f) > colnames(design)<- c("A","B","C") > fit2<- lmFit(y2,design) > contrast.matrix<- makeContrasts("B-A", "C-A", "C-B", levels=design) > fit2<- contrasts.fit(fit2, contrast.matrix) > fit2<- eBayes(fit2) > topTable(fit2, adjust="BH",coef=1) > > ID logFC t P.Value adj.P.Val B > 1 Gene 1 1.8961975 8.697056 1.038672e-05 0.001038672 3.9011567 > 2 Gene 2 2.3140816 6.104506 1.690945e-04 0.008454727 0.9859443 > 96 Gene 96 -0.8959120 -4.039216 2.855951e-03 0.095198377 -1.9787088 > 49 Gene 49 -0.4983110 -2.776108 2.128499e-02 0.462011867 -4.0378662 > 86 Gene 86 0.7288518 2.726366 2.310059e-02 0.462011867 -4.1198456 > 74 Gene 74 -0.8456774 -2.367919 4.171492e-02 0.695248750 -4.7045985 > 50 Gene 50 -1.5597167 -2.107329 6.396187e-02 0.783183079 -5.1177738 > 46 Gene 46 -0.6530683 -2.028684 7.269084e-02 0.783183079 -5.2394326 > 8 Gene 8 -0.3348446 -1.986208 7.786841e-02 0.783183079 -5.3044289 > 89 Gene 89 -0.4574447 -1.942438 8.356956e-02 0.783183079 -5.3708387 > > ## two groups > y<- y2[,1:6] > design<- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1)) > options(digit=3) > fit<- lmFit(y,design) > fit<- eBayes(fit) > topTable(fit, adjust="BH",coef= "Grp2vs1") > > ID logFC t P.Value adj.P.Val B > 1 Gene 1 1.8961975 7.962269 0.0001316933 0.01316933 1.6748248 > 2 Gene 2 2.3140816 6.190071 0.0005783671 0.02891836 0.1114107 > 96 Gene 96 -0.8959120 -4.124112 0.0050946000 0.16982000 -2.2231052 > 64 Gene 64 -0.6611682 -3.824833 0.0073401109 0.18350277 -2.6140362 > 86 Gene 86 0.7288518 3.503703 0.0110265957 0.22053191 -3.0478852 > 49 Gene 49 -0.4983110 -2.922543 0.0239288895 0.39881483 -3.8654279 > 74 Gene 74 -0.8456774 -2.411705 0.0489411131 0.69915876 -4.6045844 > 89 Gene 89 -0.4574447 -2.282469 0.0588626872 0.73578359 -4.7916531 > 50 Gene 50 -1.5597167 -2.064728 0.0804679751 0.79599004 -5.1040241 > 46 Gene 46 -0.6530683 -2.020227 0.0857867706 0.79599004 -5.1671760 > > > In this case, in the multiple group p_value is much smaller than the two group. > Any suggestion to this ? Which is the appropriate way to do if I want to do the pairwise comparison among three groups? Technically either way is appropriate, but the first case will be more powerful. In the second case you are doing a conventional t-test, where the denominator is the standard error of the mean (SEM), computed from the two groups under consideration. In the first case you are fitting a contrast in the context of a larger linear model, in which case the denominator is the sums of squares of error (SSE), which is an estimate of the intra-group variability over all groups. Both the SEM and SSE are measuring the same thing, which is the within group variability. Measures of variability tend to be inefficient statistics, which means that the accuracy of the statistic is highly dependent on the number of observations (and they can be quite inaccurate with few observations). This is the underlying premise of the eBayes step in limma, which uses the variability estimate over all genes (which will be pretty accurate) to adjust the variability estimate of an individual gene (which may not be accurate, especially with few samples). In your case, the SSE computed over three groups will tend to be a more accurate (and generally smaller) measure of the within group variability than the SEM computed over two groups. Since the numerator of your statistic is the same in both cases, computing a smaller denominator will increase the t-statistic, which will result in smaller p-values, and hence more power to detect differences. Best, Jim > > Thanks in advance. > > -- > Chunxuan > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 12.1 years ago James W. MacDonald 65k

0

Entering edit mode

Hi Jim, Thanks very much for the thorough explanation. The pvalue is also smaller in real data, indicating that multiple group Comparison is more reasonable in general. Best, -- Chunxuan On 3/14/12 2:12 PM, "James W. MacDonald" <jmacdon at="" uw.edu=""> wrote: >Hi Chunxuan, > > >On 3/13/2012 4:53 AM, Shao, Chunxuan wrote: >> Hi everyone: >> >> I am trying limma for multiple group comparison, and found that the >>pvalue of pairwise comparison is different from the two group comparsion. >> >> A simple example: >> >> ## multiple groups >> sd<- 0.3*sqrt(4/rchisq(100,df=4)) >> y2<- matrix(rnorm(100*9,sd=sd),100,9) >> rownames(y2)<- paste("Gene",1:100) >> y2[1:2,4:6]<- y2[1:2,4:6] + 2 >> y2[1:2,7:9]<- y2[1:2,7:9] + 2 >> f<- factor(c(rep("A",3),rep("B",3),rep("C",3)),levels=c("A","B","C")) >> design<- model.matrix(~0+f) >> colnames(design)<- c("A","B","C") >> fit2<- lmFit(y2,design) >> contrast.matrix<- makeContrasts("B-A", "C-A", "C-B", levels=design) >> fit2<- contrasts.fit(fit2, contrast.matrix) >> fit2<- eBayes(fit2) >> topTable(fit2, adjust="BH",coef=1) >> >> ID logFC t P.Value adj.P.Val B >> 1 Gene 1 1.8961975 8.697056 1.038672e-05 0.001038672 3.9011567 >> 2 Gene 2 2.3140816 6.104506 1.690945e-04 0.008454727 0.9859443 >> 96 Gene 96 -0.8959120 -4.039216 2.855951e-03 0.095198377 -1.9787088 >> 49 Gene 49 -0.4983110 -2.776108 2.128499e-02 0.462011867 -4.0378662 >> 86 Gene 86 0.7288518 2.726366 2.310059e-02 0.462011867 -4.1198456 >> 74 Gene 74 -0.8456774 -2.367919 4.171492e-02 0.695248750 -4.7045985 >> 50 Gene 50 -1.5597167 -2.107329 6.396187e-02 0.783183079 -5.1177738 >> 46 Gene 46 -0.6530683 -2.028684 7.269084e-02 0.783183079 -5.2394326 >> 8 Gene 8 -0.3348446 -1.986208 7.786841e-02 0.783183079 -5.3044289 >> 89 Gene 89 -0.4574447 -1.942438 8.356956e-02 0.783183079 -5.3708387 >> >> ## two groups >> y<- y2[,1:6] >> design<- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1)) >> options(digit=3) >> fit<- lmFit(y,design) >> fit<- eBayes(fit) >> topTable(fit, adjust="BH",coef= "Grp2vs1") >> >> ID logFC t P.Value adj.P.Val B >> 1 Gene 1 1.8961975 7.962269 0.0001316933 0.01316933 1.6748248 >> 2 Gene 2 2.3140816 6.190071 0.0005783671 0.02891836 0.1114107 >> 96 Gene 96 -0.8959120 -4.124112 0.0050946000 0.16982000 -2.2231052 >> 64 Gene 64 -0.6611682 -3.824833 0.0073401109 0.18350277 -2.6140362 >> 86 Gene 86 0.7288518 3.503703 0.0110265957 0.22053191 -3.0478852 >> 49 Gene 49 -0.4983110 -2.922543 0.0239288895 0.39881483 -3.8654279 >> 74 Gene 74 -0.8456774 -2.411705 0.0489411131 0.69915876 -4.6045844 >> 89 Gene 89 -0.4574447 -2.282469 0.0588626872 0.73578359 -4.7916531 >> 50 Gene 50 -1.5597167 -2.064728 0.0804679751 0.79599004 -5.1040241 >> 46 Gene 46 -0.6530683 -2.020227 0.0857867706 0.79599004 -5.1671760 >> >> >> In this case, in the multiple group p_value is much smaller than the >>two group. >> Any suggestion to this ? Which is the appropriate way to do if I want >>to do the pairwise comparison among three groups? > >Technically either way is appropriate, but the first case will be more >powerful. In the second case you are doing a conventional t-test, where >the denominator is the standard error of the mean (SEM), computed from >the two groups under consideration. In the first case you are fitting a >contrast in the context of a larger linear model, in which case the >denominator is the sums of squares of error (SSE), which is an estimate >of the intra-group variability over all groups. > >Both the SEM and SSE are measuring the same thing, which is the within >group variability. Measures of variability tend to be inefficient >statistics, which means that the accuracy of the statistic is highly >dependent on the number of observations (and they can be quite >inaccurate with few observations). This is the underlying premise of the >eBayes step in limma, which uses the variability estimate over all genes >(which will be pretty accurate) to adjust the variability estimate of an >individual gene (which may not be accurate, especially with few samples). > >In your case, the SSE computed over three groups will tend to be a more >accurate (and generally smaller) measure of the within group variability >than the SEM computed over two groups. Since the numerator of your >statistic is the same in both cases, computing a smaller denominator >will increase the t-statistic, which will result in smaller p-values, >and hence more power to detect differences. > >Best, > >Jim > > >> >> Thanks in advance. >> >> -- >> Chunxuan >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >>http://news.gmane.org/gmane.science.biology.informatics.conductor > >-- >James W. MacDonald, M.S. >Biostatistician >University of Washington >Environmental and Occupational Health Sciences >4225 Roosevelt Way NE, # 100 >Seattle WA 98105-6099 >

ADD REPLY • link 12.1 years ago Shao, Chunxuan ▴ 20

Login before adding your answer.