Dear all,
I am having some conceptual issues with the contrasts generated by limma. From my understanding, contrasts acts as an method for defining comparisons between groups. For example, we have expression data of six samples with two samples being controls ("trt1"), two samples being treated with drug A ("trt2") and the remaining two samples being treated with drug B ("trt3"). Here I would like to compare the following:
- control versus drug A (trt1 vs trt2)
- control versus drug B (trt1 vs trt3)
- drug A versus drug B (trt2 vs trt3) - Not as important, but nonetheless interesting
To this end, I thought of generating contrasts using the makeContrasts() function in limma. However, I thought that defining contrasts was conceptually the same as extracting the expression data of the samples trt1 and trt2, trt1 and trt3 and trt2 and trt3 and generating a linear model using lmFit() for each comparison separately. Could anybody tell me what causes the discrepancy or where I could find documentation that describes the difference?
Sample data:
set.seed(1) #Simulate data Mvals <- matrix(rnorm(n = 36), ncol = 6) Mvals[,3:4] <- Mvals[,3:4] + 5 Mvals[,5:6] <- Mvals[,5:6] + 10 colnames(Mvals) <- c("sam1_trt1", "sam2_trt1", "sam3_trt2", "sam4_trt2", "sam5_trt3", "sam6_trt3") #Factor of interest fac_int <- as.factor(rep(c("trt1", "trt2", "trt3"), each = 2)) #Contrasts using limma design <- model.matrix(~0 + fac_int) colnames(design) <- c("trt1", "trt2", "trt3") lmfit <- lmFit(Mvals, design) cont <- makeContrasts(trt1-trt2, trt1-trt3, trt2-trt3, levels = design) lmfit.cont <- contrasts.fit(lmfit, cont) lmfit.cont.ebayes <- eBayes(lmfit.cont) lmfit.cont.ebayes$p.value #Contrasts by extracting the correct Mvals and design #trt1-trt2 M12 <- Mvals[,which(fac_int %in% c("trt1", "trt2"))] design12 <- model.matrix(~as.factor(rep(c("trt1", "trt2"), each = 2))) lmfit12 <- lmFit(M12, design12) lmfit12.ebayes <- eBayes(lmfit12) lmfit12.ebayes$p.value[,2] #trt1-trt3 M13 <- Mvals[,which(fac_int %in% c("trt1", "trt3"))] design13 <- model.matrix(~as.factor(rep(c("trt1", "trt3"), each = 2))) lmfit13 <- lmFit(M13, design13) lmfit13.ebayes <- eBayes(lmfit13) lmfit13.ebayes$p.value[,2] #trt2-trt3 M23 <- Mvals[,which(fac_int %in% c("trt2", "trt3"))] design23 <- model.matrix(~as.factor(rep(c("trt2", "trt3"), each = 2))) lmfit23 <- lmFit(M23, design23) lmfit23.ebayes <- eBayes(lmfit23) lmfit23.ebayes$p.value[,2] #Compare the contrast with the extraction lmfit.cont.ebayes$p.value # trt1 - trt2 trt1 - trt3 trt2 - trt3 # [1,] 0.0001299617 5.130891e-09 3.057964e-05 # [2,] 0.0025873059 5.503494e-08 4.186192e-05 # [3,] 0.0000182689 1.659910e-08 1.207367e-03 # [4,] 0.0003252686 2.228194e-07 1.930707e-03 # [5,] 0.0011712039 4.689758e-07 1.342449e-03 # [6,] 0.0003469217 1.733174e-08 6.331602e-05 data.frame(trt1_trt2 = lmfit12.ebayes$p.value[,2], trt1_trt3 = lmfit13.ebayes$p.value[,2], trt2_trt3 = lmfit23.ebayes$p.value[,2]) # trt1_trt2 trt1_trt3 trt2_trt3 #1 0.0019729784 3.487858e-08 7.756619e-05 #2 0.0149031837 1.962147e-07 1.040529e-04 #3 0.0005273948 8.156824e-08 1.523928e-03 #4 0.0036619967 5.550043e-07 2.320744e-03 #5 0.0087060279 9.729137e-07 1.674585e-03 #6 0.0038247856 8.417320e-08 1.459670e-04 sessionInfo() # R version 3.2.2 (2015-08-14) # Platform: x86_64-apple-darwin13.4.0 (64-bit) # Running under: OS X 10.11 (El Capitan) # # locale: # [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 # # attached base packages: # [1] stats graphics grDevices utils datasets methods base # # other attached packages: # [1] limma_3.24.15 BayesFactor_0.9.12-2 Matrix_1.2-2 coda_0.17-1 # # loaded via a namespace (and not attached): # [1] magrittr_1.5 tools_3.2.2 pbapply_1.1-1 MatrixModels_0.4-1 Rcpp_0.12.1 # [6] mvtnorm_1.0-3 stringi_0.5-5 grid_3.2.2 stringr_1.0.0 gtools_3.5.0 # [11] lattice_0.20-33
Thank you for your answer. I am still trying to wrap my head around this concept. I am still struggling with the idea of why "trt3" influences the estimated variance of "trt1" with "trt2"? Is this similar to the question described in the link below?
https://stat.ethz.ch/pipermail/bioconductor/2012-August/047785.html
I am relatively new to statistics, which might explain my seeming ignorance with regard to this issue. Under what circumstance could you imagine that manually partitioning the data would be more favorable to generating contrasts?
The variance for each gene is not estimated for one group relative to another. Rather, it is estimated from the entire design, using information from all groups. No matter what DE comparisons you perform, you'll be using the same variance estimate. So, the behaviour of
trt3
will affect a DE comparison between the other two groups.As for your other question; the most obvious motivation for partitioning would be if some groups had unequal variances, such that you wouldn't want to analyze them together for fear of getting an inappropriate variance estimate. However, this effect can probably be ignored - see A: Correct assumptions of using limma moderated t-test for a good explanation. There may well be other scenarios where partitioning is appropriate, but yours is not one of them.