Question: Understanding contrasts limma
3
3.9 years ago by
Andy9160
Netherlands
Andy9160 wrote:

Dear all,

I am having some conceptual issues with the contrasts generated by limma. From my understanding, contrasts acts as an method for defining comparisons between groups. For example, we have expression data of six samples with two samples being controls ("trt1"), two samples being treated with drug A ("trt2") and the remaining two samples being treated with drug B ("trt3"). Here I would like to compare the following:

• control versus drug A (trt1 vs trt2)
• control versus drug B (trt1 vs trt3)
• drug A versus drug B (trt2 vs trt3) - Not as important, but nonetheless interesting

To this end, I thought of generating contrasts using the makeContrasts() function in limma. However, I thought that defining contrasts was conceptually the same as extracting the expression data of the samples trt1 and trt2, trt1 and trt3 and trt2 and trt3 and generating a linear model using lmFit() for each comparison separately. Could anybody tell me what causes the discrepancy or where I could find documentation that describes the difference?

Sample data:

set.seed(1)

#Simulate data
Mvals <- matrix(rnorm(n = 36), ncol = 6)
Mvals[,3:4] <- Mvals[,3:4] + 5
Mvals[,5:6] <- Mvals[,5:6] + 10
colnames(Mvals) <- c("sam1_trt1", "sam2_trt1", "sam3_trt2", "sam4_trt2", "sam5_trt3", "sam6_trt3")

#Factor of interest
fac_int <- as.factor(rep(c("trt1", "trt2", "trt3"), each = 2))

#Contrasts using limma
design <- model.matrix(~0 + fac_int)
colnames(design) <- c("trt1", "trt2", "trt3")
lmfit <- lmFit(Mvals, design)
cont <- makeContrasts(trt1-trt2, trt1-trt3, trt2-trt3, levels = design)
lmfit.cont <- contrasts.fit(lmfit, cont)
lmfit.cont.ebayes <- eBayes(lmfit.cont)
lmfit.cont.ebayes$p.value #Contrasts by extracting the correct Mvals and design #trt1-trt2 M12 <- Mvals[,which(fac_int %in% c("trt1", "trt2"))] design12 <- model.matrix(~as.factor(rep(c("trt1", "trt2"), each = 2))) lmfit12 <- lmFit(M12, design12) lmfit12.ebayes <- eBayes(lmfit12) lmfit12.ebayes$p.value[,2]

#trt1-trt3
M13 <- Mvals[,which(fac_int %in% c("trt1", "trt3"))]
design13 <- model.matrix(~as.factor(rep(c("trt1", "trt3"), each = 2)))
lmfit13 <- lmFit(M13, design13)
lmfit13.ebayes <- eBayes(lmfit13)
lmfit13.ebayes$p.value[,2] #trt2-trt3 M23 <- Mvals[,which(fac_int %in% c("trt2", "trt3"))] design23 <- model.matrix(~as.factor(rep(c("trt2", "trt3"), each = 2))) lmfit23 <- lmFit(M23, design23) lmfit23.ebayes <- eBayes(lmfit23) lmfit23.ebayes$p.value[,2]

#Compare the contrast with the extraction
lmfit.cont.ebayes$p.value # trt1 - trt2 trt1 - trt3 trt2 - trt3 # [1,] 0.0001299617 5.130891e-09 3.057964e-05 # [2,] 0.0025873059 5.503494e-08 4.186192e-05 # [3,] 0.0000182689 1.659910e-08 1.207367e-03 # [4,] 0.0003252686 2.228194e-07 1.930707e-03 # [5,] 0.0011712039 4.689758e-07 1.342449e-03 # [6,] 0.0003469217 1.733174e-08 6.331602e-05 data.frame(trt1_trt2 = lmfit12.ebayes$p.value[,2], trt1_trt3 = lmfit13.ebayes$p.value[,2], trt2_trt3 = lmfit23.ebayes$p.value[,2])

#     trt1_trt2    trt1_trt3    trt2_trt3
#1 0.0019729784 3.487858e-08 7.756619e-05
#2 0.0149031837 1.962147e-07 1.040529e-04
#3 0.0005273948 8.156824e-08 1.523928e-03
#4 0.0036619967 5.550043e-07 2.320744e-03
#5 0.0087060279 9.729137e-07 1.674585e-03
#6 0.0038247856 8.417320e-08 1.459670e-04

sessionInfo()
# R version 3.2.2 (2015-08-14)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# Running under: OS X 10.11 (El Capitan)
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base
#
# other attached packages:
# [1] limma_3.24.15        BayesFactor_0.9.12-2 Matrix_1.2-2         coda_0.17-1
#
# loaded via a namespace (and not attached):
#  [1] magrittr_1.5       tools_3.2.2        pbapply_1.1-1      MatrixModels_0.4-1 Rcpp_0.12.1
#  [6] mvtnorm_1.0-3      stringi_0.5-5      grid_3.2.2         stringr_1.0.0      gtools_3.5.0
# [11] lattice_0.20-33
modified 3.9 years ago by Aaron Lun24k • written 3.9 years ago by Andy9160
8
3.9 years ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

You're forgetting that all samples are used to estimate the variance of each gene when you use the full model in design. When you perform the contrasts between any two groups, the samples in the remaining group will still contribute to the calculations by affecting the estimated variance (and thus, t-statistics and p-values). However, once you subset the samples to perform each pairwise contrast, the samples in the remaining group will not be used. This results in a different variance estimate, t-statistic, p-value, etc. for each gene.

In general, it is recommended to use all samples and then makeContrasts to specify the comparisons, rather than manually partitioning the data set. Most sequencing data sets have very few residual d.f. for variance estimation, so it's important to use as many samples as possible to get reliable inferences.

Thank you for your answer. I am still trying to wrap my head around this concept. I am still struggling with the idea of why "trt3"  influences the estimated variance of "trt1" with "trt2"? Is this similar to the question described in the link below?

https://stat.ethz.ch/pipermail/bioconductor/2012-August/047785.html

I am relatively new to statistics, which might explain my seeming ignorance with regard to this issue. Under what circumstance could you imagine that manually partitioning the data would be more favorable to generating contrasts?

2

The variance for each gene is not estimated for one group relative to another. Rather, it is estimated from the entire design, using information from all groups. No matter what DE comparisons you perform, you'll be using the same variance estimate. So, the behaviour of trt3 will affect a DE comparison between the other two groups.

As for your other question; the most obvious motivation for partitioning would be if some groups had unequal variances, such that you wouldn't want to analyze them together for fear of getting an inappropriate variance estimate. However, this effect can probably be ignored -  see A: Correct assumptions of using limma moderated t-test for a good explanation. There may well be other scenarios where partitioning is appropriate, but yours is not one of them.