I've run into a problem with limma where I get different results (different p-values and # of DE genes) depending on whether or not I include an intercept term in my model. This only happens when I include one or more QC metric as a term in the model, and not when my contrast variable is the only term in the model.
For context, I'm trying to find DE genes between stimulated and un-stimulated cells while correcting for a major batch effect (here called "Date") as well as a quality metric that contributes a large amount of variability to the data ("Med_CV_COV").
These are the two models in question:
> Design_noInt <- model.matrix(~0 + annotation_data$`IL-6 Stim (ng/mL)` + annotation_data$`Received Date` + metrics_data$MEDIAN_CV_COVERAGE) > Design_withInt <- model.matrix(~annotation_data$`IL-6 Stim (ng/mL)` + annotation_data$`Received Date` + metrics_data$MEDIAN_CV_COVERAGE)> colnames(Design_noInt) <- c("NoStim","Stim","Date","Med_CV_COV") > colnames(Design_withInt) <- c("Intercept","Stim","Date","Med_CV_COV")> head(Design_noInt) NoStim Stim Date Med_CV_COV 1 1 0 0 0.414025 2 0 1 1 0.509716 3 1 0 0 0.432453 4 0 1 0 0.437015 5 0 1 0 0.429761 6 0 1 1 0.430351 > head(Design_withInt) Intercept Stim Date Med_CV_COV 1 1 0 0 0.414025 2 1 1 1 0.509716 3 1 0 0 0.432453 4 1 1 0 0.437015 5 1 1 0 0.429761 6 1 1 1 0.430351
I use makeContrasts to build the contrast matrices:
> contrast_matrix_noInt <- makeContrasts(Stim-NoStim,levels=Design_noInt)
> contrast_matrix_withInt <- makeContrasts(Stim,levels=Design_withInt)
> head(contrast_matrix_noInt)
Contrasts
Levels Stim - NoStim
NoStim -1
Stim 1
Date 0
Med_CV_COV 0
> head(contrast_matrix_withInt)
Contrasts
Levels Stim
Intercept 0
Stim 1
Date 0
Med_CV_COV 0
The issue appears to begin when I run voomWithQualityWeights. The output values from each model are different starting 7 or more decimal places out. (I also see this difference in output values if I run voom instead of voomWithQualityWeights.)
> DesignVoom_withInt <- voomWithQualityWeights(counts_data, design = Design_withInt,plot = FALSE)
> DesignVoom_noInt <- voomWithQualityWeights(counts_data, design = Design_noInt,plot = FALSE)
> head(DesignVoom_noInt$targets)
group lib.size norm.factors sample.weights
lib25261 1 1841838.883137691 1.030479765643349 0.656581739063742
lib25227 1 699929.217502813 0.854009889095896 1.165866208133710
lib25201 1 1332381.279544953 1.005668388291176 0.839094886216758
lib25099 1 1112919.660542886 0.974510541975693 1.468468932781500
lib25107 1 1315157.108746946 0.993440363239193 1.264011815526400
lib25135 1 1089059.073096440 0.902813970098498 0.741124578697799
> head(DesignVoom_withInt$targets)
group lib.size norm.factors sample.weights
lib25261 1 1841838.883137691 1.030479765643349 0.656581739063740
lib25227 1 699929.217502813 0.854009889095896 1.165866208133712
lib25201 1 1332381.279544953 1.005668388291176 0.839094886216759
lib25099 1 1112919.660542886 0.974510541975693 1.468468932781501
lib25107 1 1315157.108746946 0.993440363239193 1.264011815526400
lib25135 1 1089059.073096440 0.902813970098498 0.741124578697799
The very small differences between the models are also apparent in the coefficients from lmFit. If I don't include the model matrix in my voom or voomWithQualityWeights call, my normalized counts are identical for both models, but I still get different coefficients from lmFit. So, whatever is causing the difference is happening in both functions.
> dataFit_original_withInt <- lmFit(DesignVoom_withInt, Design_withInt)
> dataFit_original_noInt <- lmFit(DesignVoom_noInt, Design_noInt)
> head(dataFit_original_noInt$coefficients)
NoStim Stim Date Med_CV_COV
A1BG -5.14414498296569 -5.13122961153188 -0.0360652274290323 9.45002714460223
A2M -9.11077944211848 -9.87110820204270 -3.0511158414800814 27.88156253194389
AAAS 10.18738855678886 10.38215926349661 -0.0660609150940764 -11.09227736364429
AACS 3.02044625793081 2.72681124153659 -0.5158485696146826 3.12176076854942
AAED1 7.28469712246058 7.37038888310042 0.6860832924584279 -1.06582376856191
AAGAB 8.77255273811071 8.82868405242133 -1.4424142811213769 -3.16404426359956
> head(dataFit_original_withInt$coefficients)
Intercept Stim Date Med_CV_COV
A1BG -5.14414498296570 0.0129153714338110 -0.0360652274290349 9.45002714460226
A2M -9.11077944211846 -0.7603287599242156 -3.0511158414800765 27.88156253194384
AAAS 10.18738855678885 0.1947707067077467 -0.0660609150940770 -11.09227736364427
AACS 3.02044625793081 -0.2936350163942217 -0.5158485696146839 3.12176076854943
AAED1 7.28469712246058 0.0856917606398358 0.6860832924584255 -1.06582376856189
AAGAB 8.77255273811074 0.0561313143106106 -1.4424142811213743 -3.16404426359961
In the documentation for contrasts.fit, there's a warning that the function doesn't re-factorize the design matrix for every gene if there are quality weights, and instead uses the initial approximation for every gene. However, I don't think this is causing the discrepancy here since it happens whether or not I call voom or voomWithQualityWeights. I think there could be some approximation happening in lmfit that is dependent on the model structure. Since voom internally calls lmfit, that might explain why the small differences occur at both the voom call and the subsequent model fitting calls (?).
The rest of my analysis calls contrasts.fit, eBayes, and topTable, followed by filtering for genes that meet my thresholds:
dataFit_contrasts_withInt <- contrasts.fit(dataFit_original_withInt, contrasts = contrast_matrix_withInt)
dataFit_contrasts_noInt <- contrasts.fit(dataFit_original_noInt, contrasts = contrast_matrix_noInt)
### With intercept
dataFit_withInt <- eBayes(dataFit_contrasts_withInt)
dataResults_withInt <- topTable(dataFit_withInt, number = nrow(counts_data), coef = "Stim")
sig_genes_pval_withInt <- dataResults_withInt[c(dataResults_withInt$adj.P.Val < 0.05),]
sig_genes_withInt <- sig_genes_pval_withInt[c(abs(sig_genes_pval_withInt$logFC) > log2(1.5)),]
### No intercept
dataFit_noInt <- eBayes(dataFit_contrasts_noInt)
dataResults_noInt <- topTable(dataFit_noInt, number = nrow(counts_data), coef="Stim - NoStim")
sig_genes_pval_noInt <- dataResults_noInt[c(dataResults_noInt$adj.P.Val < 0.05),]
sig_genes_noInt <- sig_genes_pval_noInt[c(abs(sig_genes_pval_noInt$logFC) > log2(1.5)),]
> nrow(sig_genes_noInt)
[1] 83
> nrow(sig_genes_withInt)
[1] 104
The result is 104 DE genes if I include an intercept, and 83 if I don't.
Does anyone know what might be going on here?
I'm running limma version 3.34.9 with R version 3.4.4.
