Question

Results from variancePartition not matching those from lme4

0

Entering edit mode

c.taylor • 0

@ctaylor-21442

Last seen 4.7 years ago

I have been getting some slightly strange results from the variancePartition package. This prompted me to dig deeper by looking at the output of some of the individual mixed models fitted within. However, the results from fitVarPartModel() do not match those that I get from running what I think is the same model using lmer().

Presumably there is some option/argument that is set differently within variancePartition but I can't figure it out. See example below, based on one given in the variancePartition vignette. Note that some elements of the output are very similar or identical, but for example the residual standard deviation is very different between the two methods. What is different about the two models that I'm fitting that is leading to the different outputs?

# Prepare data
library(variancePartition)
data(varPartData)
weights <- matrix(runif(length(geneExpr)), nrow = nrow(geneExpr))

# Fit model on first two rows of data using variancePartition (get an error if I try to fit a single row)
res <- fitVarPartModel(geneExpr[1:2,], ~ (1|Individual) + (1|Tissue) + Age + Height, info,
                       weightsMatrix = weights[1:2,])
# Fit model on first row of data using lme4
resLME <- lmer(geneExpr[1, ] ~ (1|Individual) + (1|Tissue) + Age + Height, 
                   data = info, weights = weights[1,], REML = F)

# Compare outputs
summary(res[[1]])$varcor
summary(resLME)$varcor

Session info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] splines   parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] lme4_1.1-21              Matrix_1.2-17            variancePartition_1.14.0
[4] Biobase_2.44.0           BiocGenerics_0.30.0      scales_1.0.0            
[7] foreach_1.4.7            limma_3.40.4             ggplot2_3.2.1           

loaded via a namespace (and not attached):
 [1] progress_1.2.2     gtools_3.8.1       tidyselect_0.2.5   purrr_0.3.2       
 [5] reshape2_1.4.3     lattice_0.20-38    colorspace_1.4-1   vctrs_0.2.0       
 [9] rlang_0.4.0        pillar_1.4.2       nloptr_1.2.1       glue_1.3.1        
[13] withr_2.1.2        plyr_1.8.4         stringr_1.4.0      munsell_0.5.0     
[17] gtable_0.3.0       caTools_1.17.1.2   codetools_0.2-16   doParallel_1.0.15 
[21] pbkrtest_0.4-7     Rcpp_1.0.2         KernSmooth_2.23-15 backports_1.1.4   
[25] gdata_2.18.0       gplots_3.0.1.1     hms_0.5.1          stringi_1.4.3     
[29] dplyr_0.8.3        grid_3.6.1         tools_3.6.1        bitops_1.0-6      
[33] magrittr_1.5       lazyeval_0.2.2     tibble_2.1.3       crayon_1.3.4      
[37] pkgconfig_2.0.2    zeallot_0.1.0      MASS_7.3-51.4      prettyunits_1.0.2 
[41] assertthat_0.2.1   minqa_1.2.4        rstudioapi_0.10    iterators_1.0.12  
[45] R6_2.4.0           boot_1.3-23        colorRamps_2.3     nlme_3.1-141      
[49] compiler_3.6.1

variancepartition lme4 mixedmodel • 1.5k views

ADD COMMENT • link updated 4.6 years ago by gabriel.hoffman ▴ 110 • written 4.7 years ago by c.taylor • 0

0

Entering edit mode

I will test this out and get back to you this week. I have fixed a few bugs in my development version, so its possible that this resolved in the latest version I'm planning to post next week.

Cheers, Gabriel

ADD REPLY • link 4.7 years ago gabriel.hoffman ▴ 110

0

Entering edit mode

Great, thanks for taking a look!

ADD REPLY • link 4.7 years ago c.taylor • 0

1

Entering edit mode

gabriel.hoffman ▴ 110

@gabrielhoffman-8391

Last seen 12 days ago

United States

Hi c.taylor, Thanks for your question. It turns out that the one difference is the estimated value of the residual variance. First I will talk about estimiating this terms on linear models before extending to linear mixed models.

# Start with a simple linear model
# set seed for reproducability
set.seed(1)

# draw random response
y = rnorm(1000)

# draw random weights
# multiply by 10 to exaggerate the difference in scale
weights <- 10*runif(length(y))

# scale weights so that each row (i.e. gene)
# has a mean of 1 to retain the original scaling
weights_sc = weights / mean(weights)

# fit linear model with weights
fit1 = lm(y ~ 1,  weights = weights)

# fit linear model with scaled weights
fit2 = lm(y ~ 1, weights = weights_sc)

# note that all the values of coefficients, t-stats, etc
# are identical in the two cases. 
# They are still identical with covariates  
coef(summary(fit1))
coef(summary(fit1))

# also the variance-covariance of the coefficients are the same
vcov(fit1)
vcov(fit2)

# Moreover, direct computation of the sd of residual is identical
sd(fit1$residuals)
sd(fit2$residuals)

# the only thing that is different is the residual variance and sd.
# Compute the sd of the response
sd(y)

# now compute the residual sd from the two model fits
sigma(fit1)
sigma(fit2)

We see that fit2 (with scaled weights) gives a very similar answer to the unweighted sd(y). But fit1 is way off! This is because the scale of the weights is affecting the scale of the residual sd. Note that this is not generally an issue since the vcov() matrix automatically acounts for the diffence in scale. This issue only arises when we want the residual sd and variance.

How much is sigma(fit1) off by? Multiplying weights by a scalar shows that it is off by about a factor of sqrt(mean(weights)).

So how to we deal with this? Just scale the weights for each response to have a mean of 1 to retain the original scale.

Ok, so back to variancePartition.

library(variancePartition)
library(lme4)
library(limma)
data(varPartData)

# draw weights
# multiply by 10 to exaggerate the difference in scale
weights <- 10 * matrix(runif(length(geneExpr)), nrow = nrow(geneExpr))

# fit linear mixed model with variancePartition
res <- fitVarPartModel(geneExpr[1,,drop=FALSE], ~ (1|Individual) + (1|Tissue) + Age + Height, info, weightsMatrix = weights[1,,drop=FALSE])

# fit linear mixed model with lme4 directly
# Scale weights to have a mean of 1
resLME <- lmer(geneExpr[1, ] ~ (1|Individual) + (1|Tissue) + Age + Height, 
                   data = info, weights = weights[1,] / mean(weights[1,]), REML = FALSE)

# Compare outputs
summary(res[[1]])$varcor
summary(resLME)$varcor

Since we scaled the weights, the values are now the same. There might be a small difference in the 5th decimal place, but that is due to a small change in the convergence criteria for the iterative algorithm.

ADD COMMENT • link 4.7 years ago gabriel.hoffman ▴ 110

0

Entering edit mode

Many thanks for the response. I tested out your suggestion on my data, and it certainly explained some of the discrepancies I was seeing but not all, occasionally there was a difference cropping up that was not to do with the residual variance. I therefore did some digging and managed to narrow down far enough to get a clearer reproducible example. It turns out that depending on where you start in the list of genes, you get different results even for the same gene. This difference is normally very small but occasionally is massive.

Please see code example below. Here I tried 8 different starting indices, which takes a minute or so in total to run. The run that starts at index 151 gives a very different output to the other 7 runs. I have no idea if I got "lucky" here and therefore how commonly this occurs, except to say that in my own data with ~1000 genes there were two results that stood out to me as unusual that triggered my initial exploration.

I'm guessing this has something to do with random seed values? But why is the effect usually tiny and occasionally massive?

library(variancePartition)

# load simulated data
data(varPartData)

# fit model for each gene
form <- ~ Age + (1|Individual) + (1|Tissue) 
results <- fitVarPartModel( geneExpr, form, info )

# Try starting at different points in the dataset
startvals <- seq(1, 200, by = 25)
all.res <- list(length(startvals))

for (k in 1:length(startvals)) {
  results.tmp <- fitVarPartModel(geneExpr[startvals[k]:200,,drop = F], form, info)[[200-startvals[k]+1]]
  all.res[[k]] <- results.tmp
}

# Examine results across runs: variance estimates
lapply(all.res, function(x) summary(x)$varcor)

# Results across runs: AIC values
sapply(all.res, AIC)

Output:

> lapply(all.res, function(x) summary(x)$varcor)
[[1]]
 Groups     Name        Std.Dev.
 Individual (Intercept) 3.09686 
 Tissue     (Intercept) 0.89371 
 Residual               1.00167 

[[2]]
 Groups     Name        Std.Dev.
 Individual (Intercept) 3.09663 
 Tissue     (Intercept) 0.89389 
 Residual               1.00169 

[[3]]
 Groups     Name        Std.Dev.
 Individual (Intercept) 3.09655 
 Tissue     (Intercept) 0.89344 
 Residual               1.00171 

[[4]]
 Groups     Name        Std.Dev.
 Individual (Intercept) 3.09661 
 Tissue     (Intercept) 0.89372 
 Residual               1.00170 

[[5]]
 Groups     Name        Std.Dev.
 Individual (Intercept) 3.09660 
 Tissue     (Intercept) 0.89374 
 Residual               1.00170 

[[6]]
 Groups     Name        Std.Dev.
 Individual (Intercept) 3.0966  
 Tissue     (Intercept) 0.8938  
 Residual               1.0017  

[[7]]
 Groups     Name        Std.Dev.  
 Individual (Intercept) 3.2145e+00
 Tissue     (Intercept) 3.6886e-05
 Residual               1.2491e+00

[[8]]
 Groups     Name        Std.Dev.
 Individual (Intercept) 3.09663 
 Tissue     (Intercept) 0.89373 
 Residual               1.00170 

> sapply(all.res, AIC)
[1] 392.2758 392.2758 392.2758 392.2758 392.2758 392.2758 421.1189 392.2758

ADD REPLY • link 4.7 years ago c.taylor • 0

0

Entering edit mode

gabriel.hoffman ▴ 110

@gabrielhoffman-8391

Last seen 12 days ago

United States

Hi c.taylor, Thanks for your bug report. I have fixed the issue in the latest release 1.14.1. Let me know if you find any other issues

-- Gabriel

ADD COMMENT • link 4.6 years ago gabriel.hoffman ▴ 110

score 2 · Accepted Answer · 2019-09-05

2

Entering edit mode

gabriel.hoffman ▴ 110

@gabrielhoffman-8391

Last seen 12 days ago

United States

That is certainly not ideal. The strange result is due to the fact that linear mixed model are solved with iterative algorithms that depend on:

the starting point of the parameters values and
the convergence criteria

variancePartition fits the model on the first gene and then uses this set of parameter values to initialize the values for subsequent genes. This can speed up convergence. In your case, the initial values are so close to zero, that they cause the iteration for subsequent genes to stop too early before converging to the correct value.

There are two solutions:

Don't use the values of the first gene for initialization. I will look into changing this.
Use a different algorithm / convergence criterion. Here, change the optimizer to 'Nelder_Mead'

cntrl = lme4::lmerControl(optimizer = 'Nelder_Mead', calc.derivs=FALSE, check.rankX="stop.deficient") fitVarPartModel(..., control=cntrl)

This should also work for fitExtractVarPartModel() and dream()

ADD COMMENT • link 4.7 years ago gabriel.hoffman ▴ 110

0

Entering edit mode

Hi Gabriel, thanks for this reply. This does indeed seem to be the problem. I have tried changing the optimizer as you suggest and this has removed the strange results that I was getting. As you say, it does run considerably slower (about 2.5x on my dataset) but the results are consistent. Many thanks for your help.

ADD REPLY • link 4.7 years ago c.taylor • 0

0

Entering edit mode

Hi, With interest I have read this thread, because some of our transcriptome experiments have a complex design for which analysis by linear mixed models would be useful. However, considering the fact that the lme4 and variancePartition results apparently don’t differ (actually, are in agreement) I got confused… Apologies if I overlooked something obvious, but I understood that variancePartition somehow utilizes the empirical Bayes moderated t-test from limma (function eBayes()), whereas lme4 doesn’t do this. So why should the results of variancePartition and lme4 be identical? As a side-note, the use of lme4 for RNA-seq analysis has been discussed previously on this list (here). Thanks, Guido

ADD REPLY • link 4.6 years ago Guido Hooiveld ★ 3.9k