Question: Results from variancePartition not matching those from lme4
0
7 weeks ago by
c.taylor0
c.taylor0 wrote:

I have been getting some slightly strange results from the variancePartition package. This prompted me to dig deeper by looking at the output of some of the individual mixed models fitted within. However, the results from fitVarPartModel() do not match those that I get from running what I think is the same model using lmer().

Presumably there is some option/argument that is set differently within variancePartition but I can't figure it out. See example below, based on one given in the variancePartition vignette. Note that some elements of the output are very similar or identical, but for example the residual standard deviation is very different between the two methods. What is different about the two models that I'm fitting that is leading to the different outputs?

# Prepare data
library(variancePartition)
data(varPartData)
weights <- matrix(runif(length(geneExpr)), nrow = nrow(geneExpr))

# Fit model on first two rows of data using variancePartition (get an error if I try to fit a single row)
res <- fitVarPartModel(geneExpr[1:2,], ~ (1|Individual) + (1|Tissue) + Age + Height, info,
weightsMatrix = weights[1:2,])
# Fit model on first row of data using lme4
resLME <- lmer(geneExpr[1, ] ~ (1|Individual) + (1|Tissue) + Age + Height,
data = info, weights = weights[1,], REML = F)

# Compare outputs
summary(res[[1]])$varcor summary(resLME)$varcor


Session info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

Random number generation:
RNG:     Mersenne-Twister
Normal:  Inversion
Sample:  Rounding

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] splines   parallel  stats     graphics  grDevices utils     datasets  methods
[9] base

other attached packages:
[1] lme4_1.1-21              Matrix_1.2-17            variancePartition_1.14.0
[4] Biobase_2.44.0           BiocGenerics_0.30.0      scales_1.0.0
[7] foreach_1.4.7            limma_3.40.4             ggplot2_3.2.1

loaded via a namespace (and not attached):
[1] progress_1.2.2     gtools_3.8.1       tidyselect_0.2.5   purrr_0.3.2
[5] reshape2_1.4.3     lattice_0.20-38    colorspace_1.4-1   vctrs_0.2.0
[9] rlang_0.4.0        pillar_1.4.2       nloptr_1.2.1       glue_1.3.1
[13] withr_2.1.2        plyr_1.8.4         stringr_1.4.0      munsell_0.5.0
[17] gtable_0.3.0       caTools_1.17.1.2   codetools_0.2-16   doParallel_1.0.15
[21] pbkrtest_0.4-7     Rcpp_1.0.2         KernSmooth_2.23-15 backports_1.1.4
[25] gdata_2.18.0       gplots_3.0.1.1     hms_0.5.1          stringi_1.4.3
[29] dplyr_0.8.3        grid_3.6.1         tools_3.6.1        bitops_1.0-6
[33] magrittr_1.5       lazyeval_0.2.2     tibble_2.1.3       crayon_1.3.4
[37] pkgconfig_2.0.2    zeallot_0.1.0      MASS_7.3-51.4      prettyunits_1.0.2
[41] assertthat_0.2.1   minqa_1.2.4        rstudioapi_0.10    iterators_1.0.12
[45] R6_2.4.0           boot_1.3-23        colorRamps_2.3     nlme_3.1-141
[49] compiler_3.6.1

modified 18 days ago by gabriel.hoffman80 • written 7 weeks ago by c.taylor0

I will test this out and get back to you this week. I have fixed a few bugs in my development version, so its possible that this resolved in the latest version I'm planning to post next week.

Cheers, Gabriel

Great, thanks for taking a look!

Answer: Results from variancePartition not matching those from lme4
2
6 weeks ago by
United States
gabriel.hoffman80 wrote:

That is certainly not ideal. The strange result is due to the fact that linear mixed model are solved with iterative algorithms that depend on:

1. the starting point of the parameters values and
2. the convergence criteria

variancePartition fits the model on the first gene and then uses this set of parameter values to initialize the values for subsequent genes. This can speed up convergence. In your case, the initial values are so close to zero, that they cause the iteration for subsequent genes to stop too early before converging to the correct value.

There are two solutions:

1. Don't use the values of the first gene for initialization. I will look into changing this.
2. Use a different algorithm / convergence criterion. Here, change the optimizer to 'Nelder_Mead'

cntrl = lme4::lmerControl(optimizer = 'Nelder_Mead', calc.derivs=FALSE, check.rankX="stop.deficient") fitVarPartModel(..., control=cntrl)

This should also work for fitExtractVarPartModel() and dream()

Hi Gabriel, thanks for this reply. This does indeed seem to be the problem. I have tried changing the optimizer as you suggest and this has removed the strange results that I was getting. As you say, it does run considerably slower (about 2.5x on my dataset) but the results are consistent. Many thanks for your help.

Hi, With interest I have read this thread, because some of our transcriptome experiments have a complex design for which analysis by linear mixed models would be useful. However, considering the fact that the lme4 and variancePartition results apparently don’t differ (actually, are in agreement) I got confused… Apologies if I overlooked something obvious, but I understood that variancePartition somehow utilizes the empirical Bayes moderated t-test from limma (function eBayes()), whereas lme4 doesn’t do this. So why should the results of variancePartition and lme4 be identical? As a side-note, the use of lme4 for RNA-seq analysis has been discussed previously on this list (here). Thanks, Guido

Answer: Results from variancePartition not matching those from lme4
1
7 weeks ago by
United States
gabriel.hoffman80 wrote:

Hi c.taylor, Thanks for your question. It turns out that the one difference is the estimated value of the residual variance. First I will talk about estimiating this terms on linear models before extending to linear mixed models.

# Start with a simple linear model
# set seed for reproducability
set.seed(1)

# draw random response
y = rnorm(1000)

# draw random weights
# multiply by 10 to exaggerate the difference in scale
weights <- 10*runif(length(y))

# scale weights so that each row (i.e. gene)
# has a mean of 1 to retain the original scaling
weights_sc = weights / mean(weights)

# fit linear model with weights
fit1 = lm(y ~ 1,  weights = weights)

# fit linear model with scaled weights
fit2 = lm(y ~ 1, weights = weights_sc)

# note that all the values of coefficients, t-stats, etc
# are identical in the two cases.
# They are still identical with covariates
coef(summary(fit1))
coef(summary(fit1))

# also the variance-covariance of the coefficients are the same
vcov(fit1)
vcov(fit2)

# Moreover, direct computation of the sd of residual is identical
sd(fit1$residuals) sd(fit2$residuals)

# the only thing that is different is the residual variance and sd.
# Compute the sd of the response
sd(y)

# now compute the residual sd from the two model fits
sigma(fit1)
sigma(fit2)


We see that fit2 (with scaled weights) gives a very similar answer to the unweighted sd(y). But fit1 is way off! This is because the scale of the weights is affecting the scale of the residual sd. Note that this is not generally an issue since the vcov() matrix automatically acounts for the diffence in scale. This issue only arises when we want the residual sd and variance.

How much is sigma(fit1) off by? Multiplying weights by a scalar shows that it is off by about a factor of sqrt(mean(weights)).

So how to we deal with this? Just scale the weights for each response to have a mean of 1 to retain the original scale.

Ok, so back to variancePartition.

library(variancePartition)
library(lme4)
library(limma)
data(varPartData)

# draw weights
# multiply by 10 to exaggerate the difference in scale
weights <- 10 * matrix(runif(length(geneExpr)), nrow = nrow(geneExpr))

# fit linear mixed model with variancePartition
res <- fitVarPartModel(geneExpr[1,,drop=FALSE], ~ (1|Individual) + (1|Tissue) + Age + Height, info, weightsMatrix = weights[1,,drop=FALSE])

# fit linear mixed model with lme4 directly
# Scale weights to have a mean of 1
resLME <- lmer(geneExpr[1, ] ~ (1|Individual) + (1|Tissue) + Age + Height,
data = info, weights = weights[1,] / mean(weights[1,]), REML = FALSE)

# Compare outputs
summary(res[[1]])$varcor summary(resLME)$varcor


Since we scaled the weights, the values are now the same. There might be a small difference in the 5th decimal place, but that is due to a small change in the convergence criteria for the iterative algorithm.

Many thanks for the response. I tested out your suggestion on my data, and it certainly explained some of the discrepancies I was seeing but not all, occasionally there was a difference cropping up that was not to do with the residual variance. I therefore did some digging and managed to narrow down far enough to get a clearer reproducible example. It turns out that depending on where you start in the list of genes, you get different results even for the same gene. This difference is normally very small but occasionally is massive.

Please see code example below. Here I tried 8 different starting indices, which takes a minute or so in total to run. The run that starts at index 151 gives a very different output to the other 7 runs. I have no idea if I got "lucky" here and therefore how commonly this occurs, except to say that in my own data with ~1000 genes there were two results that stood out to me as unusual that triggered my initial exploration.

I'm guessing this has something to do with random seed values? But why is the effect usually tiny and occasionally massive?

library(variancePartition)

data(varPartData)

# fit model for each gene
form <- ~ Age + (1|Individual) + (1|Tissue)
results <- fitVarPartModel( geneExpr, form, info )

# Try starting at different points in the dataset
startvals <- seq(1, 200, by = 25)
all.res <- list(length(startvals))

for (k in 1:length(startvals)) {
results.tmp <- fitVarPartModel(geneExpr[startvals[k]:200,,drop = F], form, info)[[200-startvals[k]+1]]
all.res[[k]] <- results.tmp
}

# Examine results across runs: variance estimates
lapply(all.res, function(x) summary(x)$varcor) # Results across runs: AIC values sapply(all.res, AIC)  Output: > lapply(all.res, function(x) summary(x)$varcor)
[[1]]
Groups     Name        Std.Dev.
Individual (Intercept) 3.09686
Tissue     (Intercept) 0.89371
Residual               1.00167

[[2]]
Groups     Name        Std.Dev.
Individual (Intercept) 3.09663
Tissue     (Intercept) 0.89389
Residual               1.00169

[[3]]
Groups     Name        Std.Dev.
Individual (Intercept) 3.09655
Tissue     (Intercept) 0.89344
Residual               1.00171

[[4]]
Groups     Name        Std.Dev.
Individual (Intercept) 3.09661
Tissue     (Intercept) 0.89372
Residual               1.00170

[[5]]
Groups     Name        Std.Dev.
Individual (Intercept) 3.09660
Tissue     (Intercept) 0.89374
Residual               1.00170

[[6]]
Groups     Name        Std.Dev.
Individual (Intercept) 3.0966
Tissue     (Intercept) 0.8938
Residual               1.0017

[[7]]
Groups     Name        Std.Dev.
Individual (Intercept) 3.2145e+00
Tissue     (Intercept) 3.6886e-05
Residual               1.2491e+00

[[8]]
Groups     Name        Std.Dev.
Individual (Intercept) 3.09663
Tissue     (Intercept) 0.89373
Residual               1.00170

> sapply(all.res, AIC)
[1] 392.2758 392.2758 392.2758 392.2758 392.2758 392.2758 421.1189 392.2758

Answer: Results from variancePartition not matching those from lme4
0
18 days ago by
United States
gabriel.hoffman80 wrote:

Hi c.taylor, Thanks for your bug report. I have fixed the issue in the latest release 1.14.1. Let me know if you find any other issues

-- Gabriel