I have multiple two-color microarray experiments, analyzed with the same technology, that are connected through common genotypes. I want to compute best linear unbiased estimates (BLUEs)/adjusted values for each RNA source across all experiments. I have tried both a combined one-step analysis in limma and a two-step approach using a linear mixed-model package (lme4 in R). The resulting BLUEs are highly correlated, but the estimated standard deviations of the errors are only moderately correlated. How can this be explained?
Full description and question
The dataset that I would like to analyze in limma has the following properties:
- Seven separate experiments, which share a common chip and image-reading technology.
- All RNA sources across experiments are biological replicates.
- All RNA sources that occur multiple times on the same array are technical replicates.
- The different experiments are connected via some common genotypes, but not via common arrays.
- The hybridization pairs were selected in a way that ensures that all RNA sources are connected (interwoven loop design).
- A two-color design was used in which, within each experiment, every genotype was labeled once with "Cy3" and once with "Cy5".
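To make the layout concrete, a targets frame for two such experiments could look like the following minimal sketch (the genotype and slide names are made up for illustration):

    # Hypothetical example (made-up names) of a targets frame for two experiments
    # connected through the common genotypes GenoA and GenoB; within each
    # experiment every genotype is labeled once with Cy3 and once with Cy5.
    targets <- data.frame(
      SlideNumber = 1:6,
      Cy3        = c("GenoA", "GenoB", "GenoC", "GenoA", "GenoD", "GenoB"),
      Cy5        = c("GenoB", "GenoC", "GenoA", "GenoD", "GenoB", "GenoA"),
      Experiment = c("Exp1",  "Exp1",  "Exp1",  "Exp2",  "Exp2",  "Exp2")
    )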
Our goal is to compute best linear unbiased estimates (BLUEs)/adjusted values of gene expression for each RNA source. We are NOT interested in the differences in gene expression between the RNA sources, as is usually the case. I see two options to achieve this goal:
1. A one-step approach where all arrays from every experiment are included in a single limma analysis. To account for the existence of multiple experiments, an "Experiment" effect is added by augmenting the design matrix with one indicator column per experiment.
    rnasources <- uniqueTargets(targets)                  # all unique RNA sources
    design <- modelMatrix(targets, ref = rnasources[1])   # design matrix relative to a reference source
    fact_mod <- model.matrix(~ 0 + Experiment, dd)        # one indicator column per experiment
    design <- cbind(design, fact_mod)                     # augment the design matrix

where dd is a data frame with one row per array that records which experiment each array belongs to.
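For illustration, the augmented design then enters the usual limma pipeline roughly as follows (a minimal sketch; MA stands for the normalized MAList of all arrays, and the object names are of course arbitrary):

    fit <- lmFit(MA, design)   # MA: normalized MAList containing all arrays
    fit <- eBayes(fit)
    head(fit$coefficients)     # per-gene estimates for the RNA sources and the Experiment columns
    head(fit$sigma)            # per-gene residual standard deviations (the "sigma" I compare below)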
2. A two-step approach where each experiment is analyzed separately in limma, the BLUEs are extracted from each individual experiment, and these estimates then enter a linear mixed model with an "Experiment" effect (fitted, for example, with lme4 or asreml); a minimal sketch of this second stage is given below.
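For a single gene, the second stage looks roughly like this (a minimal sketch; blue_df is a hypothetical data frame of the stage-one BLUEs with columns BLUE, RNASource and Experiment):

    library(lme4)
    # blue_df: hypothetical data frame of stage-one BLUEs for one gene,
    # with columns BLUE, RNASource and Experiment
    fit2 <- lmer(BLUE ~ 0 + RNASource + (1 | Experiment), data = blue_df)
    fixef(fit2)              # BLUEs/adjusted values per RNA source
    ranef(fit2)$Experiment   # BLUPs of the random Experiment effect
    sigma(fit2)              # residual standard deviation for this gene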
I have carried out both analyses, and there is a very high correlation (r = 0.97) between the LS-means from the one-stage and the two-stage analysis. However, the correlation between the estimated standard deviations of the errors (sigma in both lme4 and limma) is merely 0.7. Now I am wondering about the reason for these differences. Interestingly, the correlation between the best linear unbiased predictors (BLUPs) of the random "Experiment" effect in the two-stage model and the best linear unbiased estimates (BLUEs) of the fixed "Experiment" effect in the one-stage model is close to zero, which I did not expect in advance (the comparison is sketched after the list below). Some thoughts on possible explanations:
- The model matrix for the one-stage analysis could be incorrect. I have read in other posts that it is fine to augment the design matrix with the levels of a "Batch" effect. I consider that case comparable to mine, so I do not immediately see a problem here, but please correct me if I am wrong.
- The precision of the sigmas is considerably higher for the one-stage approach than for the two-stage approach, since there is simply more information available for estimating the BLUEs.
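The comparison itself is nothing more than per-gene correlations across the two approaches, roughly like this (a sketch assuming the per-gene results have already been collected into vectors in matching gene order; the object names are made up):

    # Hypothetical vectors, one entry per gene, in matching order
    cor(lsmeans_one_stage, lsmeans_two_stage)                   # ~0.97 in my data
    cor(sigma_one_stage, sigma_two_stage)                       # ~0.7
    cor(blue_experiment_one_stage, blup_experiment_two_stage)   # close to zero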
I assume that you're referring to the BLUEs of the coefficients when you're talking about "RNA sources". The best estimate of the observed expression value for each sample would be, well, the observed expression value.
Yes, I am referring to the log-ratio between a given RNA source and my chosen reference for each gene.