Dear Gordon,
Many thanks for your message! Of course, I agree with you that mixed
model theory as such is pretty complicated and still to some extent in
development. But what I find easier is the specification part of the
model, where all you have to do is specify the formula that captures
the dependencies/sources of correlation in the data and the fixed
factors plus possible interactions that you want to fit, as opposed to
the design matrix in limma, which (at least I sometimes find) tricky
to specify for more complex multifactorial designs (even with
functions designMatrix() and contrast.fit()). Also, the new R packages
lmerTest and lsmeans would make a mixed model approach much easier
than before, as you can now easily get the desired tests and log2
ratio contrasts and lsmeans (plus confidence limits) for the different
treatment levels out of the lme4 fit (as you could in SAS before
already for a long time, but alas not in R). I am not saying that all
this would be trivial, but it could be made quite easy with a
dedicated package I believe.
The method would indeed analyze two-colour designs as single-colour
ones, and take into account the pairing of R and G labels on the same
array using a random factor. The difference though would be that many
sources of dependency/correlation could readily be taken into account,
e.g. repeated use of the same sample due to dye swaps and technical
replication, replicate spots on the same slide, repeated use of
samples with the same genetic background, etc etc... Maanova does
indeed allow for this approach to some extent, although I believe lme4
combined with lsmeans and lmerTest would be much faster and would
allow for more complex designs (e.g. with hierarchically nested random
factors, multiple crossed random factors, random slopes models, etc,
the SAS example e.g.,
http://support.sas.com/documentation/cdl/en/stat
ug/63033/HTML/default/viewer.htm#statug_hpmixed_sect035.htm, could not
be fit using Maanova I believe, whereas I could do it no probs with
lme4). And the possible generalization to RNAseq data I think would be
very interesting. Maybe I'll start doing some simulations to
systematically compare the statistical power of these different
approaches for some different experimental designs (many thanks also
for the Altman paper, in my case I was surprised to find though that I
seemingly got a better statistical power even with a simple single-
factor (2 group) 2-colour randomized block design analysed with a
mixed model, for which the Altman paper claims that the power of the
different approaches should be comparable).
Oh yes, and was meaning RGList of course :-)
Well, I'll keep you posted if I make some further progress with this -
and let me know if you would be interested in co-developing or helping
out in some way with such a package!
Cheers,
Tom
-----Original Message-----
From: Gordon K Smyth [mailto:smyth@wehi.EDU.AU]
Sent: 30 June 2013 05:42
To: Tom Wenseleers
Cc: Bioconductor mailing list
Subject: Statistical power of limma vs mixed model approach to analyze
microarray and differentialproteomics/peptidomics experiments and
Poisson/binomial mixed models for RNAseq data
Dear Tom,
I must say that this is first time I have heard anyone suggest that a
mixed linear model is easier or simpler than an ordinary linear model.
Having used the lme4 package myself, the opposite seems true to me by
a very long way.
Do you know that mixed models have been proposed and implemented
several times for microarray data? For example the maanova package or
the
lmscFit() separate channel approach of the limma package. A recent
paper by Naomi Altman and myself gives a comparison of some of the
approaches:
http://www.biomedcentral.com/1471-2105/14/165
For some designs, a mixed model or separate approach can indeed
recover more information than a classic log-ratio analysis. I don't
fully follow the approach that you describe, but it sounds to have
some similarities with the limma separate channel approach as well as
with maanova.
limma can be generalized to RNA-seq, see
http://www.statsci.org/smyth/pubs/VoomPreprint.pdf
limma does allow for a treatment of randomized block designs but,
apart from this exception, it is certainly true that RNA-seq analysis
methods do not allow random effect models. So there is a gap in the
market if you are keen to fill it.
There isn't any data object class called RGLimma that I know of. Do
you mean RGList?
Best wishes
Gordon
> Date: Sat, 29 Jun 2013 02:49:53 +0000
> From: Tom Wenseleers <tom.wenseleers at="" bio.kuleuven.be="">
> To: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
> Subject: [BioC] Statistical power of limma vs mixed model approach
to
> analyze microarray and differential proteomics/peptidomics
experiments
> and Poisson/binomial mixed models for RNAseq data
>
> Dear all,
> I was just analyzing some data from a differential (dual label
LC/MS)
> single factor peptidomics experiment using both a per-feature mixed
> model approach (using lme4's lmer function, using a model of the
form
> intensity/abundance ~ (1|run)+label+treatment, and testing
> significance and calculating least square means of the differential
> expression ratios for the different treatments using lmerTest and
> lsmeans, using a Satterthwaite df approximation) and an empirical
> Bayes standard deviation shrinkage method as implemented in limma
(by
> putting my data in an RGLimma object). To my surprise, I am finding
> more statistically significantly differentially expressed features
> using the mixed model approach than using limma (independent of the
> types of normalizations I use in limma, and I did check that the sd
vs abundance was flat).
> Before, I have also observed the same phenomenon in some microarray
> datasets.
>
> I am wondering if this would point towards a mixed model approach
> having superior statistical power than limma, even for this very
simple design.
> Could this be the case? I only did 8 replicate runs in this case
> (corresponding to arrays in 2-color microarray exps), so not
terribly
> many... Could it be that limma has superior statistical power for
the
> analysis of relatively small designs with few experimental
replicates
> (where the SD shrinkage could be beneficial), but that for larger
> number of replicates and also for complex designs (in particular
when
> there are multiple sources of dependencies in the data) a mixed
model
> approach would work better?
>
> Also, would a mixed model approach in general not be much easier to
> specify (just requiring the model formula to be specified, as
opposed
> to all the contrasts, which becomes pretty tedious for complex
> designs) and statistically be more flexible and powerful for
> multifactorial experiments (e.g. easily allowing for treatment
> interaction effects to be tested, and also allowing for multiple
crossed random factors, e.g.
> to take into account several dependencies in the data, e.g. owing to
> spatial correlations in the microarray slides/print tip effects,
other
> random factors, e.g. due to the use of samples with multiple genetic
> backgrounds or the presence of repeated measures factors, etc). For
> repeated measures designs I would also think that mixed model fits
> obtained using nlme could be even more flexible, e.g. allowing for
> various custom error covariance structures (e.g. to take into
account
> temporal autocorrelations, etc); in fact, even lmer supports
> unstructured covariance mode! ls (which would allow variances to be
> different in your different groups, which I think could be quite
> common). Model selection, on the other hand, would appear a little
> more tricky, but also feasible, e.g. by minimizing the AIC of a
model
> that is fit over all features/genes combined (as in
>
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/defaul
t/viewer.htm#statug_hpmixed_sect035.htm#statug.hpmixed.hpmeg4b).
>
> Finally, an important advantage that I would see of using this
> approach is that it would readily generalize to RNAseq data if one
> would use generalized mixed models (glmer's) with a Poisson error
> structure (correcting for the total nr of mapped reads, etc and
other
> important
> covariates) (or, alternatively, using a binomial error structure, to
> analyze the proportion of reads mapping to a particular gene).
> Overdispersion in the data in this approach could readily be taken
> into account by including an observation-level random factor in the
> model, cf.
>
https://stat.ethz.ch/pipermail/r-sig-mixed-
models/2011q1/015867.html.
> The advantage here again would be that complex designs, e.g. with
> multiple crossed random factors, could readily be taken into account
> (to my understanding, edgeR or DEseq do not allow for this, right?).
> Would anyone perhaps have any thoughts why not to go for such a
> generic approach?
>
> Would there perhaps be any market for a package to analyze
> differential expression in limma RGLimma and ESet data (or
> CountDataSet or DGEList objects for RNAseq data) using the approach
I
> outline above? If there is enough interest, I would be keen to
develop
> it, as it would seem pretty straightforward to do...
>
> Cheers,
> Tom Wenseleers
>
>
>
______________________________________________________________________
> _________________
>
> Prof. Tom Wenseleers
> * Lab. of Socioecology and Social Evolution
> Dept. of Biology
> Zoological Institute
> K.U.Leuven
> Naamsestraat 59, box 2466
> B-3000 Leuven
> Belgium
> * +32 (0)16 32 39 64 / +32 (0)472 40 45 96
> * tom.wenseleers at bio.kuleuven.be
>
http://bio.kuleuven.be/ento/wenseleers/twenseleers.htm
>
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:6}}