Statistical power of limma vs mixed model approach to analyze microarray and differentialproteomics/peptidomics experiments and Poisson/binomial mixed models for RNAseq data
2
0
Entering edit mode
@gordon-smyth
Last seen 12 hours ago
WEHI, Melbourne, Australia

Note: this post was in answer to this question: Statistical power of limma vs mixed model approach to analyze microarray and differential proteomics/peptidomics experiments and Poisson/binomial mixed models for RNAseq data

Dear Tom,

I must say that this is first time I have heard anyone suggest that a mixed linear model is easier or simpler than an ordinary linear model. Having used the lme4 package myself, the opposite seems true to me by a very long way.

Do you know that mixed models have been proposed and implemented several times for microarray data?  For example the maanova package or the lmscFit() separate channel approach of the limma package.  A recent paper by Naomi Altman and myself gives a comparison of some of the approaches:

   http://www.biomedcentral.com/1471-2105/14/165

For some designs, a mixed model or separate approach can indeed recover more information than a classic log-ratio analysis.  I don't fully follow the approach that you describe, but it sounds to have some similarities with the limma separate channel approach as well as with maanova.

limma can be generalized to RNA-seq, see

   http://www.statsci.org/smyth/pubs/VoomPreprint.pdf

limma does allow for a treatment of randomized block designs but, apart from this exception, it is certainly true that RNA-seq analysis methods do not allow random effect models.  So there is a gap in the market if you are keen to fill it.

There isn't any data object class called RGLimma that I know of.  Do you mean RGList?

Best wishes
Gordon

limma maanova rnaseq • 5.0k views
ADD COMMENT
0
Entering edit mode
wang peter ★ 2.0k
@wang-peter-4647
Last seen 10.3 years ago
that is a good question. i have a similar question. if i have differential transcriptome and proteomics data under two conditions, like (a1,a2,a3) vs. (b1,b2,b3) and (A1,A2,A3) vs. (B1,B2,B3). a and b means transcriptome RNA DATA. A and B means proteomics DATA. so can i use a mixed linear model to find differential genes between (a1,a2,a3,A1,A2,A3) vs. (b1,b2,b3, B1,B2,B3). On Sat, Jun 29, 2013 at 11:42 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote: > Dear Tom, > > I must say that this is first time I have heard anyone suggest that a > mixed linear model is easier or simpler than an ordinary linear model. > Having used the lme4 package myself, the opposite seems true to me by a > very long way. > > Do you know that mixed models have been proposed and implemented several > times for microarray data? For example the maanova package or the > lmscFit() separate channel approach of the limma package. A recent paper > by Naomi Altman and myself gives a comparison of some of the approaches: > > http://www.biomedcentral.com/**1471-2105/14/165<http: www.biomedc="" entral.com="" 1471-2105="" 14="" 165=""> > > For some designs, a mixed model or separate approach can indeed recover > more information than a classic log-ratio analysis. I don't fully follow > the approach that you describe, but it sounds to have some similarities > with the limma separate channel approach as well as with maanova. > > limma can be generalized to RNA-seq, see > > http://www.statsci.org/smyth/**pubs/VoomPreprint.pdf<http: www.st="" atsci.org="" smyth="" pubs="" voompreprint.pdf=""> > > limma does allow for a treatment of randomized block designs but, apart > from this exception, it is certainly true that RNA-seq analysis methods do > not allow random effect models. So there is a gap in the market if you are > keen to fill it. > > There isn't any data object class called RGLimma that I know of. Do you > mean RGList? > > Best wishes > Gordon > > Date: Sat, 29 Jun 2013 02:49:53 +0000 >> From: Tom Wenseleers <tom.wenseleers@bio.kuleuven.**be<tom.wenseleers@bio.kuleuven.be> >> > >> To: "bioconductor@r-project.org" <bioconductor@r-project.org> >> Subject: [BioC] Statistical power of limma vs mixed model approach to >> analyze microarray and differential proteomics/peptidomics >> experiments >> and Poisson/binomial mixed models for RNAseq data >> >> Dear all, >> > > I was just analyzing some data from a differential (dual label LC/MS) >> single factor peptidomics experiment using both a per-feature mixed model >> approach (using lme4's lmer function, using a model of the form >> intensity/abundance ~ (1|run)+label+treatment, and testing significance and >> calculating least square means of the differential expression ratios for >> the different treatments using lmerTest and lsmeans, using a Satterthwaite >> df approximation) and an empirical Bayes standard deviation shrinkage >> method as implemented in limma (by putting my data in an RGLimma object). >> To my surprise, I am finding more statistically significantly >> differentially expressed features using the mixed model approach than using >> limma (independent of the types of normalizations I use in limma, and I did >> check that the sd vs abundance was flat). Before, I have also observed the >> same phenomenon in some microarray datasets. >> >> I am wondering if this would point towards a mixed model approach having >> superior statistical power than limma, even for this very simple design. >> Could this be the case? I only did 8 replicate runs in this case >> (corresponding to arrays in 2-color microarray exps), so not terribly >> many... Could it be that limma has superior statistical power for the >> analysis of relatively small designs with few experimental replicates >> (where the SD shrinkage could be beneficial), but that for larger number of >> replicates and also for complex designs (in particular when there are >> multiple sources of dependencies in the data) a mixed model approach would >> work better? >> >> Also, would a mixed model approach in general not be much easier to >> specify (just requiring the model formula to be specified, as opposed to >> all the contrasts, which becomes pretty tedious for complex designs) and >> statistically be more flexible and powerful for multifactorial experiments >> (e.g. easily allowing for treatment interaction effects to be tested, and >> also allowing for multiple crossed random factors, e.g. to take into >> account several dependencies in the data, e.g. owing to spatial >> correlations in the microarray slides/print tip effects, other random >> factors, e.g. due to the use of samples with multiple genetic backgrounds >> or the presence of repeated measures factors, etc). For repeated measures >> designs I would also think that mixed model fits obtained using nlme could >> be even more flexible, e.g. allowing for various custom error covariance >> structures (e.g. to take into account temporal autocorrelations, etc); in >> fact, even lmer supports unstructured covariance mode! ls (which would >> allow variances to be different in your different groups, which I think >> could be quite common). Model selection, on the other hand, would appear a >> little more tricky, but also feasible, e.g. by minimizing the AIC of a >> model that is fit over all features/genes combined (as in >> http://support.sas.com/**documentation/cdl/en/statug/** >> 63033/HTML/default/viewer.htm#**statug_hpmixed_sect035.htm#** >> statug.hpmixed.hpmeg4b<http: support.sas.com="" documentation="" cdl="" en="" statug="" 63033="" html="" default="" viewer.htm#statug_hpmixed_sect035.htm%23stat="" ug.hpmixed.hpmeg4b=""> >> ). >> >> Finally, an important advantage that I would see of using this approach >> is that it would readily generalize to RNAseq data if one would use >> generalized mixed models (glmer's) with a Poisson error structure >> (correcting for the total nr of mapped reads, etc and other important >> covariates) (or, alternatively, using a binomial error structure, to >> analyze the proportion of reads mapping to a particular gene). >> Overdispersion in the data in this approach could readily be taken into >> account by including an observation-level random factor in the model, cf. >> https://stat.ethz.ch/**pipermail/r-sig-mixed- models/**2011q1/015867.html<https: stat.ethz.ch="" pipermail="" r-sig-="" mixed-models="" 2011q1="" 015867.html="">. >> The advantage here again would be that complex designs, e.g. with multiple >> crossed random factors, could readily be taken into account (to my >> understanding, edgeR or DEseq do not allow for this, right?). Would anyone >> perhaps have any thoughts why not to go for such a generic approach? >> >> Would there perhaps be any market for a package to analyze differential >> expression in limma RGLimma and ESet data (or CountDataSet or DGEList >> objects for RNAseq data) using the approach I outline above? If there is >> enough interest, I would be keen to develop it, as it would seem pretty >> straightforward to do... >> >> Cheers, >> Tom Wenseleers >> >> >> ______________________________**______________________________** >> ___________________________ >> >> Prof. Tom Wenseleers >> * Lab. of Socioecology and Social Evolution >> Dept. of Biology >> Zoological Institute >> K.U.Leuven >> Naamsestraat 59, box 2466 >> B-3000 Leuven >> Belgium >> * +32 (0)16 32 39 64 / +32 (0)472 40 45 96 >> * tom.wenseleers@bio.kuleuven.be >> http://bio.kuleuven.be/ento/**wenseleers/twenseleers.htm<http: bio="" .kuleuven.be="" ento="" wenseleers="" twenseleers.htm=""> >> >> > ______________________________**______________________________**____ ______ > The information in this email is confidential and inte...{{dropped:25}}
ADD COMMENT
0
Entering edit mode
Dear Peter, So if I get your question right you want to know whether there is a difference in the log ratio estimated from the RNAseq and the proteomics data, right? The problem I think that in this case the error structure for the data sets will be completely different. Perhaps you could do an arcsine square root transform on the proportion of RNAseq counts mapping to a particular gene to more or less get normally distributed residuals, and provided that that would work you might be able to test for a method x treatment interaction effect using a mixed model for a given gene/protein, though you would probably have to use nlme then and specify a custom error structure to allow for the variances to be different depending on the method you used (RNAseq or diff proteomics). Alternatively, you could estimate the 95% confidence intervals on the treatment contrasts in your proteomics and RNAseq datasets using separated mixed models and lsmeans, and then check whether or not they overlap. But probably all this is more hassle than it's worth, given that such datasets usually show very little overlap... Cheers, Tom From: Wang Peter [mailto:wng.peter@gmail.com] Sent: 30 June 2013 05:53 To: Gordon K Smyth Cc: Tom Wenseleers; Bioconductor mailing list Subject: Re: [BioC] Statistical power of limma vs mixed model approach to analyze microarray and differentialproteomics/peptidomics experiments and Poisson/binomial mixed models for RNAseq data that is a good question. i have a similar question. if i have differential transcriptome and proteomics data under two conditions, like (a1,a2,a3) vs. (b1,b2,b3) and (A1,A2,A3) vs. (B1,B2,B3). a and b means transcriptome RNA DATA. A and B means proteomics DATA. so can i use a mixed linear model to find differential genes between (a1,a2,a3,A1,A2,A3) vs. (b1,b2,b3, B1,B2,B3). On Sat, Jun 29, 2013 at 11:42 PM, Gordon K Smyth <smyth@wehi.edu.au<mailto:smyth@wehi.edu.au>> wrote: Dear Tom, I must say that this is first time I have heard anyone suggest that a mixed linear model is easier or simpler than an ordinary linear model. Having used the lme4 package myself, the opposite seems true to me by a very long way. Do you know that mixed models have been proposed and implemented several times for microarray data? For example the maanova package or the lmscFit() separate channel approach of the limma package. A recent paper by Naomi Altman and myself gives a comparison of some of the approaches: http://www.biomedcentral.com/1471-2105/14/165 For some designs, a mixed model or separate approach can indeed recover more information than a classic log-ratio analysis. I don't fully follow the approach that you describe, but it sounds to have some similarities with the limma separate channel approach as well as with maanova. limma can be generalized to RNA-seq, see http://www.statsci.org/smyth/pubs/VoomPreprint.pdf limma does allow for a treatment of randomized block designs but, apart from this exception, it is certainly true that RNA-seq analysis methods do not allow random effect models. So there is a gap in the market if you are keen to fill it. There isn't any data object class called RGLimma that I know of. Do you mean RGList? Best wishes Gordon Date: Sat, 29 Jun 2013 02:49:53 +0000 From: Tom Wenseleers <tom.wenseleers@bio.kuleuven.be<mailto:tom.wensel eers@bio.kuleuven.be="">> To: "bioconductor@r-project.org<mailto:bioconductor@r-project.org>" <bioconductor@r-project.org<mailto:bioconductor@r-project.org>> Subject: [BioC] Statistical power of limma vs mixed model approach to analyze microarray and differential proteomics/peptidomics experiments and Poisson/binomial mixed models for RNAseq data Dear all, I was just analyzing some data from a differential (dual label LC/MS) single factor peptidomics experiment using both a per-feature mixed model approach (using lme4's lmer function, using a model of the form intensity/abundance ~ (1|run)+label+treatment, and testing significance and calculating least square means of the differential expression ratios for the different treatments using lmerTest and lsmeans, using a Satterthwaite df approximation) and an empirical Bayes standard deviation shrinkage method as implemented in limma (by putting my data in an RGLimma object). To my surprise, I am finding more statistically significantly differentially expressed features using the mixed model approach than using limma (independent of the types of normalizations I use in limma, and I did check that the sd vs abundance was flat). Before, I have also observed the same phenomenon in some microarray datasets. I am wondering if this would point towards a mixed model approach having superior statistical power than limma, even for this very simple design. Could this be the case? I only did 8 replicate runs in this case (corresponding to arrays in 2-color microarray exps), so not terribly many... Could it be that limma has superior statistical power for the analysis of relatively small designs with few experimental replicates (where the SD shrinkage could be beneficial), but that for larger number of replicates and also for complex designs (in particular when there are multiple sources of dependencies in the data) a mixed model approach would work better? Also, would a mixed model approach in general not be much easier to specify (just requiring the model formula to be specified, as opposed to all the contrasts, which becomes pretty tedious for complex designs) and statistically be more flexible and powerful for multifactorial experiments (e.g. easily allowing for treatment interaction effects to be tested, and also allowing for multiple crossed random factors, e.g. to take into account several dependencies in the data, e.g. owing to spatial correlations in the microarray slides/print tip effects, other random factors, e.g. due to the use of samples with multiple genetic backgrounds or the presence of repeated measures factors, etc). For repeated measures designs I would also think that mixed model fits obtained using nlme could be even more flexible, e.g. allowing for various custom error covariance structures (e.g. to take into account temporal autocorrelations, etc); in fact, even lmer supports unstructured covariance mode! ls (which would allow variances to be different in your different groups, which I think could be quite common). Model selection, on the other hand, would appear a little more tricky, but also feasible, e.g. by minimizing the AIC of a model that is fit over all features/genes combined (as in htt p://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/vie wer.htm#statug_hpmixed_sect035.htm#statug.hpmixed.hpmeg4b<http: suppo="" rt.sas.com="" documentation="" cdl="" en="" statug="" 63033="" html="" default="" viewer.htm#s="" tatug_hpmixed_sect035.htm%23statug.hpmixed.hpmeg4b="">). Finally, an important advantage that I would see of using this approach is that it would readily generalize to RNAseq data if one would use generalized mixed models (glmer's) with a Poisson error structure (correcting for the total nr of mapped reads, etc and other important covariates) (or, alternatively, using a binomial error structure, to analyze the proportion of reads mapping to a particular gene). Overdispersion in the data in this approach could readily be taken into account by including an observation-level random factor in the model, cf. https://stat.ethz.ch/pipermail/r-sig-mixed- models/2011q1/015867.html. The advantage here again would be that complex designs, e.g. with multiple crossed random factors, could readily be taken into account (to my understanding, edgeR or DEseq do not allow for this, right?). Would anyone perhaps have any thoughts why not to go for such a generic approach? Would there perhaps be any market for a package to analyze differential expression in limma RGLimma and ESet data (or CountDataSet or DGEList objects for RNAseq data) using the approach I outline above? If there is enough interest, I would be keen to develop it, as it would seem pretty straightforward to do... Cheers, Tom Wenseleers ______________________________________________________________________ _________________ Prof. Tom Wenseleers * Lab. of Socioecology and Social Evolution Dept. of Biology Zoological Institute K.U.Leuven Naamsestraat 59, box 2466 B-3000 Leuven Belgium * +32 (0)16 32 39 64<tel:%2b32%20%280%2916%2032%2039%2064> / +32 (0)472 40 45 96<tel:%2b32%20%280%29472%2040%2045%2096> * tom.wenseleers@bio.kuleuven.be<mailto:tom.wenseleers@bio.kuleuven.be> http://bio.kuleuven.be/ento/wenseleers/twenseleers.htm ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:24}}
ADD REPLY
0
Entering edit mode
@tom-wenseleers-3741
Last seen 10.3 years ago
Dear Gordon, Many thanks for your message! Of course, I agree with you that mixed model theory as such is pretty complicated and still to some extent in development. But what I find easier is the specification part of the model, where all you have to do is specify the formula that captures the dependencies/sources of correlation in the data and the fixed factors plus possible interactions that you want to fit, as opposed to the design matrix in limma, which (at least I sometimes find) tricky to specify for more complex multifactorial designs (even with functions designMatrix() and contrast.fit()). Also, the new R packages lmerTest and lsmeans would make a mixed model approach much easier than before, as you can now easily get the desired tests and log2 ratio contrasts and lsmeans (plus confidence limits) for the different treatment levels out of the lme4 fit (as you could in SAS before already for a long time, but alas not in R). I am not saying that all this would be trivial, but it could be made quite easy with a dedicated package I believe. The method would indeed analyze two-colour designs as single-colour ones, and take into account the pairing of R and G labels on the same array using a random factor. The difference though would be that many sources of dependency/correlation could readily be taken into account, e.g. repeated use of the same sample due to dye swaps and technical replication, replicate spots on the same slide, repeated use of samples with the same genetic background, etc etc... Maanova does indeed allow for this approach to some extent, although I believe lme4 combined with lsmeans and lmerTest would be much faster and would allow for more complex designs (e.g. with hierarchically nested random factors, multiple crossed random factors, random slopes models, etc, the SAS example e.g., http://support.sas.com/documentation/cdl/en/stat ug/63033/HTML/default/viewer.htm#statug_hpmixed_sect035.htm, could not be fit using Maanova I believe, whereas I could do it no probs with lme4). And the possible generalization to RNAseq data I think would be very interesting. Maybe I'll start doing some simulations to systematically compare the statistical power of these different approaches for some different experimental designs (many thanks also for the Altman paper, in my case I was surprised to find though that I seemingly got a better statistical power even with a simple single- factor (2 group) 2-colour randomized block design analysed with a mixed model, for which the Altman paper claims that the power of the different approaches should be comparable). Oh yes, and was meaning RGList of course :-) Well, I'll keep you posted if I make some further progress with this - and let me know if you would be interested in co-developing or helping out in some way with such a package! Cheers, Tom -----Original Message----- From: Gordon K Smyth [mailto:smyth@wehi.EDU.AU] Sent: 30 June 2013 05:42 To: Tom Wenseleers Cc: Bioconductor mailing list Subject: Statistical power of limma vs mixed model approach to analyze microarray and differentialproteomics/peptidomics experiments and Poisson/binomial mixed models for RNAseq data Dear Tom, I must say that this is first time I have heard anyone suggest that a mixed linear model is easier or simpler than an ordinary linear model. Having used the lme4 package myself, the opposite seems true to me by a very long way. Do you know that mixed models have been proposed and implemented several times for microarray data? For example the maanova package or the lmscFit() separate channel approach of the limma package. A recent paper by Naomi Altman and myself gives a comparison of some of the approaches: http://www.biomedcentral.com/1471-2105/14/165 For some designs, a mixed model or separate approach can indeed recover more information than a classic log-ratio analysis. I don't fully follow the approach that you describe, but it sounds to have some similarities with the limma separate channel approach as well as with maanova. limma can be generalized to RNA-seq, see http://www.statsci.org/smyth/pubs/VoomPreprint.pdf limma does allow for a treatment of randomized block designs but, apart from this exception, it is certainly true that RNA-seq analysis methods do not allow random effect models. So there is a gap in the market if you are keen to fill it. There isn't any data object class called RGLimma that I know of. Do you mean RGList? Best wishes Gordon > Date: Sat, 29 Jun 2013 02:49:53 +0000 > From: Tom Wenseleers <tom.wenseleers at="" bio.kuleuven.be=""> > To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> > Subject: [BioC] Statistical power of limma vs mixed model approach to > analyze microarray and differential proteomics/peptidomics experiments > and Poisson/binomial mixed models for RNAseq data > > Dear all, > I was just analyzing some data from a differential (dual label LC/MS) > single factor peptidomics experiment using both a per-feature mixed > model approach (using lme4's lmer function, using a model of the form > intensity/abundance ~ (1|run)+label+treatment, and testing > significance and calculating least square means of the differential > expression ratios for the different treatments using lmerTest and > lsmeans, using a Satterthwaite df approximation) and an empirical > Bayes standard deviation shrinkage method as implemented in limma (by > putting my data in an RGLimma object). To my surprise, I am finding > more statistically significantly differentially expressed features > using the mixed model approach than using limma (independent of the > types of normalizations I use in limma, and I did check that the sd vs abundance was flat). > Before, I have also observed the same phenomenon in some microarray > datasets. > > I am wondering if this would point towards a mixed model approach > having superior statistical power than limma, even for this very simple design. > Could this be the case? I only did 8 replicate runs in this case > (corresponding to arrays in 2-color microarray exps), so not terribly > many... Could it be that limma has superior statistical power for the > analysis of relatively small designs with few experimental replicates > (where the SD shrinkage could be beneficial), but that for larger > number of replicates and also for complex designs (in particular when > there are multiple sources of dependencies in the data) a mixed model > approach would work better? > > Also, would a mixed model approach in general not be much easier to > specify (just requiring the model formula to be specified, as opposed > to all the contrasts, which becomes pretty tedious for complex > designs) and statistically be more flexible and powerful for > multifactorial experiments (e.g. easily allowing for treatment > interaction effects to be tested, and also allowing for multiple crossed random factors, e.g. > to take into account several dependencies in the data, e.g. owing to > spatial correlations in the microarray slides/print tip effects, other > random factors, e.g. due to the use of samples with multiple genetic > backgrounds or the presence of repeated measures factors, etc). For > repeated measures designs I would also think that mixed model fits > obtained using nlme could be even more flexible, e.g. allowing for > various custom error covariance structures (e.g. to take into account > temporal autocorrelations, etc); in fact, even lmer supports > unstructured covariance mode! ls (which would allow variances to be > different in your different groups, which I think could be quite > common). Model selection, on the other hand, would appear a little > more tricky, but also feasible, e.g. by minimizing the AIC of a model > that is fit over all features/genes combined (as in > http://support.sas.com/documentation/cdl/en/statug/63033/HTML/defaul t/viewer.htm#statug_hpmixed_sect035.htm#statug.hpmixed.hpmeg4b). > > Finally, an important advantage that I would see of using this > approach is that it would readily generalize to RNAseq data if one > would use generalized mixed models (glmer's) with a Poisson error > structure (correcting for the total nr of mapped reads, etc and other > important > covariates) (or, alternatively, using a binomial error structure, to > analyze the proportion of reads mapping to a particular gene). > Overdispersion in the data in this approach could readily be taken > into account by including an observation-level random factor in the > model, cf. > https://stat.ethz.ch/pipermail/r-sig-mixed- models/2011q1/015867.html. > The advantage here again would be that complex designs, e.g. with > multiple crossed random factors, could readily be taken into account > (to my understanding, edgeR or DEseq do not allow for this, right?). > Would anyone perhaps have any thoughts why not to go for such a > generic approach? > > Would there perhaps be any market for a package to analyze > differential expression in limma RGLimma and ESet data (or > CountDataSet or DGEList objects for RNAseq data) using the approach I > outline above? If there is enough interest, I would be keen to develop > it, as it would seem pretty straightforward to do... > > Cheers, > Tom Wenseleers > > > ______________________________________________________________________ > _________________ > > Prof. Tom Wenseleers > * Lab. of Socioecology and Social Evolution > Dept. of Biology > Zoological Institute > K.U.Leuven > Naamsestraat 59, box 2466 > B-3000 Leuven > Belgium > * +32 (0)16 32 39 64 / +32 (0)472 40 45 96 > * tom.wenseleers at bio.kuleuven.be > http://bio.kuleuven.be/ento/wenseleers/twenseleers.htm > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD COMMENT
0
Entering edit mode

what I find easier is the specification part of the model, where all you have to do is specify the formula that captures the dependencies/sources of correlation in the data and the fixed factors plus possible interactions that you want to fit,

You can use the same model formula (for fixed factors) with limma. This is true even for a two colour design, provided you use lmscFit() instead of lmFit().

as opposed to the design matrix in limma, which (at least I sometimes find) tricky to specify for more complex multifactorial designs (even with functions designMatrix() and contrast.fit()).

The designMatrix() function hasn't existed for nine years. (It was renamed to modelMatrix.) Are you using an up-to-date version of limma?

Maybe I'll start doing some simulations to systematically compare the statistical power of these different approaches for some different experimental designs (many thanks also for the Altman paper, in my case I was surprised to find though that I seemingly got a better statistical power even with a simple single-factor (2 group) 2-colour randomized block design analysed with a mixed model, for which the Altman paper claims that the power of the different approaches should be comparable).

Is your simple design a direct comparison of A vs B on the same arrays, without any common reference? If so, then I don't think that there is any extra information in the data that could be extracted by a mixed model. So, if the mixed model is giving a lot more DE genes, I would worry that it is not holding its size correctly. It would be a good idea to do some simulations to check that the mixed model method is controlling the type I error rate at the correct level.

Best wishes
Gordon

ADD REPLY
0
Entering edit mode
Dear Gordon, Many thanks for your further comments! I was aware of the model formula specification in limma, which is indeed very handy - but in my experiments I often have multiple random factors, that's the problem... And sorry meant modelMatrix of course, was looking at some old documentation for some reason... And yes, my A vs B's were on the same arrays - I'll go to do some simulations to check the type I error rate, as you suggest! There wasn't a huge difference with limma though in this case, so maybe it was just a matter of chance that I found more DE features with a mixed model (though I did find large differences before for some complex designs with multiple random factors). On the other hand, I could well see that moderated t-tests might not completely equivalent to a mixed model approach, even if the amount of information they use would be equivalent (one being partly Bayesian in logic, and the other frequentist)... Could the difference perhaps arise from the fact that there are very large differences in the per-treatment group variance on the expression values for the different genes in my analysis? Cheers, Tom -----Original Message----- From: Gordon K Smyth [mailto:smyth@wehi.EDU.AU] Sent: 30 June 2013 11:18 To: Tom Wenseleers Cc: Bioconductor mailing list Subject: RE: Statistical power of limma vs mixed model approach to analyze microarray and differentialproteomics/peptidomics experiments and Poisson/binomial mixed models for RNAseq data On Sun, 30 Jun 2013, Tom Wenseleers wrote: > Dear Gordon, > Many thanks for your message! Of course, I agree with you that mixed > model theory as such is pretty complicated and still to some extent in > development. But what I find easier is the specification part of the > model, where all you have to do is specify the formula that captures > the dependencies/sources of correlation in the data and the fixed > factors plus possible interactions that you want to fit, You can use the same model formula (for fixed factors) with limma. This is true even for a two colour design, provided you use lmscFit() instead of lmFit(). > as opposed to the design matrix in limma, which (at least I sometimes > find) tricky to specify for more complex multifactorial designs (even > with functions designMatrix() and contrast.fit()). The designMatrix() function hasn't existed for nine years. (It was renamed to modelMatrix.) Are you using an up-to-date version of limma? > Also, the new R packages lmerTest and lsmeans would make a mixed model > approach much easier than before, as you can now easily get the > desired tests and log2 ratio contrasts and lsmeans (plus confidence > limits) for the different treatment levels out of the lme4 fit (as you > could in SAS before already for a long time, but alas not in R). I am > not saying that all this would be trivial, but it could be made quite > easy with a dedicated package I believe. > > The method would indeed analyze two-colour designs as single-colour > ones, and take into account the pairing of R and G labels on the same > array using a random factor. The difference though would be that many > sources of dependency/correlation could readily be taken into account, > e.g. repeated use of the same sample due to dye swaps and technical > replication, replicate spots on the same slide, repeated use of > samples with the same genetic background, etc etc... Maanova does > indeed allow for this approach to some extent, although I believe lme4 > combined with lsmeans and lmerTest would be much faster and would > allow for more complex designs (e.g. with hierarchically nested random > factors, multiple crossed random factors, random slopes models, etc, > the SAS example e.g., > http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/ > viewer.htm#statug_hpmixed_sect035.htm, > could not be fit using Maanova I believe, whereas I could do it no > probs with lme4). And the possible generalization to RNAseq data I > think would be very interesting. Maybe I'll start doing some > simulations to systematically compare the statistical power of these > different approaches for some different experimental designs (many > thanks also for the Altman paper, in my case I was surprised to find > though that I seemingly got a better statistical power even with a > simple single-factor (2 group) 2-colour randomized block design > analysed with a mixed model, for which the Altman paper claims that > the power of the different approaches should be comparable). Is your simple design a direct comparison of A vs B on the same arrays, without any common reference? If so, then I don't think that there is any extra information in the data that could be extracted by a mixed model. So, if the mixed model is giving a lot more DE genes, I would worry that it is not holding its size correctly. It would be a good idea to do some simulations to check that the mixed model method is controlling the type I error rate at the correct level. Best wishes Gordon > Oh yes, and was meaning RGList of course :-) > > Well, I'll keep you posted if I make some further progress with this - > and let me know if you would be interested in co-developing or helping > out in some way with such a package! > > Cheers, > Tom > > -----Original Message----- > From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU] > Sent: 30 June 2013 05:42 > To: Tom Wenseleers > Cc: Bioconductor mailing list > Subject: Statistical power of limma vs mixed model approach to analyze > microarray and differentialproteomics/peptidomics experiments and > Poisson/binomial mixed models for RNAseq data > > Dear Tom, > > I must say that this is first time I have heard anyone suggest that a > mixed linear model is easier or simpler than an ordinary linear model. > Having used the lme4 package myself, the opposite seems true to me by > a very long way. > > Do you know that mixed models have been proposed and implemented > several times for microarray data? For example the maanova package or > the > lmscFit() separate channel approach of the limma package. A recent > paper by Naomi Altman and myself gives a comparison of some of the > approaches: > > http://www.biomedcentral.com/1471-2105/14/165 > > For some designs, a mixed model or separate approach can indeed > recover more information than a classic log-ratio analysis. I don't > fully follow the approach that you describe, but it sounds to have > some similarities with the limma separate channel approach as well as > with maanova. > > limma can be generalized to RNA-seq, see > > http://www.statsci.org/smyth/pubs/VoomPreprint.pdf > > limma does allow for a treatment of randomized block designs but, > apart from this exception, it is certainly true that RNA-seq analysis > methods do not allow random effect models. So there is a gap in the > market if you are keen to fill it. > > There isn't any data object class called RGLimma that I know of. Do > you mean RGList? > > Best wishes > Gordon > >> Date: Sat, 29 Jun 2013 02:49:53 +0000 >> From: Tom Wenseleers <tom.wenseleers at="" bio.kuleuven.be=""> >> To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> >> Subject: [BioC] Statistical power of limma vs mixed model approach to >> analyze microarray and differential proteomics/peptidomics experiments >> and Poisson/binomial mixed models for RNAseq data >> >> Dear all, > >> I was just analyzing some data from a differential (dual label LC/MS) >> single factor peptidomics experiment using both a per-feature mixed >> model approach (using lme4's lmer function, using a model of the form >> intensity/abundance ~ (1|run)+label+treatment, and testing >> significance and calculating least square means of the differential >> expression ratios for the different treatments using lmerTest and >> lsmeans, using a Satterthwaite df approximation) and an empirical >> Bayes standard deviation shrinkage method as implemented in limma (by >> putting my data in an RGLimma object). To my surprise, I am finding >> more statistically significantly differentially expressed features >> using the mixed model approach than using limma (independent of the >> types of normalizations I use in limma, and I did check that the sd vs abundance was flat). >> Before, I have also observed the same phenomenon in some microarray >> datasets. >> >> I am wondering if this would point towards a mixed model approach >> having superior statistical power than limma, even for this very simple design. >> Could this be the case? I only did 8 replicate runs in this case >> (corresponding to arrays in 2-color microarray exps), so not terribly >> many... Could it be that limma has superior statistical power for the >> analysis of relatively small designs with few experimental replicates >> (where the SD shrinkage could be beneficial), but that for larger >> number of replicates and also for complex designs (in particular when >> there are multiple sources of dependencies in the data) a mixed model >> approach would work better? >> >> Also, would a mixed model approach in general not be much easier to >> specify (just requiring the model formula to be specified, as opposed >> to all the contrasts, which becomes pretty tedious for complex >> designs) and statistically be more flexible and powerful for >> multifactorial experiments (e.g. easily allowing for treatment >> interaction effects to be tested, and also allowing for multiple crossed random factors, e.g. >> to take into account several dependencies in the data, e.g. owing to >> spatial correlations in the microarray slides/print tip effects, >> other random factors, e.g. due to the use of samples with multiple >> genetic backgrounds or the presence of repeated measures factors, >> etc). For repeated measures designs I would also think that mixed >> model fits obtained using nlme could be even more flexible, e.g. >> allowing for various custom error covariance structures (e.g. to take >> into account temporal autocorrelations, etc); in fact, even lmer >> supports unstructured covariance mode! ls (which would allow >> variances to be different in your different groups, which I think >> could be quite common). Model selection, on the other hand, would >> appear a little more tricky, but also feasible, e.g. by minimizing >> the AIC of a model that is fit over all features/genes combined (as >> in http://support.sas.com/documentation/cdl/en/statug/63033/HTML/de fault/viewer.htm#statug_hpmixed_sect035.htm#statug.hpmixed.hpmeg4b). >> >> Finally, an important advantage that I would see of using this >> approach is that it would readily generalize to RNAseq data if one >> would use generalized mixed models (glmer's) with a Poisson error >> structure (correcting for the total nr of mapped reads, etc and other >> important >> covariates) (or, alternatively, using a binomial error structure, to >> analyze the proportion of reads mapping to a particular gene). >> Overdispersion in the data in this approach could readily be taken >> into account by including an observation-level random factor in the >> model, cf. >> https://stat.ethz.ch/pipermail/r-sig-mixed- models/2011q1/015867.html. >> The advantage here again would be that complex designs, e.g. with >> multiple crossed random factors, could readily be taken into account >> (to my understanding, edgeR or DEseq do not allow for this, right?). >> Would anyone perhaps have any thoughts why not to go for such a >> generic approach? >> >> Would there perhaps be any market for a package to analyze >> differential expression in limma RGLimma and ESet data (or >> CountDataSet or DGEList objects for RNAseq data) using the approach I >> outline above? If there is enough interest, I would be keen to >> develop it, as it would seem pretty straightforward to do... >> >> Cheers, >> Tom Wenseleers >> >> >> _____________________________________________________________________ >> _ >> _________________ >> >> Prof. Tom Wenseleers >> * Lab. of Socioecology and Social Evolution >> Dept. of Biology >> Zoological Institute >> K.U.Leuven >> Naamsestraat 59, box 2466 >> B-3000 Leuven >> Belgium >> * +32 (0)16 32 39 64 / +32 (0)472 40 45 96 >> * tom.wenseleers at bio.kuleuven.be >> http://bio.kuleuven.be/ento/wenseleers/twenseleers.htm ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD REPLY

Login before adding your answer.

Traffic: 847 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6