Your thoughts on Limma Array Weights?
Paul Leo ▴ 970
@paul-leo-2092
Last seen 9.6 years ago
I use limma quite a bit but have not really been using arrayWeights much until recently. I like it a lot, but have found in some cases that it appears better to just ditch the very poorly performing arrays and then proceed without weighting. What are people's real-world experiences with arrayWeights? Are you using it routinely?

For example, my typical usage: a time series with biological triplicates.

> design
   t0hr t6hr t24hr t24p6hr
1     1    0     0       0
2     1    0     0       0
3     1    0     0       0
4     0    1     0       0
5     0    1     0       0
6     0    1     0       0
7     0    0     1       0
8     0    0     1       0
9     0    0     1       0
10    0    0     0       1
11    0    0     0       1
12    0    0     0       1

arrayw <- arrayWeights(selDataMatrix, design = design)
> arrayw
        1         2         3         4         5         6         7         8         9
1.6473168 1.2716081 1.5170375 1.0310794 1.1010048 1.2787543 0.8198722 0.7162097 2.3992850
       10        11        12
0.1744961 1.3821469 0.6379648

## Note array 10, which was an outlier in hierarchical clustering (though it
## was still more similar to its biological replicates than to any other
## arrays, based on genes where sd/mean > 0.1).

fit <- lmFit(selDataMatrix, design, weights = arrayw)

cont.matrix <- makeContrasts(
    tchange6hr    = "t6hr-t0hr",
    tchange24hr   = "t24hr-t0hr",
    tchange24p6hr = "t24p6hr-t0hr",
    diff0to6      = "t6hr-t0hr",
    diff6to24     = "t24hr-t6hr",
    diff24to24p6  = "t24p6hr-t24hr",
    levels = design)

fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)

which gives:

> sum(topTable(fit2, coef=1, adjust="fdr", number=5000)[, "B"] > 1)
[1] 2927
> sum(topTable(fit2, coef=2, adjust="fdr", number=6000)[, "B"] > 1)
[1] 5263
> sum(topTable(fit2, coef=3, adjust="fdr", number=5000)[, "B"] > 1)
[1] 2083
> sum(topTable(fit2, coef=4, adjust="fdr", number=5000)[, "B"] > 1)
[1] 2927
> sum(topTable(fit2, coef=5, adjust="fdr", number=5000)[, "B"] > 1)
[1] 2931
> sum(topTable(fit2, coef=6, adjust="fdr", number=5000)[, "B"] > 1)
[1] 3810

####################### AS OPPOSED TO THE TYPICAL:

fit <- lmFit(selDataMatrix, design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)

which gives:

> sum(topTable(fit2, coef=1, adjust="fdr", number=5000)[, "B"] > 1)
[1] 1725
> sum(topTable(fit2, coef=2, adjust="fdr", number=6000)[, "B"] > 1)
[1] 3438
> sum(topTable(fit2, coef=3, adjust="fdr", number=5000)[, "B"] > 1)
[1] 1512
> sum(topTable(fit2, coef=4, adjust="fdr", number=5000)[, "B"] > 1)
[1] 1725
> sum(topTable(fit2, coef=5, adjust="fdr", number=5000)[, "B"] > 1)
[1] 1605
> sum(topTable(fit2, coef=6, adjust="fdr", number=5000)[, "B"] > 1)
[1] 2610

Is more differential expression always better? I guess so, unless there are more false positives, right? I am slightly worried that using a linear model to assess array quality and produce the weights will naturally bias a method such as limma, which then uses a linear model again to determine differential expression. After trying a few different permutations (use weights; remove the "worst" arrays and redo without weights), I don't think this is a big concern, but I would welcome feedback from others and insight into how they are using this function.

Thanks,
Paul
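P.S. For comparison, the "ditch the poor array" variant I mention would look roughly like this. This is only a sketch (the names keep, fitDrop and fitDrop2 are just for illustration); here I drop array 10 on the basis of its low weight:

keep <- setdiff(1:12, 10)                 # drop array 10, the low-weight outlier
fitDrop  <- lmFit(selDataMatrix[, keep], design[keep, ])
fitDrop2 <- eBayes(contrasts.fit(fitDrop, cont.matrix))
sum(topTable(fitDrop2, coef=1, adjust="fdr", number=5000)[, "B"] > 1)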
Clustering limma • 2.9k views
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

Matt and Paul have already given nice summaries, so I will add only very brief comments.

  1. If most genes are not DE, then the design matrix for arrayWeights does not need to be as complex as for the final linear model. E.g., in the 2 vs 2 case you should usually use design=1 (an intercept-only design) to estimate the array weights; otherwise each array is compared only to its partner rather than to the other 3 arrays. A sketch appears after this list.

  2. I use array weights when there is some reason to expect variable array quality, otherwise I don't. E.g., RNA samples from human patients almost always vary in quality, so I almost always use array weights with human in vivo data. See, e.g., Ellis et al, Clinical Cancer Research, to appear 14 July 2008. If array quality plots suggest a problem, then I use weights. If RNA is plentiful, e.g. from cell lines or model organisms, and the arrays look good, then I don't use array weights.

  3. In some cases it is better to remove arrays altogether. However, Matt and I believe that people are generally too trigger-happy with this. Arrays have to be very bad indeed before they contain no useful information, so there is a wide range of situations in which better results can be had by weighting. In gross cases where an array is clearly bad or wrong and should be removed, you'll usually know it.

  4. The arrayWeights method is largely protected from double-dipping by the fact that each gene has negligible influence on the array weights.
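
A minimal sketch of point 1 for the 2 vs 2 case (assuming y is a normalized log-expression matrix with 4 arrays, 2 per group; all names here are illustrative):

library(limma)
## Estimate the weights against an intercept-only design, so that each
## array is compared to all the others, not only to its within-group partner.
aw <- arrayWeights(y, design = matrix(1, ncol(y), 1))
## Then fit the full two-group model using those weights.
group  <- factor(c("A", "A", "B", "B"))
design <- model.matrix(~ group)
fit <- eBayes(lmFit(y, design, weights = aw))
topTable(fit, coef = 2)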

Best wishes
Gordon

Matt Ritchie ▴ 460
@matt-ritchie-2048
Last seen 9.6 years ago
Dear Paul,

I have noticed cases where the results are 'better' (i.e. you get more extreme moderated t-statistics or log-odds) if you remove suspect arrays. In one recent example I recall, the experimenter eventually discovered that the genotype of a sample hybridised to one of their arrays was not what they originally thought, which meant that the linear model they were fitting was not right. Although the weight assigned to this array was small, removing it from the analysis altogether still produced better results. The array weights method cannot correct for these kinds of gross errors.

I usually take a try-it-and-see approach in my own analyses, similar to what you have done: run the analysis with equal weights, with array weights, or after removing any suspect arrays altogether, then look at the results of each to see which gives the most extreme statistics.
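Roughly, that comparison might look like the following. This is only a sketch (selDataMatrix, design and cont.matrix as in your code; bad, fitEq, fitW and fitD are names I've made up, with bad standing in for the index of a hypothetical suspect array):

fitEq <- eBayes(contrasts.fit(lmFit(selDataMatrix, design), cont.matrix))  # equal weights
aw    <- arrayWeights(selDataMatrix, design)
fitW  <- eBayes(contrasts.fit(lmFit(selDataMatrix, design, weights = aw),
                              cont.matrix))                                # array weights
bad   <- 10                                     # hypothetical suspect array
fitD  <- eBayes(contrasts.fit(lmFit(selDataMatrix[, -bad], design[-bad, ]),
                              cont.matrix))                                # array removed
## Compare, e.g., how many genes exceed a log-odds cutoff in each analysis:
sapply(list(equal = fitEq, weighted = fitW, dropped = fitD),
       function(f) sum(topTable(f, coef = 1,
                                number = nrow(selDataMatrix))[, "B"] > 1))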
Best wishes,

Matt
Hi Paul and Matt,

Have either of you compared situations with only a small number of arrays in a two-group comparison, e.g. 2 vs 2 or 3 vs 3, where you either throw one array away (due to QC) or just down-weight it? I commonly throw the poor array away, but one data set that I'm currently working on only has 2 vs 2 (read: "pilot experiment"), and when you remove an array it becomes 2 vs 1, which is not much fun.

cheers,
Mark
Hi Mark,

The example I showed used biological triplicates, and there I decided to use the weights rather than remove the strange array. In a previous experiment I removed the "strange" array and did not use weights, but I did not elaborate: in that latter case there was some biology at work. The cells used in that experiment were all grown from tissues and then passed through a process of differentiation, so I had reason to believe the strange array was different due to biology (state of differentiation, perhaps) rather than some other, more technical, effect (the array and RNA otherwise appeared fine in other QC measurements). With that additional motivation I decided to drop it, as it appeared to be a true biological outlier, and no magical weighting can fix that. You never know for sure if you've made the right decision; you can only see whether what you get makes sense (use GOstats etc.).

The paper describing the arrayWeights method, http://www.biomedcentral.com/content/pdf/1471-2105-7-261.pdf, shows in simulations with 3 arrays per class that dropping a poor array is worse than just keeping it, and that the best performance comes from the weighting strategy. My impression from hearing Gordon Smyth talk is that his philosophy is to keep the arrays you have and weight them appropriately, not to remove them. Whether he always weights them in his own work I don't know; it's probably in the grey judgment-call zone, like everything else. The paper also shows the same results for a simple t-test, arguing that there is no strange double-dipping in using a linear model to get weights and then a linear model to get differential expression (though of course neither is just a simple linear model).

So, along the same lines: unless there is something particularly glaring, keep your 2 arrays per class and perhaps try the weighting (sketch below), and gently remind your investigators (using bolt-cutters and a blow-torch) how difficult it is to get a standard deviation from 2 measurements...

Best of luck,
Paul
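P.S. If you do try the weighting on your 2 vs 2 set, here is a minimal sketch of the two options (y2 stands in for your 4-array log-expression matrix, with the suspect array in column 4; all names illustrative):

library(limma)
group  <- factor(c("ctrl", "ctrl", "trt", "trt"))
design <- model.matrix(~ group)
## Option 1: keep all four arrays and let the weights discount the suspect one.
aw   <- arrayWeights(y2, design)
fitW <- eBayes(lmFit(y2, design, weights = aw))
## Option 2: drop the suspect array and fit 2 vs 1 (the simulations in the
## paper above suggest this is usually worse).
fitD <- eBayes(lmFit(y2[, -4], design[-4, ]))
topTable(fitW, coef = 2)
topTable(fitD, coef = 2)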
