limma eBayes: how to determine goodness of fit?

1

Entering edit mode

Mark Robinson ★ 1.1k

@mark-robinson-2171

Last seen 9.6 years ago

Hi Paul. Some are in "MArrayLM" object, others require a couple commands. For example, using the data example in lmFit help: ------------ sd <- 0.3*sqrt(4/rchisq(100,df=4)) y <- matrix(rnorm(100*6,sd=sd),100,6) rownames(y) <- paste("Gene",1:100) y[1:2,4:6] <- y[1:2,4:6] + 2 design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1)) # Ordinary fit fit <- lmFit(y,design) fit <- eBayes(fit) ------------ you get recreate what you want in the linear model by the following 1) Residual standard error > lm.first<-lm(y[1,]~-1+design) > sqrt(anova(lm.first)["Residuals","Mean Sq"]) [1] 0.2990779 > fit$sigma[1] Gene 1 0.2990779 2) R-Squared > sst<-rowSums(y^2) > ssr<-sst-fit$df.residual*fit$sigma^2 > (ssr/sst)[1] Gene 1 0.964451 > summary(lm.first)$r.squared [1] 0.964451 ... 4) The F-statistic will change since the variances are moderated, causing both the statistic to change and the degrees of freedom to change. Cheers, Mark On 01/06/2007, at 5:38 AM, Paul Shannon wrote: > A summary of an lm result includes some readily-understood > goodness of fit information: > > 1) Residual standard error > 2) Multiple R-Squared > 3) Adjusted R-Squared > 4) F-statistic > > With limma, and eBayes, I deduce (incorrectly?) that efit$F and efit > $F.p.value > convey information similar to number 4 above. How about the first > three > measures -- is there any way to get equivalent information for the > linear > model eFit produces? > > Thanks! > > - Paul

limma limma • 4.4k views

ADD COMMENT • link updated 16.9 years ago by Paul Shannon ★ 1.1k • written 16.9 years ago by Mark Robinson ★ 1.1k

0

Entering edit mode

Paul Shannon ★ 1.1k

@paul-shannon-578

Last seen 9.6 years ago

Hi Mark, Thanks for the very helpful explanation and -- especially -- the example code. Two questions remain: 1) You say, "The F-statistic will change since the variances are moderated, causing both the statistic to change and the degrees of freedom to change." I am afraid I don't understand this. In what circumstances does the F-statistic change? 2) I am interested in these goodness-of-fit measures because I want to search through my data (and the eBayes output) to find genes whose behavior is very nicely modeled by different coefficients, like this a) devise a variety of model formualae based on biological intuition and examinations of the data b) fit those models, one at a time c) identify different subsets of genes based on their fit to each formula I remain a bit puzzled by the absence of these statistics from the eBayes output. Does their absence suggest that my 3-step procedure (2a-c, above) is not common practice? And would _that_ suggest that my 3-step procedure is not such a great idea? Are there better, more commonly used ways in limma to look for particular effects and influences of the experimental factors? I'll be grateful for any comments or suggestions. Cheers, - Paul

ADD COMMENT • link 16.9 years ago Paul Shannon ★ 1.1k

0

Entering edit mode

Hi Paul. > 1) You say, "The F-statistic will change since the variances are > moderated, > causing both the statistic to change and the degrees of freedom to > change." > I am afraid I don't understand this. In what circumstances does the > F-statistic > change? What I mean is that the F statistic you get from limma is moderated and is different from the *classical* F test (which is what you'd get from the 'lm' or 'anova' commands in R). Have a read of Gordon's paper for the details: http://www.bepress.com/sagmb/vol3/iss1/art3/ > 2) I am interested in these goodness-of-fit measures because I want to > search > through my data (and the eBayes output) to find genes whose behavior is > very nicely modeled by different coefficients, like this > > a) devise a variety of model formualae based on biological > intuition > and examinations of the data > b) fit those models, one at a time > c) identify different subsets of genes based on their fit to each > formula Can you give an example of what you mean here? I'm not sure what "Very nicely modeled by different coefficients" and "variety of model formulae based on biological intuition" are really referring to. Do you have a designed experiment where you wish to look for differences between, say treatments A, B, C, etc.? > I remain a bit puzzled by the absence of these statistics from the eBayes > output. Does > their absence suggest that my 3-step procedure (2a-c, above) is not common > practice? And > would _that_ suggest that my 3-step procedure is not such a great idea? > Are there better, > more commonly used ways in limma to look for particular effects and > influences > of the experimental factors? Common practice would be to make contrasts to the effect you want to assess the significance of (if I understand what you are asking). There are several examples in the limma user's guide: http://bioconductor.org/packages/2.0/bioc/html/limma.html Cheers, Mark

ADD REPLY • link 16.9 years ago Mark Robinson ★ 1.1k

0

Entering edit mode

Paul Shannon ★ 1.1k

@paul-shannon-578

Last seen 9.6 years ago

Hi Mark, Thanks very much for your reply of a couple of days ago. You kindly ask: > Can you give an example of what you mean here? I'm not sure what "Very > nicely modeled by different coefficients" and "variety of model formulae > based on biological intuition" are really referring to. Do you have a > designed experiment where you wish to look for differences between, say > treatments A, B, C, etc.? In brief: with lm (which I turned to at Gordon's suggestion) I can trawl my data for specific expression patterns, using different model formulae and stringent goodness-of-fit measures. But I am stumped when I try to do the same with limma. I'd like to benefit from the refinements offered by limma, eBayes, in particular. Here's the longer story. The experiment is a direct 2-color design. There are 9 reference samples (call them J1-J9) and 3 'signal' samples (M1-M3). A brief description of the actual study may be helpful: Malaria parasites exhibit a 'binding phenotype' when infecting primigravida (women pregnant for the first time). The parasites attach to CSA exposed in the placental lining and cause lots of harm to mother and fetus. We hope to elucidate the mechanisms of that binding by finding genes strongly up-regulated in primigravida. So our basic question is: what genes behave markedly differently in primigravid parasites when compared to juvenile parasites? The 28 microarray slides consist of 14 dye-swap pairs, for a total of 28 slides: M J --- ---------- 1 1 2 3 4 5 2 1 3 5 6 7 9 3 4 7 8 My simplest linear model treats every M/J comparison identically. And in this simple comparison we found two genes (call them gene1 and gene2), each with a logFC ratio ~= 6, and with very small residuals. A great signal: these genes were uniformly up-regulated in all comparisons. I then examined genes a little further down in the limma topTable results. I found a gene3 whose statistics weren't bad (adj.P.Val=e-7, B=7.9), but a plot showed a very interesting pattern: like gene1 and gene2, in had a great signal in all comparisons except those involving M3. This doesn't seem like random variation; this seems like another gene which might be contributing (in 2 primigravida out of 3) to the binding phenotype. So here's where my biological intuition come in: multiple mechanisms may be involved in CSA binding. gene1 and gene2 are perhaps always involved. Parasites from M3 do not seem to require gene3 to bind to the placenta, but maybe M1 and M2 do. My evidence so far is scant, but I wish to explore the data to test the hypothesis. I followed Gordon suggestion and used bare-bones lm () to model each row in my MA object. I was happy to see that I found small residuals, and a high R-squared when I modeled gene3 like this: gene3.lm <- lm (ratio ~ 1 + M3) My simpler model for gene1 and gene2 would be simply gene12.lm <- lm (ratio ~ 1) Now I want to explore all the genes in the 28 comparisons to find (possibly) other striking and biologically suggestive behavior. Perhaps there are good R-squared values for lm (ratio ~ 1 + M1) or lm (ratio ~ 1 + M2) which might (if we're lucky) reveal other pieces of the multiple, conditional contributors to the binding phenotype. I'd like to do this sort of exploring in limma itself, rather than stick with what is, for me, the more transparent but less rich lm. I sense that a judicious use of makeContrasts, contrast.fit, and decideTests may allow me to do this kind of exploration. But I pore over the manual, experiment for hours, and (it's embarrassing, but true) end up confused and feeling rather a dunce. Got any advice? Cheers! - Paul

ADD COMMENT • link 16.9 years ago Paul Shannon ★ 1.1k

0

Entering edit mode

Hello, Has anybody used the Windows XP Professional x64 Edition http://www.microsoft.com/windowsxp/64bit/overview.mspx to run the BioConductor packages? If so are there any configurations that one must be concerned about? Any advice is appreciated. Thanks, Sandhya

ADD REPLY • link 16.9 years ago Xirasagar, Sandhya ▴ 20

0

Entering edit mode

Hi Sandhya, We've used R/bioconductor on XP 64bit and things work just fine. Keep in mind that this will be a 32bit version of R. There are currently no tools that can compile a 64 bit version of R on windows. This means you will not be able to take advantage of the speed and available memory increase of a 64-bit machine. Also, please do not use an existing message to reply to when posting on the list. Some email clients treat that as an actual reply and it also messes up the list archives. You should create a new message instead. On Tue, 2007-06-05 at 16:36 -0400, Xirasagar, Sandhya wrote: > Hello, > > Has anybody used the Windows XP Professional x64 Edition > http://www.microsoft.com/windowsxp/64bit/overview.mspx > to run the BioConductor packages? > > If so are there any configurations that one must be concerned about? Any > advice is appreciated. > > Thanks, > > Sandhya > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 16.9 years ago Francois Pepin ★ 1.3k

0

Entering edit mode

Thanks for this helpful information and also advice on not using an existing message to reply to post a message. I was not aware of the repercussions. Sandhya -----Original Message----- From: Francois Pepin [mailto:fpepin@cs.mcgill.ca] Sent: Wednesday, June 06, 2007 10:46 AM To: Xirasagar, Sandhya Cc: bioconductor at stat.math.ethz.ch Subject: Re: [BioC] Windows XP Professional x64 Edition Hi Sandhya, We've used R/bioconductor on XP 64bit and things work just fine. Keep in mind that this will be a 32bit version of R. There are currently no tools that can compile a 64 bit version of R on windows. This means you will not be able to take advantage of the speed and available memory increase of a 64-bit machine. Also, please do not use an existing message to reply to when posting on the list. Some email clients treat that as an actual reply and it also messes up the list archives. You should create a new message instead. On Tue, 2007-06-05 at 16:36 -0400, Xirasagar, Sandhya wrote: > Hello, > > Has anybody used the Windows XP Professional x64 Edition > http://www.microsoft.com/windowsxp/64bit/overview.mspx > to run the BioConductor packages? > > If so are there any configurations that one must be concerned about? Any > advice is appreciated. > > Thanks, > > Sandhya > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 16.9 years ago Xirasagar, Sandhya ▴ 20

0

Entering edit mode

Hi Paul. Thanks for the explanation, I see what you are doing. I would call what you are doing 'model selection' and its true, limma doesn't do that. But, you can probably fit a 'full' model and just look for the differences that you describe, all within limma. For example, you could fit a model: ~ -1 + M1 + M2 + M3 where M1-M3 are binary indicators of the signal samples. So this fits a model with no intercept and a different mean for each 'signal' sample. If I understand your problem correctly, you are looking for genes that are non-zero in each of coefficients for M1-M3 (like you gene1 and gene2). But, you are also interested in genes which have non-zero in M1,M2 and maybe not so in M3 (your gene3). These are just contrasts. So, you should be able to look for everything you are interested in, by constructing contrasts on M1-M3. Alternatively, you could fit all the possible models you are interested in and filter all the topTables. There are not that many. Just one other note .... > I was happy to see that I found small residuals, and a high R- > squared when I modeled gene3 > like this: I think you'll find that small residuals (or at least small relative to the signal) and high R-squares correspond to large (in magnitude) t-statistic or large Fs. So, everything you need is in limma. Hope that helps. Cheers, Mark

ADD REPLY • link 16.9 years ago Mark Robinson ★ 1.1k

Login before adding your answer.