limma question: direct two-color design & modeling individual subject effects
@gordon-smyth
Last seen 27 minutes ago
WEHI, Melbourne, Australia
Dear Paul,

Your description of the limma model you've fitted is very clear, but you haven't explained exactly what is in your picture. The data values on the y-axis don't appear to be the M-values you used to fit the linear model, because we don't see the up-down pattern we'd expect to see from dye-swaps. How have you obtained "fitted values"? Note that M-values are already log-ratios, so it doesn't make sense to write "log2M".

lmFit() simply does least squares regression. It gives the same coefficients that you would get from lm() for each gene. I suggest that you extract the M-value data for one gene, and experiment with fitting the data using lm(), until you're satisfied that you understand the parametrization and fitted values.

Best wishes
Gordon

> [BioC] limma question: direct two-color design & modeling individual subject effects
> Paul Shannon  pshannon at systemsbiology.org
> Mon Apr 30 05:08:15 CEST 2007
>
> I've been working on and off for a few months with limma on a set of 28 two-color
> arrays made up of 14 dye-swap pairs. The main contrast in the arrays is between
> malaria parasite RNA extracted from maternal and from juvenile hosts;
> all the arrays can be described in these terms. This is the main effect we
> are studying, and limma is very helpful in elucidating it.
>
> The arrays can be more specifically described as comparisons between specific
> maternal subjects and specific juvenile subjects -- between different
> combinations of three mothers (m918, m836, m920) and seven children (c073, c135,
> c140, c372, c451, c413, c425). I have trouble fitting models to some of these
> genes, failing to isolate the effects of individual subjects where their effects
> seem to be strong.
>
> (A good example can be seen at http://gaggle.systemsbiology.net/pshannon/tmp/7346.png,
> where the effect of m920 is pronounced, but apparently missed by my lmFit/eBayes model.)
>
> Here are a few lines from each of the matrices I use that lead to that plot.
>
> ---- head (targets)
>
>   SlideNumber      Name            FileName      Cy3      Cy5 Mother Child
> 1        2254 slide2254 m918c073-cy3cy5.gpr maternal juvenile   m918  c073
> 2        2261 slide2261 m918c073-cy5cy3.gpr juvenile maternal   m918  c073
> 3        2258 slide2258 m836c073-cy3cy5.gpr maternal juvenile   m836  c073
> 4        2265 slide2265 m836c073-cy5cy3.gpr juvenile maternal   m836  c073
> 5        2341 slide2341 m836c135-cy3cy5.gpr maternal juvenile   m836  c135
> 6        2344 slide2344 m836c135-cy5cy3.gpr juvenile maternal   m836  c135
>
> ----- head (design)
>
>   mother child maternal
> 1   m918  c073      Low
> 2   m918  c073     High
> 3   m836  c073      Low
> 4   m836  c073     High
> 5   m836  c135      Low
> 6   m836  c135     High
>
> ---- create the model
>
> model <- model.matrix (~maternal + mother + child, design)
>
> head (model)
>   (Intercept) maternalHigh motherm918 motherm920 childc135 childc140 childc372 childc413 childc425 childc451
> 1           1            0          1          0         0         0         0         0         0         0
> 2           1            1          1          0         0         0         0         0         0         0
> 3           1            0          0          0         0         0         0         0         0         0
> 4           1            1          0          0         0         0         0         0         0         0
> 5           1            0          0          0         1         0         0         0         0         0
> 6           1            1          0          0         1         0         0         0         0         0
>
> ---- fit the data
>
> fit <- lmFit (MA, model)
> efit <- eBayes (fit)
>
> # One example of a poor fit: with probe 7346, the m920 effect is very strong, but the
> # coefficients don't reflect that. Instead, most of the influence is allocated to the
> # maternal effect, which nicely models all the comparisons except those involving m920.
> # The fit there is strikingly poor, with high residuals. I can't make sense of the
> # tiny motherm920 coefficient:
>
> efit$coef [7346,]
>  (Intercept) maternalHigh   motherm918   motherm920    childc135    childc140    childc372    childc413    childc425    childc451
>  -3.62867124   7.49268173   0.24858455  -0.02635289  -0.67898282  -0.24566235  -0.24673763   0.10618603  -0.37520911  -0.02761610
>
> The plot of the fitted & actual values can be found at
>
> http://gaggle.systemsbiology.net/pshannon/tmp/7346.png
>
> I may be over-interpreting, or mis-interpreting, or even misrepresenting all this. But after
> lots of head scratching, lots of reading and experiments, I can't get the coefficients to do
> what I think they should. Perhaps it's my failure to use a contrast matrix. Or something else.
>
> Any suggestions? I'll be really grateful for any advice.
>
> Thanks!
>
>  - Paul
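For concreteness, here is a minimal sketch of the single-gene lm() experiment Gordon suggests, assuming the MA and model objects from the quoted post, no spot weights, and the probe index 7346 taken from the thread:

    ## Minimal sketch, assuming 'MA' (an MAList) and 'model' exist as in the post above.
    library(limma)

    g <- 7346                     # the probe discussed in this thread
    y <- as.vector(MA$M[g, ])     # M-values for this one probe across all arrays

    ## lm() with the same design matrix; '0 +' suppresses the automatic intercept
    ## because 'model' already contains an (Intercept) column.
    one.gene <- lm(y ~ 0 + model)

    coef(one.gene)       # should match lmFit(MA, model)$coefficients[g, ]
    fitted(one.gene)     # per-array fitted values to compare against y
    residuals(one.gene)  # large residuals mark arrays the model explains poorly

Once the coefficients and fitted values from lm() make sense for a handful of probes, the same parametrization carries over to the full lmFit() fit.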
SystemsBiology Regression probe limma • 1.2k views
Lev Soinov ▴ 470
@lev-soinov-2119
Last seen 9.6 years ago
An embedded and charset-unspecified text was scrubbed by the mailing-list archive.
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070501/0b9064db/attachment.pl
(Lev's message is quoted in full in the reply below.)
Hi Lev,

Lev Soinov wrote:
> Dear List, It was a lot of e-mails from me recently for which I
> apologise and hope that my questions were of interest to you. I would
> be very much interested in your views on the following. It is
> accepted that microarray measurements contain technical and
> biological variations and that for single channel (e.g. Affy)
> normalised log2 data biological variation is ~2-4 times higher than
> technical.

I'm not sure I would agree with that. The technical variability is highly correlated with the abilities of the lab techs who do the RNA extraction and the Affy molecular biology steps, and there is a huge range of abilities in the world of lab techs. In addition, biological variability is dependent on what system you are considering. If for instance you take some cell lines and pour bleach on one set and leave the other alone, then you are correct: the biological variability will likely be larger than the technical variability. On the other hand, I have seen plenty of experiments where the biological variability was much smaller than the technical variability, and as a result the number of 'significant' genes was less than you would expect by chance, given no difference between the sample types.

> Now, let's say we have N arrays representing N separate
> individuals (or animals) of a certain type (wild type or under the
> same treatment condition). Thus, for a given gene, variations in its
> expression level represent variations across individuals. Suppose
> that for each gene we calculate its mean expression level across all
> arrays (across population) and then in each array classify genes as
> expressed above/below their average across population levels. Now for
> each gene we have a binary profile (above/below) across population of
> N individuals. We can further perform say simple correlation analysis
> and find those genes which concurrently expressed above/below their
> population means within the given set of N arrays. This may provide
> some information on their connectivity, etc. What would you say about
> this? Is it a valid suggestion? Or the technical component of
> variation will not allow doing this? If it were relevant, what would
> be the sample size sufficient for finding accurate correlations? Do
> you know of any Bioconductor tools that may be applied here for
> sample size calculations? Looking forward to hearing your opinions,

I don't like the idea of reducing data to a binary distribution without a clear reason to do so. For instance, some classifiers work better if you can reduce the phenotypes to a binary outcome (e.g., good/bad survival outcome). However, by reducing to a binary distribution you are throwing out (possibly valuable) information, so I think it is reasonable to show why using the continuous distribution of data you started with isn't workable.

If you want to detect connectivity of genes, you might want to look at the work Steve Horvath has done:

http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive, 7410 CCGC
Ann Arbor MI 48109
734-647-5623
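For concreteness, the binarise-and-correlate step Lev describes might look roughly like the sketch below; the matrix expr and its dimensions are hypothetical stand-ins for a normalised log2 expression matrix with genes in rows and the N individuals in columns:

    ## Hypothetical sketch of the above/below-the-gene-mean idea discussed above.
    set.seed(1)
    expr <- matrix(rnorm(1000 * 20), nrow = 1000,
                   dimnames = list(paste0("gene", 1:1000), paste0("indiv", 1:20)))

    ## 1 if a gene is above its own mean across the population, 0 otherwise
    above <- (expr > rowMeans(expr)) * 1

    ## correlation of two genes' binary profiles across the N individuals
    cor(above["gene1", ], above["gene2", ])

    ## or all pairwise correlations at once (genes become columns after t())
    bin.cor <- cor(t(above))

The binary profiles make the correlations easy to compute, but as Jim notes above they discard the magnitude information present in the continuous data.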
Paul Shannon ★ 1.1k
@paul-shannon-578
Last seen 9.6 years ago
Hi Gordon,

Thanks very much for your reply. Your suggestion makes sense -- that I should explore the use of lm() on one gene's MA$M values, until I can fit the experimental factors I care about, and only then return to limma. When I do get that far, I'll probably be back with another question or two. :}

You asked a couple of questions about some confusing aspects of my plot at http://gaggle/pshannon/tmp/7346.png (which I have since revised):

> ... you haven't explained exactly what is in your picture. The
> data values on the y-axis don't appear to be the M-values you used
> to fit the linear model, because we don't see the up-down pattern
> we'd expect to see from dye-swaps.

I failed to explain that I multiplied all of the measured dye-swaps (cy5/cy3) by -1 so that their ratios would be easy to compare to the (cy3/cy5) ratios.

> How have you obtained "fitted values"?

I multiplied the model by the fitted coefficients, summed the resultant vector, and corrected for dye swap:

  # corrector = 1 or -1
  corrector * sum (model [slide,] * efit$coef [row,])

> Note that M-values are already log-ratios, so it doesn't make sense
> to write "log2M".

Good point...

Thanks again for your reply! I'd reached the end of my own resources; now I can get to work again.

Cheers,

 - Paul

> lmFit() simply does least squares regression. It gives the same coefficients that you would get
> from lm() for each gene. I suggest that you extract the M-value data for one gene, and experiment
> with fitting the data using lm(), until you're satisfied that you understand the parametrization
> and fitted values.
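As an aside, the per-slide sum above can be computed for every array at once with a single matrix product; a sketch, assuming the model and efit objects from earlier in the thread (the sign flip via corrector is Paul's plotting convention, and swapped below is a hypothetical logical vector marking the dye-swapped arrays):

    ## Sketch: fitted values for one probe across every array in one step.
    row <- 7346
    fitted.row <- as.vector(model %*% efit$coef[row, ])   # one value per array

    ## Paul's plotting convention flips the sign for dye-swapped arrays:
    ## plotted <- ifelse(swapped, -1, 1) * fitted.row      # 'swapped' is hypothetical

limma can also return these fitted values directly; see Gordon's note on fitted() further down the thread.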
Dear Paul,

It may not affect your plot much, but be aware that the model you fitted includes a probe-specific dye-effect (the intercept term), so that dye-swapping is not exactly equivalent to multiplying by -1 according to the model. This doesn't change the fact that the fitted values should follow the data values in your plot.

Best wishes
Gordon

At 09:44 AM 2/05/2007, Paul Shannon wrote:
> I failed to explain that I multiplied all of the measured dye-swaps
> (cy5/cy3) by -1 so that their ratios would be easy to compare to the
> (cy3/cy5) ratios.
>
> > How have you obtained "fitted values"?
>
> I multiplied the model by the fitted coefficients, summed the
> resultant vector, and corrected for dye swap:
>
>   # corrector = 1 or -1
>   corrector * sum (model [slide,] * efit$coef [row,])
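One hedged way to see the dye effect Gordon describes, assuming rows 1 and 2 of model are the m918/c073 dye-swap pair shown in head(targets) above:

    ## Sketch: compare the model's fitted values for one dye-swap pair of arrays.
    g <- 7346
    f.cy3cy5 <- sum(model[1, ] * efit$coef[g, ])   # maternal labelled with Cy3
    f.cy5cy3 <- sum(model[2, ] * efit$coef[g, ])   # the dye-swapped partner

    ## If dye-swapping were a pure sign flip, this sum would be zero;
    ## a non-zero sum reflects the probe-specific dye effect that the
    ## intercept term absorbs.
    f.cy3cy5 + f.cy5cy3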
At 09:44 AM 2/05/2007, Paul Shannon wrote:
> > How have you obtained "fitted values"?
>
> I multiplied the model by the fitted coefficients, summed the
> resultant vector, and corrected for dye swap:
>
>   # corrector = 1 or -1
>   corrector * sum (model [slide,] * efit$coef [row,])

Note that in limma

  fv <- fitted(fit)

also works. It gives a matrix of fitted values.

Best wishes
Gordon
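For example, a quick check of the fit for the probe discussed above might look like this (a sketch assuming the fit and MA objects from the thread):

    ## Sketch: compare limma's fitted values with the observed M-values.
    fv <- fitted(fit)            # genes x arrays matrix of fitted values

    g <- 7346
    plot(MA$M[g, ], fv[g, ],
         xlab = "observed M", ylab = "fitted M",
         main = "probe 7346: fitted vs observed")
    abline(0, 1, lty = 2)        # points far from this line are poorly fitted arrays

Points lying well off the identity line for the m920 arrays would correspond to the large residuals described earlier in the thread.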