missing values in limma/contrasts.fit
@gordon-smyth
Last seen 7 minutes ago
WEHI, Melbourne, Australia
Dear Albyn,

It is always helpful to give a code example, preferably something which can be run by others, but at least showing us the details of the situation. There isn't any way to confirm from your email that the issue you mention is the root cause of the problem. Your colleague's data example would have to be quite unusual for this to be true.

I have plenty of sympathy with researchers carefully examining the microarray image and marking scratches or other imperfections, but it is hard to imagine that they would flag tens of thousands of spots in this way and, if they did, it might be better to drop the array entirely.

lmFit() already computes exact standard errors.

In many large data problems, in which many contrasts might be experimented with, it is convenient to be able to compute a basic fit object and then to apply various contrasts post hoc. This was especially true in the early days of microarray analysis, with somewhat slower computers than today. This is the reason for the separate contrasts.fit() function.

I made an early design decision that fit objects could be of order ngenes*nparameters in size, but not of order ngenes*nparameters^2, so it was not possible to store genewise QR decompositions. This leads to the computational efficiency issue, which applies to contrasts.fit(), not to lmFit().

If a dataset has so many spot-specific weights of different sizes that the approximation involved in contrasts.fit() is not trustworthy, then it is simple enough to run lmFit() a couple of times with different parametrizations to get all the contrasts desired.

I have considered in the past adding a 'contrasts' argument to lmFit(), so that users could fit the basic regression and simultaneously extract all desired contrasts using the exact covariance matrices. (This is not quite as simple as you might think, because lmFit supports a number of different covariance structures, robust fits etc., so the changes have to be made in a number of low-level functions. And it changes the structure of the fit object which is passed on.) I can still do so if there is a demand for it.

Best wishes
Gordon

> Date: Mon, 14 Dec 2009 11:15:05 -0800
> From: Albyn Jones <jones at reed.edu>
> Subject: [BioC] missing values in limma/contrasts.fit
> To: bioconductor at stat.math.ethz.ch
>
> Dear BioConductor Folk
>
> The help file for contrasts.fit states:
>
>   "Warning. For efficiency reasons, this function does not
>   re-factorize the design matrix for each probe. A consequence is
>   that, if the design matrix is non-orthogonal and the original fit
>   included quality weights or missing values, then the unscaled
>   standard deviations produced by this function are approximate
>   rather than exact. The approximation is usually acceptable...."
>
> My attention was attracted to the statement when a colleague in
> biology asked me why one would get different sets of probes identified
> as differentially expressed, depending on which individual or
> biological sample was selected as the reference in a balanced loop
> design.
>
> My experience, admittedly limited, suggests that the computational
> efficiency gain is not worth the loss of accuracy. Even if one has to
> sacrifice the efficiency of a single pass through the raw data, at
> least one gets correct results. I have hacked a version of lmFit to
> evaluate contrasts with standard errors based on the exact covariance
> matrix. It runs essentially as quickly as lmFit, so I find the
> efficiency argument uncompelling.
>
> A search of the archive produced several discussions of missing values
> in limma. The main argument I see is Gordon Smyth's (Date: 2008-03-08):
>
>   "The ideal solution is not to introduce missing values into your
>   data in the first place. In my experience, missing values are
>   almost always avoidable. I have never seen a situation where it
>   was necessary or desirable to introduce a large proportion of
>   missing values."
>
> My colleagues in biology report that they inspect their arrays
> visually and note probes which have been scratched, probes covered by
> background blobs and the like. These categories seem to satisfy the
> missing-at-random criterion: the probe is marked NA not because it is
> saturated or below background, but because it was unreadable for
> reasons unrelated to the response.
>
> I'd appreciate feedback: has anyone else already done this? Would
> others find this useful? Are there objections I have overlooked?
>
> albyn
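A minimal sketch of the two routes described in the reply above. None of this code comes from the thread: the expression matrix y, the weight matrix w and the factor group are simulated placeholders, and the designs are only an illustration of the general idea (apply contrasts post hoc with contrasts.fit(), or refit with a parametrization in which the comparison of interest is a coefficient so that lmFit() reports its exact standard error).

## Not from the original thread: simulated data standing in for a real experiment.
library(limma)

set.seed(1)
ngenes <- 100
group  <- factor(c("A", "A", "B", "B", "C", "C"))
y <- matrix(rnorm(ngenes * 6), ngenes, 6)          # simulated log-expression values
y[sample(length(y), 20)] <- NA                     # a few spots flagged as missing
w <- matrix(runif(length(y), 0.5, 1), ngenes, 6)   # hypothetical spot quality weights

design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

## Route 1: a single lmFit() followed by contrasts.fit().
## When the design is non-orthogonal and the fit involves weights or missing
## values, the unscaled standard deviations from contrasts.fit() are
## approximate, as the help-page warning quoted above says.
fit  <- lmFit(y, design, weights = w)
cm   <- makeContrasts(BvsA = B - A, CvsB = C - B, levels = design)
fit2 <- eBayes(contrasts.fit(fit, cm))
topTable(fit2, coef = "BvsA")

## Route 2: rerun lmFit() with a parametrization in which the comparison of
## interest is itself a coefficient, so its standard error is exact.
designA <- model.matrix(~ group)        # columns: (Intercept), groupB, groupC
fitA <- eBayes(lmFit(y, designA, weights = w))
topTable(fitA, coef = "groupB")         # exact standard errors for B vs A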
Microarray Regression probe limma • 888 views
Albyn Jones ▴ 70
@albyn-jones-3850
Last seen 9.6 years ago
Gordon,

Thanks for the thoughtful response. I should have included an example; I will put a small one together soon. I am enmeshed in giving and grading exams at the moment, so it won't happen immediately.

albyn
It might be useful to see the code pipeline that your biological colleagues used.

Best wishes
Gordon
