missing values in limma/contrasts.fit


Albyn Jones ▴ 70

@albyn-jones-3850

Last seen 9.6 years ago

Dear BioConductor Folk,

The help file for contrasts.fit states:

"Warning. For efficiency reasons, this function does not re-factorize the design matrix for each probe. A consequence is that, if the design matrix is non-orthogonal and the original fit included quality weights or missing values, then the unscaled standard deviations produced by this function are approximate rather than exact. The approximation is usually acceptable...."

My attention was drawn to this statement when a colleague in biology asked me why one would get different sets of probes identified as differentially expressed, depending on which individual or biological sample was selected as the reference in a balanced loop design.

My experience, admittedly limited, suggests that the gain in computational efficiency is not worth the loss of accuracy. Even if one has to sacrifice the efficiency of a single pass through the raw data, at least one gets correct results. I have hacked a version of lmFit to evaluate contrasts with standard errors based on the exact covariance matrix. It runs essentially as quickly as lmFit, so I find the efficiency argument uncompelling.

A search of the archive produced several discussions of missing values in limma. The main argument I see is Gordon Smyth's (Date: 2008-03-08):

"The ideal solution is not to introduce missing values into your data in the first place. In my experience, missing values are almost always avoidable. I have never seen a situation where it was necessary or desirable to introduce a large proportion of missing values."

My colleagues in biology report that they inspect their arrays visually and note probes which have been scratched, probes covered by background blobs, and the like. These categories seem to satisfy the missing-at-random criterion: the probe is marked NA not because it is saturated or below background, but because it was unreadable for reasons unrelated to the response.

I'd appreciate feedback: has anyone else already done this? Would others find this useful? Are there objections I have overlooked?

albyn
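For readers who want to see the size of the effect, here is a minimal sketch in plain Python (not limma code). It assumes a hypothetical six-array loop design on samples A, B, C, one probe whose first array is missing, and the contrast C - B. The "approximate" calculation is my reading of the help-file warning: keep each probe's own coefficient standard deviations, but reuse the coefficient correlation matrix of the complete design. That reading, and every name below, is an assumption made for illustration.

```python
import math

# Hypothetical loop design, arrays: B vs A, C vs B, A vs C, each run twice.
# Reference A: coefficients are (B - A) and (C - A).
DESIGN_REF_A = [[1, 0], [-1, 1], [0, -1]] * 2
CONTRAST_REF_A = [-1, 1]          # C - B = (C - A) - (B - A)

# Same arrays with reference B: coefficients are (A - B) and (C - B).
DESIGN_REF_B = [[-1, 0], [0, 1], [1, -1]] * 2
CONTRAST_REF_B = [0, 1]           # C - B is a coefficient itself


def unscaled_cov(rows):
    """(X'X)^{-1} for a two-column design given as a list of rows."""
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows)
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]


def quad_form(v, m):
    """v' M v for a 2-vector v and a 2x2 matrix m."""
    return sum(v[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


def exact_sd(observed_rows, contrast):
    """Unscaled contrast SD from the probe's own (X'X)^{-1}."""
    return math.sqrt(quad_form(contrast, unscaled_cov(observed_rows)))


def approx_sd(observed_rows, all_rows, contrast):
    """Assumed approximation: per-probe coefficient SDs combined with
    the coefficient correlation matrix of the complete design."""
    cov_obs = unscaled_cov(observed_rows)
    cov_full = unscaled_cov(all_rows)
    sd_obs = [math.sqrt(cov_obs[i][i]) for i in range(2)]
    sd_full = [math.sqrt(cov_full[i][i]) for i in range(2)]
    cor_full = [[cov_full[i][j] / (sd_full[i] * sd_full[j])
                 for j in range(2)] for i in range(2)]
    scaled = [sd_obs[i] * contrast[i] for i in range(2)]
    return math.sqrt(quad_form(scaled, cor_full))


# The probe is NA on the first array (B vs A); the other five are observed.
obs_a = DESIGN_REF_A[1:]
obs_b = DESIGN_REF_B[1:]

exact_a = exact_sd(obs_a, CONTRAST_REF_A)
exact_b = exact_sd(obs_b, CONTRAST_REF_B)
approx_a = approx_sd(obs_a, DESIGN_REF_A, CONTRAST_REF_A)
approx_b = approx_sd(obs_b, DESIGN_REF_B, CONTRAST_REF_B)

print(f"exact,  reference A: {exact_a:.4f}")
print(f"exact,  reference B: {exact_b:.4f}")
print(f"approx, reference A: {approx_a:.4f}")
print(f"approx, reference B: {approx_b:.4f}")
```

In this toy case the exact value is about 0.6124 under either parametrization, while the approximation gives about 0.6648 with A as reference and 0.6124 with B as reference. So under the assumed approximation the t-statistic for the same contrast on the same data can depend on the choice of reference, which is the kind of discrepancy my colleague noticed.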

probe limma • 921 views

updated 14.4 years ago by Ramon Diaz ★ 1.1k • written 14.4 years ago by Albyn Jones ▴ 70


Ramon Diaz ★ 1.1k

@ramon-diaz-159

Last seen 9.6 years ago

Dear Albyn,

On Monday 14 December 2009 20:15:05 Albyn Jones wrote:

> [...]
> My colleagues in biology report that they inspect their arrays
> visually and note probes which have been scratched, probes covered by
> background blobs and the like.

That is exactly the same here with many of my colleagues.

> These categories seem to satisfy the missing-at-random criterion:
> the probe is marked NA not because it is saturated or below
> background, but because it was unreadable for reasons unrelated to
> the response.

Yes. And those NAs are, then, truly NAs. Even if they were not MCAR or MAR, they are NAs nonetheless. However, it is also the case that these "true NAs" are only a very minor fraction of the total number of points.

> I'd appreciate feedback: has anyone else already done this? Would
> others find this useful? Are there objections I have overlooked?

Yes, I'd find it useful.

Best,

R.

--
Ramon Diaz-Uriarte
Biocomputing Programme
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
http://ligarto.org/rdiaz

14.4 years ago • Ramon Diaz ★ 1.1k
