missing value handling in limma
xiaocui zhu
@xiaocui-zhu-801
Hi all,

I recently used the linear model fit in limma to rank differentially expressed genes between treated and control samples. The data set includes three log2(Treated/Control) replicate sets, with a dye swap for each replicate, so the design matrix is c(1,-1,1,-1,1,-1).

Among the top-ranked genes, I noticed that some have only one log2Ratio measurement, with the rest being "NA". I set the log2Ratio of a gene to "NA" if its green or red intensity measurement is below background, saturated, low intensity, or non-uniform.

I am wondering how the linear model in limma handles missing values, and why a gene with only one data point is identified as a high-ranking differentially expressed gene.

Thank you in advance for your help!

Xiaocui
@gordon-smyth
WEHI, Melbourne, Australia

It is perfectly possible, although very unlikely, for a gene with only one non-missing value to be top-ranked (when analyzing two-color microarray data). It would have to have an extraordinarily large fold change for this to happen.

limma handles missing values in the usual way for linear models at the lmFit() step. A gene with only one value will get df.residual=0. At the shrinkage step, the residual standard deviation for such a gene will be reset to the consensus value across all genes, and the corresponding degrees of freedom will be df.prior. This is explained in the article Smyth, SAGMB, 2004, cited in the documentation.
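
The behaviour described above can be checked on simulated data. The following is only a minimal sketch, not the original poster's analysis: the number of genes, the variance model, and the log-ratio of 4 for the single-value gene are all made-up assumptions chosen so that the prior degrees of freedom come out finite.

```r
library(limma)

set.seed(1)
n_genes <- 1000                                   # hypothetical number of genes
design  <- cbind(TvsC = c(1, -1, 1, -1, 1, -1))   # dye-swap design from the question

# Assumed simulation: gene-wise variances vary around 0.09,
# and there is no true differential expression.
s2 <- 0.09 * 4 / rchisq(n_genes, df = 4)
M  <- matrix(rnorm(n_genes * 6, sd = sqrt(s2)), nrow = n_genes)
rownames(M) <- paste0("gene", seq_len(n_genes))

# gene1: one large log-ratio, all other arrays flagged as NA
M[1, ] <- c(4, NA, NA, NA, NA, NA)

fit <- lmFit(M, design)
fit$df.residual[1]      # no residual df for gene1

fit <- eBayes(fit)
fit$s2.post[1]          # posterior variance for gene1 ...
fit$s2.prior            # ... is the consensus (prior) variance
fit$df.total[1]         # total df for gene1 ...
fit$df.prior            # ... is the prior df

topTable(fit, coef = 1, number = 5)
```

With a log-ratio this large relative to the consensus standard deviation, gene1 would be expected to appear at or near the top of the table, which is the situation described in the question.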

Gordon

PS. For a single channel technology, the gene would have to have 2 non-missing values before it could have a fold change and a p-value.
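
The single-channel case can be sketched the same way. Again this is a hedged illustration on simulated data: the group labels, the expression values, and the row names `one_value` and `two_values` are hypothetical, and limma may warn about partial NA coefficients when fitting.

```r
library(limma)

set.seed(2)
n_genes <- 1000                                   # hypothetical number of genes
group   <- factor(c("C", "C", "C", "T", "T", "T"))
design  <- model.matrix(~ group)                  # column "groupT" is the T vs C log fold change

# Assumed simulation: log2 expression around 8 with gene-wise variances
s2 <- 0.09 * 4 / rchisq(n_genes, df = 4)
E  <- matrix(rnorm(n_genes * 6, mean = 8, sd = sqrt(s2)), nrow = n_genes)
rownames(E) <- paste0("gene", seq_len(n_genes))

# Row 1: a single treated value; row 2: one control and one treated value
E[1, ] <- c(NA, NA, NA, 12, NA, NA)
E[2, ] <- c(8,  NA, NA, 12, NA, NA)
rownames(E)[1:2] <- c("one_value", "two_values")

fit <- eBayes(lmFit(E, design))

# With one value the fold change cannot be estimated (NA);
# with one value per group it can, and a p-value follows.
fit$coefficients[1:2, "groupT"]
fit$p.value[1:2, "groupT"]
```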
