I am using neqc to normalize (bg correct, quantile normalize, and log2 transform) Illumina microarray data downloaded from GEO but am getting results that I am suspicious of. I do not have access to the negative control probe files but do have access to Detection P values (GSE39313 and GSE49000). After editing the raw non-normalized expression txt files so that "AVG_Signal" and "Detection" are included in the column names along with the sample names, I run the following:
data <- read.ilmn(data_paths, probeid="ID_REF") data <- neqc(data)$E
If I do a histogram of the now transformed values I get a bimodal distribution (a primary peak at log2 of ~ 6 and a secondary peak at log2 of ~14). I get a similar behaviour for both GSE's so I suspect it's not an oddity of the data. Additionally, after performing differential expression analysis I get a (suspiciously?) large number of differentially expressed genes with adjusted p-values > 0.05 and log-fold-change > 1 (~3000). Also, the top log-fold change values are also very high (10-14). If I import the data and just log2 transform (not using neqc), the data is approximately normally distributed (not bimodal). I suspect that neqc has done something strange and I'm hoping to figure out why so that I can still use it.
Any suggestions or thoughts?
I think I figured this one out (bi-modality is perhaps to be expected - see here). It seems to be quite a common observation with RMA normalized expression data (neqc is very similar to RMA).
No, bimodality isn't particularly expected nor is it very common.