Hello all,
as a disclaimer, I am new to array analysis, having started this week, however I cannot find a solution and suspect I must be doing something wrong, so I have resorted to making my first post.
My problem in particular is that I see abnormally high adjusted P-values in topTable, that look so unrealistic that I imagine it must be my mistake. It could be pure coincidence, but aside from the exercise vignettes, of the 6 analyses I have tried to perform, retrieving data from GEO (all are GSE identifiers) 3 have presented the problem and they are all Illumina data. I have not found worked examples of processing GSE data using Illumina, which is perhaps problematic as I could also not resort to processing from the original data. I have thus picked up vignettes from the point of having an expression set, set my model and contrast matrix, do the fit and then in my topTable nothing seems conclusive. From the snippet below, the adj.p has a minimum of 0.998 in one case and 0.119 in another.
What I'm using:
source("http://bioconductor.org/biocLite.R")
# install the GEO libraries
biocLite("Biobase")
biocLite("GEOquery")
biocLite("limma")
library(Biobase)
library(GEOquery)
library(limma)
# retrieve dataset
gset <- getGEO("GSE40458", GSEMatrix =TRUE)
gset <- gset[[1]]
# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))
# group names for all samples
sml <- c("KDctrl","KD126","OE126","OEctrl","KDctrl","KD126","OE126","OEctrl",
"KDctrl","KD126","OE126","OEctrl")
# for analysis of this dataset, authors rejected any probe that did not meet a
# threshold of 7.8 across all measures for any condition. Screen the exprs for
# any such probes to reject
ex <- exprs(gset)
keep <- !apply(ex,1,function(x){TRUE %in% (x<7.8)})
exprs(gset) <- ex[keep,]
# set up the data and proceed with analysis
fl <- as.factor(sml)
gset$description <- fl
# check assignments were done properly
pData(gset)[,c(1,12,20)]
# some visualization
boxplot(exprs(gset))
plotMDS(gset,labels=gset$description)
# set design and perform fit
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
# for contrasts, determine expression difference between treatment and
# corresponding control
cont.matrix <- makeContrasts(KD126="KD126-KDctrl",OE126="OE126-OEctrl",
levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
# peruse results
topTable(fit2,coef=1,adjust.method = "BH")[,c(1,13,31,34,35)]
topTable(fit2,coef=2,adjust.method = "BH")[,c(1,13,31,34,35)]
from the diagnostic plots in the middle, the clustering seems a little all over the place. Boxplots look median distributed (mostly). In searching, I came across someone else who mentioned this issue Unexpectedly high FDR using contrasts in Limma? about 9 years ago, the resolution of which seemed to be that there wasn't a problem with the analysis. I wanted to check here though, as I get similar problems with GSE40189 and GSE69630. The latter two have not been published, the one described above has with biological conclusions drawn from the data. The authors used GeneSpring and an intensity cutoff (included in the code I'm using) but still, it doesn't look like I can conclude anything from either experiment. This was what they had to say:
Microarray
RNA from transduced human CB cells was extracted using Trizol (Invitrogen) and gene expression assayed on HT-12_v4 microarrays (Illumina). Quantile normalization was performed and probes were filtered by detection p-value (<=0.1) (GeneSpring GX, Agilent). Next, to remove uninformative probes, those that did not exceed a threshold of 7.8 in all replicates of any one condition were eliminated, leaving 15812 probes for analysis.
Am I doing something wrong? Is there a best practice for working with Illumina data when the raw data is not available? All the affymetrix GSE sets I've done have gone off without a hitch. I see in the GEO entry for the Illumina ones SOFT files, but again have found no workflow for working with them, the preference being for GSEMatrix. Thanks in advance for any suggestions.
Jason
R version 3.2.3 (2015-12-10) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) locale: [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C [5] LC_TIME=English_Canada.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.26.5 GEOquery_2.36.0 Biobase_2.30.0 [4] BiocGenerics_0.16.1 BiocInstaller_1.20.1 loaded via a namespace (and not attached): [1] tools_3.2.3 RCurl_1.95-4.7 bitops_1.0-6 XML_3.98-1.3
thanks - I leaned heavily on your GEOquery/GEOmetadb manual for sorting the different handling between GDS/GSE/GSM formats actually. While I'm glad I'm not completely out to lunch, I was hoping that maybe I *was* doing something glaringly obviously wrong that would be easily fixable. I'd've rather felt silly and have it fixed, than feel clever but the data be unusable :-)