Search
Question: very high adjusted P-vals using Illumina data from GEO
1
2.8 years ago by
Jason20
Germany
Jason20 wrote:

Hello all,

as a disclaimer, I am new to array analysis, having started this week, however I cannot find a solution and suspect I must be doing something wrong, so I have resorted to making my first post.

My problem in particular is that I see abnormally high adjusted P-values in topTable, that look so unrealistic that I imagine it must be my mistake. It could be pure coincidence, but aside from the exercise vignettes, of the 6 analyses I have tried to perform, retrieving data from GEO (all are GSE identifiers) 3 have presented the problem and they are all Illumina data. I have not found worked examples of processing GSE data using Illumina, which is perhaps problematic as I could also not resort to processing from the original data. I have thus picked up vignettes from the point of having an expression set, set my model and contrast matrix, do the fit and then in my topTable nothing seems conclusive. From the snippet below, the adj.p has a minimum of 0.998 in one case and 0.119 in another.

What I'm using:

source("http://bioconductor.org/biocLite.R")
# install the GEO libraries
biocLite("Biobase")
biocLite("GEOquery")
biocLite("limma")
library(Biobase)
library(GEOquery)
library(limma)

# retrieve dataset
gset <- getGEO("GSE40458", GSEMatrix =TRUE)
gset <- gset[[1]]

# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
sml <- c("KDctrl","KD126","OE126","OEctrl","KDctrl","KD126","OE126","OEctrl",
"KDctrl","KD126","OE126","OEctrl")

# for analysis of this dataset, authors rejected any probe that did not meet a
# threshold of 7.8 across all measures for any condition. Screen the exprs for
# any such probes to reject
ex <- exprs(gset)
keep <- !apply(ex,1,function(x){TRUE %in% (x<7.8)})
exprs(gset) <- ex[keep,]

# set up the data and proceed with analysis
fl <- as.factor(sml)
gset$description <- fl # check assignments were done properly pData(gset)[,c(1,12,20)] # some visualization boxplot(exprs(gset)) plotMDS(gset,labels=gset$description)

# set design and perform fit
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)

# for contrasts, determine expression difference between treatment and
# corresponding control
cont.matrix <- makeContrasts(KD126="KD126-KDctrl",OE126="OE126-OEctrl",
levels=design)

fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
# peruse results
topTable(fit2,coef=2,adjust.method = "BH")[,c(1,13,31,34,35)]

from the diagnostic plots in the middle, the clustering seems a little all over the place. Boxplots look median distributed (mostly). In searching, I came across someone else who mentioned this issue Unexpectedly high FDR using contrasts in Limma? about 9 years ago, the resolution of which seemed to be that there wasn't a problem with the analysis. I wanted to check here though, as I get similar problems with GSE40189 and GSE69630. The latter two have not been published, the one described above has with biological conclusions drawn from the data. The authors used GeneSpring and an intensity cutoff (included in the code I'm using) but still, it doesn't look like I can conclude anything from either experiment. This was what they had to say:

Microarray

RNA from transduced human CB cells was extracted using Trizol (Invitrogen) and gene expression assayed on HT-12_v4 microarrays (Illumina). Quantile normalization was performed and probes were filtered by detection p-value (<=0.1) (GeneSpring GX, Agilent). Next, to remove uninformative probes, those that did not exceed a threshold of 7.8 in all replicates of any one condition were eliminated, leaving 15812 probes for analysis.

Am I doing something wrong? Is there a best practice for working with Illumina data when the raw data is not available? All the affymetrix GSE sets I've done have gone off without a hitch. I see in the GEO entry for the Illumina ones SOFT files, but again have found no workflow for working with them, the preference being for GSEMatrix. Thanks in advance for any suggestions.

Jason

R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_3.26.5         GEOquery_2.36.0      Biobase_2.30.0
[4] BiocGenerics_0.16.1  BiocInstaller_1.20.1

loaded via a namespace (and not attached):
[1] tools_3.2.3    RCurl_1.95-4.7 bitops_1.0-6   XML_3.98-1.3  
modified 2.8 years ago by James W. MacDonald48k • written 2.8 years ago by Jason20
3
2.8 years ago by
Sean Davis21k
United States
Sean Davis21k wrote:

This is a really nice reproducible example.  Running it myself, it looks like there is little evidence of differential expression in the first contrast and perhaps 5 or six genes in the second contrast that could be considered significant (FDR<0.5, for example).  Unfortunately, this probably means that these experiments are not "positive".  You could try using gene set testing to look for subtle effects among larger groups of genes, but those types of results, even when positive, are often hard to interpret.

thanks - I leaned heavily on your GEOquery/GEOmetadb manual for sorting the different handling between GDS/GSE/GSM formats actually. While I'm glad I'm not completely out to lunch, I was hoping that maybe I *was* doing something glaringly obviously wrong that would be easily fixable. I'd've rather felt silly and have it fixed, than feel clever but the data be unusable :-)

3
2.8 years ago by
United States
James W. MacDonald48k wrote:

The only thing I see is that you mis-interpreted the filtering criteria. They required at least one group of samples to have an expression level > 7.8, not all groups. This is a biased way to filter, and if you want to do something like that you should instead use group size (e.g. 3 samples), rather than the groups themselves. The filtering method you use is particularly inefficient as well. Something like

ind <- rowSums(exprs(gset) > 7.8) >= 3

gset <- gset[ind,]

Is very fast, being vectorized, and will ensure that at least three samples per gene exceeded the cutoff, without specifying group membership (which biases the filtering towards your hypothesis).

If  you do a PCA plot of the data, you will see that all the sample types are in one big blob, rather than being separated into groups. This indicates that, at a high level, there are no glaringly large differences. Or alternatively, it might indicate that there are technical or biological confounders that are not being controlled for, although I don't see anything obvious. Either way, it's not surprising at all that you see no differences, given the sparse number of replicates and lack of separation on the first two PCs.

This is, in my experience, the norm rather than the exception. Often times people try to detect subtle differences with woefully underpowered experiments (funding constraints being what they are), resulting in no genes that are different enough to survive adjustments for multiplicity.