very high adjusted P-vals using Illumina data from GEO
Jason ▴ 20
@jason-9567
Last seen 8.0 years ago
Germany

Hello all,

As a disclaimer, I am new to array analysis, having started this week. However, I cannot find a solution and suspect I must be doing something wrong, so I have resorted to making my first post.

My problem in particular is that I see abnormally high adjusted P-values in topTable, so high that I imagine it must be my mistake. It could be pure coincidence, but aside from the exercise vignettes, of the 6 analyses I have tried to perform on data retrieved from GEO (all GSE identifiers), 3 have presented this problem, and all 3 are Illumina data. I have not found worked examples of processing Illumina GSE data, which is perhaps part of the problem, since I could not fall back on processing the raw data either. I have therefore picked up the vignettes from the point of having an expression set, set up my model and contrast matrices, run the fit, and then nothing in topTable looks conclusive. In the snippet below, adj.P.Val has a minimum of 0.998 for one contrast and 0.119 for the other.

What I'm using:

source("http://bioconductor.org/biocLite.R")
# install the GEO libraries
biocLite("Biobase")
biocLite("GEOquery")
biocLite("limma")
library(Biobase)
library(GEOquery)
library(limma)

# retrieve dataset
gset <- getGEO("GSE40458", GSEMatrix =TRUE)
gset <- gset[[1]]

# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
sml <- c("KDctrl","KD126","OE126","OEctrl","KDctrl","KD126","OE126","OEctrl",
         "KDctrl","KD126","OE126","OEctrl")

# for analysis of this dataset, authors rejected any probe that did not meet a
# threshold of 7.8 across all measures for any condition. Screen the exprs for
# any such probes to reject
ex <- exprs(gset)
keep <- !apply(ex, 1, function(x) any(x < 7.8))
gset <- gset[keep, ]

# set up the data and proceed with analysis
fl <- as.factor(sml)
gset$description <- fl

# check assignments were done properly
pData(gset)[,c(1,12,20)]

# some visualization
boxplot(exprs(gset))
plotMDS(gset,labels=gset$description)

# set design and perform fit
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)

# for contrasts, determine expression difference between treatment and
# corresponding control
cont.matrix <- makeContrasts(KD126="KD126-KDctrl",OE126="OE126-OEctrl",
                             levels=design)

fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
# peruse results
topTable(fit2,coef=1,adjust.method = "BH")[,c(1,13,31,34,35)]
topTable(fit2,coef=2,adjust.method = "BH")[,c(1,13,31,34,35)]

 

From the diagnostic plots in the middle, the clustering seems a little all over the place. The boxplots look reasonably centred on the median (mostly). In searching, I came across someone else who raised this issue about 9 years ago ("Unexpectedly high FDR using contrasts in Limma?"), and the resolution seemed to be that there was nothing wrong with the analysis. I wanted to check here though, as I get similar problems with GSE40189 and GSE69630. Those two have not been published; the one described above has, with biological conclusions drawn from the data. The authors used GeneSpring and an intensity cutoff (included in the code I'm using), but still, it doesn't look like I can conclude anything from either experiment. This was what they had to say:

Microarray

RNA from transduced human CB cells was extracted using Trizol (Invitrogen) and gene expression assayed on HT-12_v4 microarrays (Illumina). Quantile normalization was performed and probes were filtered by detection p-value (<=0.1) (GeneSpring GX, Agilent). Next, to remove uninformative probes, those that did not exceed a threshold of 7.8 in all replicates of any one condition were eliminated, leaving 15812 probes for analysis.
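(For reference, my reading of that criterion in R would be something like the sketch below, starting from the unfiltered gset and using the sml group vector defined above; this is my own rough interpretation, not the authors' GeneSpring steps, and it would replace the simpler cutoff in my script.)

ex <- exprs(gset)
# for each condition, check whether every replicate exceeds 7.8
pass_per_group <- sapply(unique(sml), function(g) {
  rowSums(ex[, sml == g, drop = FALSE] > 7.8) == sum(sml == g)
})
# keep a probe if it passes in at least one condition
keep <- rowSums(pass_per_group) >= 1
gset <- gset[keep, ]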

Am I doing something wrong? Is there a best practice for working with Illumina data when the raw data are not available? All the Affymetrix GSE sets I've done have gone off without a hitch. I see SOFT files in the GEO entries for the Illumina datasets, but again I have found no workflow for working with them; the preference seems to be for GSEMatrix. Thanks in advance for any suggestions.
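For completeness, the closest I can figure for reading the SOFT records directly is something like the sketch below, using GEOquery's GSMList() and Table() accessors. I'm assuming ID_REF/VALUE columns and a common probe order across samples, which would need checking for this platform.

soft <- getGEO("GSE40458", GSEMatrix = FALSE)   # SOFT-format download
gsms <- GSMList(soft)                           # per-sample GSM records
probes <- Table(gsms[[1]])$ID_REF               # assumed probe ID column
ex <- sapply(gsms, function(g) as.numeric(Table(g)$VALUE))  # assumed intensity column
rownames(ex) <- probes                          # assumes the same probe order in every GSM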

Jason

R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] limma_3.26.5         GEOquery_2.36.0      Biobase_2.30.0      
[4] BiocGenerics_0.16.1  BiocInstaller_1.20.1

loaded via a namespace (and not attached):
[1] tools_3.2.3    RCurl_1.95-4.7 bitops_1.0-6   XML_3.98-1.3  
@sean-davis-490
Last seen 3 months ago
United States

This is a really nice reproducible example. Running it myself, it looks like there is little evidence of differential expression in the first contrast and perhaps five or six genes in the second contrast that could be considered significant (FDR < 0.5, for example). Unfortunately, this probably means that these experiments are not "positive". You could try using gene set testing to look for subtle effects among larger groups of genes, but those types of results, even when positive, are often hard to interpret.
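For example, a minimal sketch of that idea using limma's camera() might look like the following; the sets here are just placeholders, and in practice the index list could be built from MSigDB collections with ids2indices().

# placeholder gene sets (hypothetical index vectors, not real collections)
sets <- list(setA = 1:50, setB = 51:120)
gs <- camera(exprs(gset), index = sets, design = design,
             contrast = cont.matrix[, "OE126"])
head(gs)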


Thanks - I actually leaned heavily on your GEOquery/GEOmetadb manual for sorting out the different handling of the GDS/GSE/GSM formats. While I'm glad I'm not completely out to lunch, I was hoping that maybe I *was* doing something glaringly obvious and wrong that would be easily fixable. I'd rather have felt silly and had it fixed than feel clever with the data unusable :-)

@james-w-macdonald-5106
Last seen 53 minutes ago
United States

The only thing I see is that you misinterpreted the filtering criterion. They required every replicate in at least one group to exceed 7.8, not every sample in every group. This is a biased way to filter, and if you want to do something like that you should instead use the group size (e.g. 3 samples) rather than the groups themselves. The filtering method you used is also particularly inefficient. Something like

ind <- rowSums(exprs(gset) > 7.8) >= 3
gset <- gset[ind, ]

is very fast, being vectorized, and will ensure that at least three samples per gene exceed the cutoff, without specifying group membership (which would bias the filtering towards your hypothesis).

If you do a PCA plot of the data, you will see that all the sample types sit in one big blob rather than separating into groups. This indicates that, at a high level, there are no glaringly large differences. Alternatively, it might indicate that there are technical or biological confounders that are not being controlled for, although I don't see anything obvious. Either way, it's not at all surprising that you see no differences, given the small number of replicates and the lack of separation on the first two PCs.
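Something along these lines (base R, just a sketch, not necessarily the exact plot I ran) will show it:

pc <- prcomp(t(exprs(gset)))   # samples in rows for PCA
plot(pc$x[, 1], pc$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(pc$x[, 1], pc$x[, 2], labels = as.character(gset$description))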

This is, in my experience, the norm rather than the exception. Often, people try to detect subtle differences with woefully underpowered experiments (funding constraints being what they are), resulting in no genes that are different enough to survive the adjustment for multiplicity.
 


Ah, OK, thanks. I think I overinterpreted what they did. I chose to remove an entire probe if even one sample failed, because I wouldn't be able to make the comparison in the final analysis anyway if probes were missing for some groups. I also worried that some probes might slip through if a few low values were compensated by higher ones, which is why I went for the (admittedly more cumbersome) check of each value. I didn't realize rowSums could be used to filter like that, so thanks for the tip - I'm still working on my R skills. It's a shame that the data are not usable, and that the study has already been out for a few years. I thought I had just had bad luck, hitting so many datasets in my first week that were off. At least I'll know better (and keep the literature clean) when I do my own.
