Hi
I am trying to get the DE genes from a GEO dataset (my fourth one)
I began the analysis like I did for the three other datasets, for which I had no problems. But for this one, I can't see what I did differently that make it so that the TopTable give only one adj.P.Val, the same for every probes, extremely close to one.
I have to say that my previous datasets were Hgu133 or Illumina HT12, and that this one is a GeneST.
EDIT : Note : in the paper, they analyzed data with Affymetrix Expression Console. Could it be that I need to start over from the raw data and perform the normalization myself since AEC/TAC data are not compatible with limma ?
Here is my code as a MWE:
library("ggplot2") library("Biobase") library("GEOquery") library("limma") eset <- getGEO("GSE76736", GSEMatrix = TRUE) eset <- eset[[1]] ### Change Annotations ph = eset@phenoData source <- data.frame(c("M0", "M0", "M0", "Mtgf", "Mtgf", "Mtgf" ,"M1", "M1", "M1", "M2a", "M2a", "M2a" , "M2c", "M2c", "M2c")) colnames(source) <- "Source" ph@data <- cbind(ph@data, source) ph@data <- cbind(ph@data[,42 ,drop = FALSE], ph@data[,1:41 , drop = FALSE]) ### QC ### boxplot(exprs(eset), las = 2) # Normalization has already be done # set design and perform fit groups = ph@data$Source design <- model.matrix(~0 + groups) colnames(design) <- levels(groups) fit <- lmFit(eset, design) ##Set contrasts contrast.matrix <- makeContrasts(M2a-M1, M2c-M1, Mtgf-M1, M1-M0, M2a-M0, M2c-M0, Mtgf-M0 ,levels=design) fit.con = contrasts.fit(fit,contrast.matrix) fit.con.eb <- eBayes(fit.con, trend=TRUE) topTable(fit.con.eb, coef=1, adjust="BH")[,9:14]
And the output :
logFC AveExpr t P.Value adj.P.Val B 16653793 -1.6888267 5.416873 -5.623638 8.459996e-05 0.999899 -3.317159 16710138 -0.5723500 3.545227 -5.348773 1.347648e-04 0.999899 -3.364100 16655115 1.1535733 4.186269 5.291836 1.486044e-04 0.999899 -3.374322 16885489 -0.6383167 3.807446 -5.263675 1.559915e-04 0.999899 -3.379443 16656411 1.0819533 4.369143 5.254132 1.585813e-04 0.999899 -3.381188 16655785 -2.9586867 3.710191 -5.182546 1.795004e-04 0.999899 -3.394442 16813419 0.6318233 2.975794 5.133235 1.955734e-04 0.999899 -3.403739 16655665 -0.8731133 5.076277 -5.102969 2.061775e-04 0.999899 -3.409514 16874922 -2.5489900 7.226400 -5.096578 2.084924e-04 0.999899 -3.410740 16920528 -0.5957867 3.682801 -5.037169 2.313557e-04 0.999899 -3.422252
Session Info :
R version 3.2.4 Revised (2016-03-16 r70336) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) locale: [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C [5] LC_TIME=French_France.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.26.9 GEOquery_2.36.0 Biobase_2.30.0 BiocGenerics_0.16.1 [5] ggplot2_2.1.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.5 XML_3.98-1.4 bitops_1.0-6 grid_3.2.4 plyr_1.8.3 [6] gtable_0.2.0 scales_0.4.0 splines_3.2.4 tools_3.2.4 munsell_0.4.3 [11] RCurl_1.95-4.8 colorspace_1.2-6
Hum, I am pretty sure there is a problem, since :
It's not just stretches, here I get adj.P.val =
0.999899
forcoef=1
for all the 53617 probesAny idea why that might happen ?
In other words, it's a stretch of 53617 genes with the same adjusted p-value. That's not uncommon when you have no DE genes. Just have a look at your top-ranking gene; it has a p-value of just under 10e-4. After correction for 50000 tests, it basically becomes an adjusted p-value of 1. If your top-ranking gene is not DE, what hope can you have for the rest? The same applies for your other contrasts, i.e., none of your genes are DE in any of your comparisons.
I would suggest you double-check the pre-processing steps, i.e., normalization, log-transformation, filtering, etc. Also check if there was a batch effect that they had to block on, e.g., one replicate in each group belongs to a batch. From what I can see, there's nothing wrong with the limma code, so it must be something else going on if your results are not similar to what was published. See if you can get their analysis code; it's notoriously hard to reproduce an analysis from its plain-text description.
OK, I think I get it. Maybe if I filter out genes (like probes without entrez ID) I will get something significant. I may also remove low expression genes.
I won't be able to reproduce the code though, since they used affymetrix softwares. But I don't intend to either, if I can get a list of DE genes for this experience, it's enough, it doesn't have to be exactly the same processing as they did.
Actually, if I start from the CEL files instead of the expression set GEO query provide, It might be even better.
Hi, so I did myself the whole preprocessing and filtering.
I have in the end 38598 probeset that are associated to a gene. Again, topTable give me, for one comparison, an adj.P.Val close to one (and always the same).
Even if I remove more data by excluding gene with a low Amean (
fit.con.eb <- eBayes(fit.con[fit.con$Amean>6,], trend=TRUE)
), it still don't get correct adj.P.Val.However, with
topTable(fit.con.eb)
(so all contrasts) I lower the adj.P.Value too about 0.1 (at best)I am kind of lost, cause I don't see what I did wrong here...
Why do you assume you have done something wrong? Did you read the underlying paper that these data come from? Here is the relevant section from that paper:
Gene expression upregulated or downregulated by at least 2-fold with p < 0.05 was considered to be statistically significant.
What is your goal here? To reproduce their results, or to use their data for your own purposes (and by reproduce, I am using Jeff Leek's definition from here)? If you are trying to reproduce, then you have to actually know exactly what they did, because if you do something different then you are re-analyzing, not reproducing. In which case what does it matter if your results are different?
Yes, I read the paper and no, I am not trying to reproduce, but I want to analyze their data by myself (to answer my questions which differ from theirs).
I can't reproduce, since, as it is described, they performed the analysis using Affymetrix Expression Console and Transcriptome Analysis console, and I am using R and Bioconductor to do it.
I can't imagine that there is no DE genes between the conditions, because I know that there is differential expression between the studied conditions. So there must be something wrong somewhere !
They used Affymetrix Expression Console to fit RMA, and then used TAC to fit a one-way ANOVA, which is almost identical to what you have done. Using TAC on a Gene ST array isn't like an HTA array, IIRC.
And how do you 'know' that there are differentially expressed genes? They used an unadjusted p-value and a fold change criterion, which I can assure you means that they didn't have anything when they did FDR. And I 'know' this to be true, because nobody would try to publish a paper using an unadjusted p-value and fold change if they could have used FDR. With FDR, reviewers in general will have no problem with your paper, but it's a more difficult sell if you use the unadjusted p-value and fold change (not impossible, obviously, but a harder sell, and maybe at a lower impact factor).
^^ I "know" it because I already analyzed other dataset with similar conditions, on affy hgu133 and Illumina HT 12, and found DE genes without problems.
I'll have to investigate the batch effect though. Well thank you for your help, and let that be a lesson for me :)
This gets back to what Ryan said. You have three replicates per group, and if you look at a PCA plot (which you should have done), you will see that the first principal component is dominated by, um, something that likely has nothing to do with the underlying biology. With only three replicates, and outliers for three of the treatments, you won't likely get much. You can salvage one contrast by including array weights.
Or you can try surrogate variables
Which helps as well, but nothing is likely to 'fix' those outliers, which is probably why the authors used an unadjusted p and FC criteria.
Well I checked the PCA plot yesterday after our talk, and this confirmed tour statements that the dataset was rubbish. I didn't understood that they were using unadjusted P in the paper (though they just simplified the term, but meant the adjusted p.val).
Thanks for the salvation code, sadly, the 6th contrast (if I understood what you did) is not the most informative contrast I need (doesn't mean much without the others). So I guess I just leave this dataset behind me, and hopes for the best about my "homemade" results !
Thank you both so much for the help and explanations!
"I didn't understood that they were using unadjusted P in the paper (thought they ... meant the adjusted p.val)." - And this is exactly why people submit papers with no statistical significance at all, hoping readers and referees will make errors like this when reading it and thereby be deceived.
Even if you know that there is true differential expression, it's possible that the samples from this data set were not of sufficient quality to detect it, due to errors made during sample prep, hidden batch effects obscuring the true effect, etc. Remember that an adjusted p-value of 1 doesn't mean there is no differential expression, it just means that this particular data set does not show any evidence of differential expression.
Note that the original study seems to use a threshold on the raw p-value. If you were to count the number of genes with p-values below 0.05 reported by limma, you would probably find plenty of apparently DE genes. However, this is rarely advisable for genomics studies; the adjusted p-value is a much better choice to avoid getting spurious results. Don't give in to temptation just to get some genes - even if their p-values are below 0.05, an adjusted p-value near 1 means that almost all of your "discoveries" will be false positives.
So I kind of have no other choice than abandon this dataset ?
Is it frequent to have zero evidence of differential expression ? Because I am waiting for my own results of an HTA chip, I don't want to tell my chief that it will be unusable because of bad luck :(
You could try applying SVA, RUV, etc. to see if there's hidden batch effects. If they're present, blocking on them would reduce your variance and increase power. Other than that, though, I don't have many ideas for what you could do; remember, rubbish in, rubbish out.
As for whether it's frequent to get no evidence of DE; well, it obviously varies depending on your biological system (e.g., I'd be shocked to get no detectable DE for tumor/normal comparisons) and on how good you or your collaborators are at experimental work. You can't really control the former, but you can control the latter with experimental design, fine-tuned protocols, etc.