TopTable Problem : Unique Adj.P.Val which is near 1
1
0
Entering edit mode
giroudpaul ▴ 40
@giroudpaul-10031
Last seen 2.4 years ago
France

Hi

I am trying to get the DE genes from a GEO dataset (my fourth one)

I began the analysis like I did for the three other datasets, for which I had no problems. But for this one, I can't see what I did differently that make it so that the TopTable give only one adj.P.Val, the same for every probes, extremely close to one.

I have to say that my previous datasets were Hgu133 or Illumina HT12, and that this one is a GeneST.

EDIT : Note : in the paper, they analyzed data with Affymetrix Expression Console. Could it be that I need to start over from the raw data and perform the normalization myself since AEC/TAC data are not compatible with limma ?

Here is my code as a MWE:

library("ggplot2")
library("Biobase")
library("GEOquery")
library("limma")

eset <- getGEO("GSE76736", GSEMatrix = TRUE)
eset <- eset[[1]]

### Change Annotations
ph = eset@phenoData
source <- data.frame(c("M0", "M0", "M0", "Mtgf", "Mtgf", "Mtgf"
,"M1", "M1", "M1", "M2a", "M2a", "M2a"
, "M2c", "M2c", "M2c"))
colnames(source) <- "Source"
ph@data <- cbind(ph@data, source)
ph@data <- cbind(ph@data[,42 ,drop = FALSE],
ph@data[,1:41 , drop = FALSE])
### QC ###
boxplot(exprs(eset), las = 2)
# Normalization has already be done

# set design and perform fit
groups = ph@data$Source design <- model.matrix(~0 + groups) colnames(design) <- levels(groups) fit <- lmFit(eset, design) ##Set contrasts contrast.matrix <- makeContrasts(M2a-M1, M2c-M1, Mtgf-M1, M1-M0, M2a-M0, M2c-M0, Mtgf-M0 ,levels=design) fit.con = contrasts.fit(fit,contrast.matrix) fit.con.eb <- eBayes(fit.con, trend=TRUE) topTable(fit.con.eb, coef=1, adjust="BH")[,9:14] And the output :  logFC AveExpr t P.Value adj.P.Val B 16653793 -1.6888267 5.416873 -5.623638 8.459996e-05 0.999899 -3.317159 16710138 -0.5723500 3.545227 -5.348773 1.347648e-04 0.999899 -3.364100 16655115 1.1535733 4.186269 5.291836 1.486044e-04 0.999899 -3.374322 16885489 -0.6383167 3.807446 -5.263675 1.559915e-04 0.999899 -3.379443 16656411 1.0819533 4.369143 5.254132 1.585813e-04 0.999899 -3.381188 16655785 -2.9586867 3.710191 -5.182546 1.795004e-04 0.999899 -3.394442 16813419 0.6318233 2.975794 5.133235 1.955734e-04 0.999899 -3.403739 16655665 -0.8731133 5.076277 -5.102969 2.061775e-04 0.999899 -3.409514 16874922 -2.5489900 7.226400 -5.096578 2.084924e-04 0.999899 -3.410740 16920528 -0.5957867 3.682801 -5.037169 2.313557e-04 0.999899 -3.422252 Session Info : R version 3.2.4 Revised (2016-03-16 r70336) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) locale: [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C [5] LC_TIME=French_France.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.26.9 GEOquery_2.36.0 Biobase_2.30.0 BiocGenerics_0.16.1 [5] ggplot2_2.1.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.5 XML_3.98-1.4 bitops_1.0-6 grid_3.2.4 plyr_1.8.3 [6] gtable_0.2.0 scales_0.4.0 splines_3.2.4 tools_3.2.4 munsell_0.4.3 [11] RCurl_1.95-4.8 colorspace_1.2-6 limma toptable • 1.7k views ADD COMMENT 2 Entering edit mode Aaron Lun ★ 27k @alun Last seen 12 hours ago The city by the bay There's no bug here. It's common to have stretches of genes with the same adjusted p-value, due to the enforced monotonicity. Otherwise you could end up with genes that have smaller p-values but larger adjusted p-values (due to the use of the rank to calculate the latter) which would be silly. The fact that they're all large just means you don't have any DE genes. ADD COMMENT 0 Entering edit mode Hum, I am pretty sure there is a problem, since : 1. The data were published and they do find DE genes 2. It's not just stretches, here I get adj.P.val = 0.999899 for coef=1 for all the 53617 probes 3. I get similar value for all contrasts Any idea why that might happen ? ADD REPLY 1 Entering edit mode In other words, it's a stretch of 53617 genes with the same adjusted p-value. That's not uncommon when you have no DE genes. Just have a look at your top-ranking gene; it has a p-value of just under 10e-4. After correction for 50000 tests, it basically becomes an adjusted p-value of 1. If your top-ranking gene is not DE, what hope can you have for the rest? The same applies for your other contrasts, i.e., none of your genes are DE in any of your comparisons. I would suggest you double-check the pre-processing steps, i.e., normalization, log-transformation, filtering, etc. Also check if there was a batch effect that they had to block on, e.g., one replicate in each group belongs to a batch. From what I can see, there's nothing wrong with the limma code, so it must be something else going on if your results are not similar to what was published. See if you can get their analysis code; it's notoriously hard to reproduce an analysis from its plain-text description. ADD REPLY 0 Entering edit mode OK, I think I get it. Maybe if I filter out genes (like probes without entrez ID) I will get something significant. I may also remove low expression genes. I won't be able to reproduce the code though, since they used affymetrix softwares. But I don't intend to either, if I can get a list of DE genes for this experience, it's enough, it doesn't have to be exactly the same processing as they did. Actually, if I start from the CEL files instead of the expression set GEO query provide, It might be even better. ADD REPLY 0 Entering edit mode Hi, so I did myself the whole preprocessing and filtering. I have in the end 38598 probeset that are associated to a gene. Again, topTable give me, for one comparison, an adj.P.Val close to one (and always the same). Even if I remove more data by excluding gene with a low Amean (fit.con.eb <- eBayes(fit.con[fit.con$Amean>6,], trend=TRUE)), it still don't get correct adj.P.Val.

However, with topTable(fit.con.eb) (so all contrasts) I lower the adj.P.Value too about 0.1 (at best)

  AveExpr F P.Value adj.P.Val 16698200 7.736835 30.59728 1.739200e-06 0.06712965 16874922 7.220016 22.99897 8.779820e-06 0.16944175 16804152 5.054936 21.18824 1.382166e-05 0.17782949 17120874 8.745497 17.45709 3.945268e-05 0.28984691 16870821 6.934675 17.36243 4.061131e-05 0.28984691 17120866 10.125806 16.94454 4.621859e-05 0.28984691 17120064 8.847598 16.53694 5.256564e-05 0.28984691 16858654 6.693285 15.30242 7.889373e-05 0.38064253 17120062 8.841739 14.79589 9.390832e-05 0.40274149 16777460 10.654720 14.47014 1.053064e-04 0.40646176 

I am kind of lost, cause I don't see what I did wrong here...

3
Entering edit mode

Why do you assume you have done something wrong? Did you read the underlying paper that these data come from?  Here is the relevant section from that paper:

Gene expression upregulated or downregulated by at least 2-fold with p < 0.05 was considered to be statistically significant.

What is your goal here? To reproduce their results, or to use their data for your own purposes (and by reproduce, I am using Jeff Leek's definition from here)? If you are trying to reproduce, then you have to actually know exactly what they did, because if you do something different then you are re-analyzing, not reproducing. In which case what does it matter if your results are different?

0
Entering edit mode

Yes, I read the paper and no, I am not trying to reproduce, but I want to analyze their data by myself (to answer my questions which differ from theirs).

I can't reproduce, since, as it is described, they performed the analysis using Affymetrix Expression Console and Transcriptome Analysis console, and I am using R and Bioconductor to do it.

I can't imagine that there is no DE genes between the conditions, because I know that there is differential expression between the studied conditions. So there must be something wrong somewhere !

3
Entering edit mode

They used Affymetrix Expression Console to fit RMA, and then used TAC to fit a one-way ANOVA, which is almost identical to what you have done. Using TAC on a Gene ST array isn't like an HTA array, IIRC.

And how do you 'know' that there are differentially expressed genes? They used an unadjusted p-value and a fold change criterion, which I can assure you means that they didn't have anything when they did FDR. And I 'know' this to be true, because nobody would try to publish a paper using an unadjusted p-value and fold change if they could have used FDR. With FDR, reviewers in general will have no problem with your paper, but it's a more difficult sell if you use the unadjusted p-value and fold change (not impossible, obviously, but a harder sell, and maybe at a lower impact factor).

0
Entering edit mode

^^ I "know" it because I already analyzed other dataset with similar conditions, on affy hgu133 and Illumina HT 12, and found DE genes without problems.

I'll have to investigate the batch effect though. Well thank you for your help, and let that be a lesson for me :)

1
Entering edit mode

This gets back to what Ryan said. You have three replicates per group, and if you look at a PCA plot (which you should have done), you will see that the first principal component is dominated by, um, something that likely has nothing to do with the underlying biology. With only three replicates, and outliers for three of the treatments, you won't likely get much. You can salvage one contrast by including array weights.

> wts <- arrayWeights(eset, design)
> fit <- lmFit(eset, design, weights = wts)
> fit2 <- eBayes(contrasts.fit(fit, contrast))
> sapply(1:ncol(contrast), function(x) nrow(topTable(fit2, x, Inf, p.value = 0.1)))
[1]  0  1  0  0  0 29  0

Or you can try surrogate variables

> design1 <- model.matrix(~grp)
> sv <- sva(exprs(eset), design1)
Number of significant surrogate variables is:  2
Iteration (out of 5 ):1  2  3  4  5  > sv$sv [,1] [,2] [1,] -0.07981621 -0.023416006 [2,] -0.11731567 -0.184954512 [3,] 0.39890894 0.026417724 [4,] -0.09642046 0.385974045 [5,] -0.11129591 0.362462567 [6,] -0.14264933 -0.235489883 [7,] -0.14687314 -0.462205823 [8,] 0.45193443 0.004642424 [9,] -0.17965656 0.058258785 [10,] -0.17653180 -0.013952924 [11,] -0.09758625 -0.042627868 [12,] -0.14478396 -0.443087360 [13,] 0.65968022 -0.076204236 [14,] -0.10131286 0.316896520 [15,] -0.11628145 0.327286546 > design <- cbind(design, sv$sv)
> contrast <- rbind(contrast, matrix(0, ncol = ncol(contrast), nrow=2))
> fit <- lmFit(eset, design)
> fit2 <- eBayes(contrasts.fit(fit, contrast))
> sapply(1:ncol(contrast), function(x) nrow(topTable(fit2, x, Inf, p.value = 0.1)))
[1]   0   3   0   2   0 367   0

Which helps as well, but nothing is likely to 'fix' those outliers, which is probably why the authors used an unadjusted p and FC criteria.

0
Entering edit mode

Well I checked the PCA plot yesterday after our talk, and this confirmed tour statements that the dataset was rubbish. I didn't understood that they were using unadjusted P in the paper (though they just simplified the term, but meant the adjusted p.val).

Thanks for the salvation code, sadly, the 6th contrast (if I understood what you did) is not the most informative contrast I need (doesn't mean much without the others). So I guess I just leave this dataset behind me, and hopes for the best about my "homemade" results !

Thank you both so much for the help and explanations!

1
Entering edit mode

"I didn't understood that they were using unadjusted P in the paper (thought they ... meant the adjusted p.val)." - And this is exactly why people submit papers with no statistical significance at all, hoping readers and referees will make errors like this when reading it and thereby be deceived.

2
Entering edit mode

Even if you know that there is true differential expression, it's possible that the samples from this data set were not of sufficient quality to detect it, due to errors made during sample prep, hidden batch effects obscuring the true effect, etc. Remember that an adjusted p-value of 1 doesn't mean there is no differential expression, it just means that this particular data set does not show any evidence of differential expression.

0
Entering edit mode

Note that the original study seems to use a threshold on the raw p-value. If you were to count the number of genes with p-values below 0.05 reported by limma, you would probably find plenty of apparently DE genes. However, this is rarely advisable for genomics studies; the adjusted p-value is a much better choice to avoid getting spurious results. Don't give in to temptation just to get some genes - even if their p-values are below 0.05, an adjusted p-value near 1 means that almost all of your "discoveries" will be false positives.

0
Entering edit mode

So I kind of have no other choice than abandon this dataset ?

Is it frequent to have zero evidence of differential expression ? Because I am waiting for my own results of an HTA chip, I don't want to tell my chief that it will be unusable because of bad luck :(

1
Entering edit mode

You could try applying SVA, RUV, etc. to see if there's hidden batch effects. If they're present, blocking on them would reduce your variance and increase power. Other than that, though, I don't have many ideas for what you could do; remember, rubbish in, rubbish out.

As for whether it's frequent to get no evidence of DE; well, it obviously varies depending on your biological system (e.g., I'd be shocked to get no detectable DE for tumor/normal comparisons) and on how good you or your collaborators are at experimental work. You can't really control the former, but you can control the latter with experimental design, fine-tuned protocols, etc.