Dear Bioconductor Community,
as i wanted to implement mroast gene set testing to a merged microarray dataset i have analysed, i used the following code to test especially for KEGG pathways from the c2 gene set from Broad Institute. In detail:
> eset.3 # My expression set which used further for limma paired analysis
ExpressionSet (storageMode: lockedEnvironment)
assayData: 9823 features, 60 samples
element names: exprs
protocolData: none
phenoData
sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... 1554_03_Gemmer_1.CEL (60 total)
varLabels: Disease Location Meta_factor Study
varMetadata: labelDescription
featureData
featureNames: 10_at 100_at ... 9997_at (9823 total)
fvarLabels: PROBEID SYMBOL ENTREZID
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2hsentrezg hgu133ahsentrezg
> load("human_c2_v5 (4).rdata")
> kegg.only <- Hs.c2[grep("KEGG", names(Hs.c2))]
> indices <- ids2indices(kegg.only, fData(eset.3)$ENTREZID)
> set.size <- sapply(indices, FUN=length)
> indices2 <- indices[set.size >= 10]
> res <- mroast(eset.3, indices2, design, contrast=2) # for the general comparison of interest
> head(res)
NGenes PropDown PropUp Direction PValue
KEGG_PROTEASOME 39 0.00000000 0.9487179 Up 0.001
KEGG_DNA_REPLICATION 33 0.03030303 0.9393939 Up 0.001
KEGG_AMINOACYL_TRNA_BIOSYNTHESIS 30 0.03333333 0.8666667 Up 0.001
KEGG_RNA_POLYMERASE 21 0.00000000 0.8571429 Up 0.001
KEGG_NUCLEOTIDE_EXCISION_REPAIR 40 0.05000000 0.8250000 Up 0.001
KEGG_MISMATCH_REPAIR 22 0.04545455 0.8181818 Up 0.001
FDR PValue.Mixed FDR.Mixed
KEGG_PROTEASOME 0.002119048 0.001 0.001
KEGG_DNA_REPLICATION 0.002119048 0.001 0.001
KEGG_AMINOACYL_TRNA_BIOSYNTHESIS 0.002119048 0.001 0.001
KEGG_RNA_POLYMERASE 0.002119048 0.001 0.001
KEGG_NUCLEOTIDE_EXCISION_REPAIR 0.002119048 0.001 0.001
KEGG_MISMATCH_REPAIR 0.002119048 0.001 0.001....
Thus, my main questions are:
1) If i want to sort somehow the "most significant results" of mroast, for including a top-10 for instance of most significant pathways, i should order my results based on the FDR statistic ? Or the FDR mixed is more appropriate ?
Furthermore, could i also consider an FDR cutoff ?? i.e. < 0.01 ? Even if it still returns a "big" number of results ? i.e
> sum(res$FDR < 0.01)
[1] 64
2) Secondly, as i further wanted to visualize the "behavior" of a specific KEGG pathway founded significant in the above results of mroast---for instance "KEGG_DNA_REPLICATION", i used the following code:
> barcodeplot(statistics = top2$t, index="KEGG_DNA_REPLICATION")
where top2 relates to the contrast=2 above specifying the coefficient of interest:
> head(top2)
PROBEID SYMBOL ENTREZID logFC AveExpr t P.Value adj.P.Val
10_at 10_at NAT2 10 -2.59651160 8.135373 -4.694594 5.166769e-05 0.0005398345
100_at 100_at ADA 100 0.41261659 3.909195 2.077495 4.614667e-02 0.0877310422
1000_at 1000_at CDH2 1000 0.03096866 3.447650 0.534959 5.965028e-01 0.6869199264
10000_at 10000_at AKT3 10000 -0.13471755 3.691361 -1.323980 1.952097e-01 0.2840772628
10001_at 10001_at MED6 10001 0.45086268 7.727979 4.951596 2.479456e-05 0.0003611121
10004_at 10004_at NAALADL1 10004 -1.59148861 4.146636 -4.540551 8.009259e-05 0.0006956372
B
10_at 1.888538
100_at -4.365374
1000_at -6.246405
10000_at -5.534261
10001_at 2.584195
10004_at 1.473780
> dim(top2)
[1] 9823 9
But then an error message returned:
Error in `$<-.data.frame`(`*tmp*`, "idx", value = c(FALSE, FALSE, FALSE, :
replacement has 9824 rows, data has 9823
So in this case what is the problem ? Or am i interpreting wrong the arguments of barcode function??
Thank you for your consideration on this matter !!!
Efstathios
Dear Gordon,
thank you for your help !! Based on your comments, i have some small further questions :
1) Regarding the number of rotations-based on your experience, should i use something even nrot=10,000 or something similar ? This should be enough regarding the large number of sets ? Or something smaller ?
2) Secondly, regarding the argument "sort" of mroast- i checked it and the default value is "directional"--thus, the returned resulted are returned ordered by the two-sided directional p-value ? And essentially, this p-value is associated with the outcome of the direction of specific gene-set enriched-and finally characterized as "up" or "down" in the output of mroast ?
3) Maybe you meant instead of the output of lmFit the output of eBayes ? Because the latter includes the t-statistics---Nevertheless, the topTable top2 i mentioned above was created with the argument sort.by="none". When i used your above function to test it:
barcodeplot( fit2$t[,2], index=indices2[["KEGG_DNA_REPLICATION"]] )
It runned without an error
4) My final question is about the interpretation of the barcodeplot (below link from my dropbox account)
https://www.dropbox.com/s/zkzbc7d0d28ibhx/image_trial_Mroast_output.png?dl=0
Briefly speaking, how can i associate the visualized enrichment score(i.e. 5.5) with the running t-statistics, for the specific given pathway ? For instance the enrichment score is higher for my up-regulated genes(positive t-statistics) found enriched in the KEGG pathway DNA_Replication ? which also correlated with the Up direction of the mroast output ? Or it has a different explanation ?
Thank you for your time and patience !!
Efstathios
At least nrot=10000. The more the better, as many as you have time to run.
Dear Gordon, thank you for your confirmation about nrot --thus, the bigger the number of rotations, the more "strigent" the returned p-values ? Also, regarding the barcode plot, my approach is correct ?
More rotations doesn't make the p-values more stringent, it makes them more accurate, especially for small p-values. With N rotations, the smallest p-value you can get is 1/N, since p-values are computed by comparing to the empirical null distribution defined by the rotations, and there is no information about the null distribution past the last rotation.
Dear Ryan, thank you for your explanation and correction comment--Thus, with a bigger number of rotations, the p-values of the top results-especially the small ones are more accurate and "valid" to further take into consideration, regarding the enrichment results ? And further, i can consider these for instance with an FDR cutoff les than 0.05 or 0.01 ?