Search
Question: Interpretation of mroast gene set analysis in R and appropriate use of the function barcode plot regarding a microarray dataset
0
gravatar for svlachavas
24 months ago by
svlachavas570
Greece/Athens/National Hellenic Research Foundation
svlachavas570 wrote:

Dear Bioconductor Community,

as i wanted to implement mroast gene set testing to a merged microarray dataset i have analysed, i used the following code to test especially for KEGG pathways from the c2 gene set from Broad Institute. In detail:

> eset.3 # My expression set which used further for limma paired analysis 
ExpressionSet (storageMode: lockedEnvironment)
assayData: 9823 features, 60 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... 1554_03_Gemmer_1.CEL (60 total)
  varLabels: Disease Location Meta_factor Study
  varMetadata: labelDescription
featureData
  featureNames: 10_at 100_at ... 9997_at (9823 total)
  fvarLabels: PROBEID SYMBOL ENTREZID
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2hsentrezg hgu133ahsentrezg

> load("human_c2_v5 (4).rdata")
> kegg.only <- Hs.c2[grep("KEGG", names(Hs.c2))]
> indices <- ids2indices(kegg.only, fData(eset.3)$ENTREZID)
> set.size <- sapply(indices, FUN=length)
> indices2 <- indices[set.size >= 10]
> res <- mroast(eset.3, indices2, design, contrast=2) # for the general comparison of interest

> head(res)

                                                        NGenes   PropDown    PropUp          Direction  PValue
KEGG_PROTEASOME                      39           0.00000000 0.9487179        Up         0.001
KEGG_DNA_REPLICATION               33           0.03030303 0.9393939        Up         0.001
KEGG_AMINOACYL_TRNA_BIOSYNTHESIS     30 0.03333333 0.8666667        Up  0.001
KEGG_RNA_POLYMERASE               21           0.00000000    0.8571429        Up  0.001
KEGG_NUCLEOTIDE_EXCISION_REPAIR      40 0.05000000 0.8250000        Up  0.001
KEGG_MISMATCH_REPAIR                   22 0.04545455 0.8181818        Up  0.001
                                                    FDR                    PValue.Mixed FDR.Mixed
KEGG_PROTEASOME                  0.002119048        0.001              0.001
KEGG_DNA_REPLICATION             0.002119048        0.001            0.001
KEGG_AMINOACYL_TRNA_BIOSYNTHESIS 0.002119048        0.001     0.001
KEGG_RNA_POLYMERASE              0.002119048        0.001     0.001
KEGG_NUCLEOTIDE_EXCISION_REPAIR  0.002119048        0.001     0.001
KEGG_MISMATCH_REPAIR             0.002119048        0.001     0.001....

Thus, my main questions are:

1) If i want to sort somehow the "most significant results" of mroast, for including a top-10 for instance of most significant pathways, i should order my results based on the FDR statistic ? Or the FDR mixed is more appropriate ?

Furthermore, could i also consider an FDR cutoff ?? i.e. < 0.01 ? Even if it still returns a "big" number of results ? i.e 

> sum(res$FDR < 0.01)
[1] 64

2) Secondly, as i further wanted to visualize the "behavior" of a specific KEGG pathway founded significant in the above results of mroast---for instance "KEGG_DNA_REPLICATION", i used the following code:

> barcodeplot(statistics = top2$t, index="KEGG_DNA_REPLICATION")

where top2 relates to the contrast=2 above specifying the coefficient of interest:

> head(top2)
              PROBEID   SYMBOL ENTREZID       logFC  AveExpr         t      P.Value    adj.P.Val
10_at       10_at     NAT2       10 -2.59651160 8.135373 -4.694594 5.166769e-05 0.0005398345
100_at     100_at      ADA      100  0.41261659 3.909195  2.077495 4.614667e-02 0.0877310422
1000_at   1000_at     CDH2     1000  0.03096866 3.447650  0.534959 5.965028e-01 0.6869199264
10000_at 10000_at     AKT3    10000 -0.13471755 3.691361 -1.323980 1.952097e-01 0.2840772628
10001_at 10001_at     MED6    10001  0.45086268 7.727979  4.951596 2.479456e-05 0.0003611121
10004_at 10004_at NAALADL1    10004 -1.59148861 4.146636 -4.540551 8.009259e-05 0.0006956372
                 B
10_at     1.888538
100_at   -4.365374
1000_at  -6.246405
10000_at -5.534261
10001_at  2.584195
10004_at  1.473780

> dim(top2)
[1] 9823    9

But then an error message returned:

Error in `$<-.data.frame`(`*tmp*`, "idx", value = c(FALSE, FALSE, FALSE,  : 

  replacement has 9824 rows, data has 9823 

So in this case what is the problem ? Or am i interpreting wrong the arguments of barcode function??

 

Thank you for your consideration on this matter !!!

Efstathios

ADD COMMENTlink modified 24 months ago by Gordon Smyth32k • written 24 months ago by svlachavas570
2
gravatar for Gordon Smyth
24 months ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

When you run mroast() on a large number of sets, you need to set 'nrot' to a larger number of rotations than the default. Otherwise you will get a lot of p-values that are all equal (in your case, on the minimum possible value of 0.001). Or alternatively you could try fry() instead of mroast().

The output from mroast() is already ordered, so you don't need to reorder it yourself. See the 'sort' argument to mroast().

When you run barcodeplot(), the 'index' is a vector of indices, just like for roast(). An appropriate call to barcodeplot() might be:

barcodeplot( fit$t[,2], index=indices2[["KEGG_DNA_REPLICATION"]] )

where 'fit' is the object from eBayes().

ADD COMMENTlink modified 24 months ago • written 24 months ago by Gordon Smyth32k

Dear Gordon,

thank you for your help !! Based on your comments, i have some small further questions :

1) Regarding the number of rotations-based on your experience, should i use something even nrot=10,000 or something similar ? This should be enough regarding the large number of sets ? Or something smaller ?

2) Secondly, regarding the argument "sort" of mroast- i checked it and the default value is "directional"--thus, the returned resulted are returned ordered by the two-sided directional p-value ? And essentially, this p-value is associated with the outcome of the direction of specific gene-set enriched-and finally characterized as "up" or "down" in the output of mroast ?

3) Maybe you meant instead of the output of lmFit the output of eBayes ? Because the latter includes the t-statistics---Nevertheless, the topTable top2 i mentioned above was created with the argument sort.by="none". When i used your above function to test it:

barcodeplot( fit2$t[,2], index=indices2[["KEGG_DNA_REPLICATION"]] )

It runned without an error

4) My final question is about the interpretation of the barcodeplot (below link from my dropbox account)

https://www.dropbox.com/s/zkzbc7d0d28ibhx/image_trial_Mroast_output.png?dl=0

Briefly speaking, how can i associate the visualized enrichment score(i.e. 5.5) with the running t-statistics, for the specific given pathway ? For instance the enrichment score is higher for my up-regulated genes(positive t-statistics) found enriched in the KEGG pathway DNA_Replication ? which also correlated with the Up direction of the mroast output ? Or it has a different explanation ?

Thank you for your time and patience !!

Efstathios

ADD REPLYlink modified 24 months ago • written 24 months ago by svlachavas570
1

At least nrot=10000. The more the better, as many as you have time to run.

ADD REPLYlink written 24 months ago by Gordon Smyth32k

Dear Gordon, thank you for your confirmation about nrot --thus, the bigger the number of rotations, the more "strigent" the returned p-values ? Also, regarding the barcode plot, my approach is correct ?

ADD REPLYlink written 23 months ago by svlachavas570
1

More rotations doesn't make the p-values more stringent, it makes them more accurate, especially for small p-values. With N rotations, the smallest p-value you can get is 1/N, since p-values are computed by comparing to the empirical null distribution defined by the rotations, and there is no information about the null distribution past the last rotation.

ADD REPLYlink written 23 months ago by Ryan C. Thompson6.1k

Dear Ryan, thank you for your explanation and correction comment--Thus, with a bigger number of rotations, the p-values of the top results-especially the small ones are more accurate and "valid" to further take into consideration, regarding the enrichment results ? And further, i can consider these for instance with an FDR cutoff les than 0.05 or 0.01 ?

 

ADD REPLYlink written 23 months ago by svlachavas570
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 392 users visited in the last hour