Question

TopGO significant genes

1

Entering edit mode

Al-a ▴ 10

@al-a-14627

Last seen 4.0 years ago

Switzerland

Hi,

I am facing some issues while processing my RNAseq outputs through the TopGO package to determine which GOterms are enriched in the significantly modulated gene sets extracted from DESeq2. When running topGO, I am getting significant GO terms that are associated with 0 of my significant genes, using a p-value threshold of p.value.elim.ks <= 0.01. Is this related to the type of statistical test used ? Please see below the function I used:

Where GO.type = either MF / BP / CC (one of the 3 GO categories) counts.matrix.id is the output data.frame of DESeq2 (using results()) with rownames corresponding to Ensembl Gene IDs. anSig is the subset of the counts.matrix.id corresponding to the significantly modulated genes. backG is the background of genes to be used to compute enrichment

RunTopGO <- function(GO.type,
                     counts.matrix.id,
                     anSig,
                     backG){


 geneIDs = rownames(x = counts.matrix.id)

  inUniverse = geneIDs %in% c(anSig$ENSEMBL, backG) 

  inSelection =  geneIDs %in% anSig$ENSEMBL 

  alg <- factor(x = as.integer(inSelection[inUniverse] ) )

  names(x = alg) <- geneIDs[inUniverse]

  tgd <- new("topGOdata",
             ontology = GO.type,
             allGenes = alg,
             nodeSize = 5,
             annot = annFUN.org,
             mapping ="org.Mm.eg.db",
             ID = "ensembl" )

  ## run tests
  resultTopGO.elim.fisher <- runTest(tgd,
                                     algorithm = "elim",
                                     statistic = "fisher" )
  resultTopGO.elim.ks <- runTest(tgd,
                                 algorithm = "elim",
                                 statistic = "ks" )
  resultTopGO.classic.fisher <- runTest(tgd,
                                        algorithm = "classic",
                                        statistic = "fisher" )
  resultTopGO.classic.ks <- runTest(tgd,
                                    algorithm = "classic",
                                    statistic = "ks" )
  ## look at results
  tab <- GenTable(tgd, Fisher.elim = resultTopGO.elim.fisher, 
                  Fisher.classic = resultTopGO.classic.fisher,
                  ks.elim = resultTopGO.elim.ks,
                  ks.classic = resultTopGO.classic.ks,
                  orderBy = "ks.elim" ,
                  topNodes = 200)

  results.list <- list(resultTopGO.elim.fisher = resultTopGO.elim.fisher,
                       resultTopGO.elim.ks = resultTopGO.elim.ks,
                       resultTopGO.classic.fisher = resultTopGO.classic.fisher,
                       resultTopGO.classic.ks = resultTopGO.classic.ks)

  data.list <- mapply(GO.data = results.list,
                      col.name = names(results.list),
                      FUN = function(GO.data,
                                     col.name) { 
                        d <- data.frame(topGO::score(GO.data))
                        colnames(d) <- gsub(pattern = "resultTopGO.",
                                            replacement = "p.value.",
                                            x = col.name)
                        d$GO.ID <- rownames(x = d)
                        d
                      },
                      SIMPLIFY = FALSE,
                      USE.NAMES = TRUE)

  data.GO <- Reduce(function(x, y) merge(x = x,
                                         y =  y,
                                         all=TRUE,
                                         by = "GO.ID"),
                    data.list)

  data.GO <- merge(x = data.GO,
                   y = tab,
                   by = "GO.ID",
                   all.y = T)

  df1 <- data.GO[data.GO$p.value.elim.ks < 0.01, ]
  df1$Term.unique <- paste(df1$Term,
                           df1$GO.ID,
                           sep = "_")
  ##

  plotdf1 <- ggplot(data = df1,
                    aes(y= -log10(p.value.elim.ks),
                        x = Term.unique,
                        fill = -log10(p.value.elim.ks))) +
    geom_bar(stat = "identity",
             color = "black", size = 0.1) +
    coord_flip() +
    scale_x_discrete(limits = df1$Term.unique[order(df1$p.value.elim.ks)]) +
    theme(aspect.ratio = 1,
          panel.background = element_rect(fill = "lightgrey")) + 
    labs(x = "") +
    scale_fill_gradientn(colours = viridis::magma(5)[-1])


  return(list(tgd = tgd,
              GO.type.df = df1,
              plot = plotdf1,
              tab = tab))
}

Hereafter is an example of the type of result I get :

GO.ID                                        Term Annotated Significant Expected Rank in Fisher.classic Fisher.elim   Fisher.classic ks.elim ks.classic
1 GO:0003724                       RNA helicase activity        17           0     1.99                    510       1.000         1.0000 8.5e-09    8.5e-09
2 GO:0008017                         microtubule binding        47           1     5.51                    506       0.997         0.9973 1.1e-07    1.1e-07
3 GO:0020037                                heme binding        14           1     1.64                    421       0.826         0.8259 1.6e-05    1.6e-05
4 GO:0003743      translation initiation factor activity         8           0     0.94                    511       1.000       1.0000 8.0e-05    8.0e-05
5 GO:0016705 oxidoreductase activity, acting on paire...        24           0     2.81                    512       1.000         1.0000 0.00012    0.00012
6 GO:0003735          structural constituent of ribosome        36           0     4.22                    513       1.000        1.0000 0.00016    0.00016

You can see from the first GO.ID, which is highly significant from the ks elim computation (8.5e-09), that there are 0 genes falling in the "Significant" column... Thanks in advance for the inputs. Best,

Al.

> sessionInfo() 
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.0.0             RColorBrewer_1.1-2          gplots_3.0.3                geneplotter_1.64.0         
 [5] annotate_1.64.0             XML_3.99-0.3                lattice_0.20-40             LSD_4.0-0                  
 [9] Rgraphviz_2.30.0            EDASeq_2.20.0               ShortRead_1.44.3            GenomicAlignments_1.22.1   
[13] Rsamtools_2.2.3             Biostrings_2.54.0           XVector_0.26.0              fdrtool_1.2.15             
[17] topGO_2.38.1                SparseM_1.78                GO.db_3.10.0                graph_1.64.0               
[21] org.Mm.eg.db_3.10.0         AnnotationDbi_1.48.0        biomaRt_2.42.1              genefilter_1.68.0          
[25] plyr_1.8.6                  stringr_1.4.0               dplyr_0.8.5                 ggrepel_0.8.2              
[29] ggplot2_3.3.0               edgeR_3.28.1                limma_3.42.2                DESeq2_1.26.0              
[33] SummarizedExperiment_1.16.1 DelayedArray_0.12.2         BiocParallel_1.20.1         matrixStats_0.56.0         
[37] Biobase_2.46.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.1         IRanges_2.20.2             
[41] S4Vectors_0.24.3            BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] colorspace_1.4-1       hwriter_1.3.2          ellipsis_0.3.0         htmlTable_1.13.3       base64enc_0.1-3       
 [6] rstudioapi_0.11        farver_2.0.3           bit64_0.9-7            fansi_0.4.1            splines_3.6.3         
[11] R.methodsS3_1.8.0      DESeq_1.38.0           knitr_1.28             Formula_1.2-3          cluster_2.1.0         
[16] dbplyr_1.4.2           png_0.1-7              R.oo_1.23.0            compiler_3.6.3         httr_1.4.1            
[21] backports_1.1.5        assertthat_0.2.1       Matrix_1.2-18          cli_2.0.2              acepack_1.4.1         
[26] htmltools_0.4.0        prettyunits_1.1.1      tools_3.6.3            gtable_0.3.0           glue_1.3.2            
[31] GenomeInfoDbData_1.2.2 rappdirs_0.3.1         Rcpp_1.0.4             vctrs_0.2.4            gdata_2.18.0          
[36] rtracklayer_1.46.0     xfun_0.12              lifecycle_0.2.0        gtools_3.8.2           zlibbioc_1.32.0       
[41] scales_1.1.0           aroma.light_3.16.0     hms_0.5.3              curl_4.3               memoise_1.1.0         
[46] gridExtra_2.3          rpart_4.1-15           latticeExtra_0.6-29    stringi_1.4.6          RSQLite_2.2.0         
[51] highr_0.8              checkmate_2.0.0        caTools_1.18.0         GenomicFeatures_1.38.2 rlang_0.4.5           
[56] pkgconfig_2.0.3        bitops_1.0-6           evaluate_0.14          purrr_0.3.3            labeling_0.3          
[61] htmlwidgets_1.5.1      bit_1.1-15.2           tidyselect_1.0.0       magrittr_1.5           R6_2.4.1              
[66] Hmisc_4.4-0            DBI_1.1.0              pillar_1.4.3           foreign_0.8-76         withr_2.1.2           
[71] survival_3.1-11        RCurl_1.98-1.1         nnet_7.3-13            tibble_3.0.0           crayon_1.3.4          
[76] KernSmooth_2.23-16     BiocFileCache_1.10.2   rmarkdown_2.1          viridis_0.5.1          jpeg_0.1-8.1          
[81] progress_1.2.2         locfit_1.5-9.4         data.table_1.12.8      blob_1.2.1             digest_0.6.25         
[86] xtable_1.8-4           R.utils_2.9.2          openssl_1.4.1          munsell_0.5.0          viridisLite_0.3.0     
[91] askpass_1.1

TopGO GO • 1.5k views

ADD COMMENT • link 4.0 years ago Al-a ▴ 10

score 0 · Answer 1 · 2020-04-22

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 12 hours ago

United States

Why do you think there are no significant genes for the first row? It would make absolutely no sense for that to be true, so the first thing you should do is look at your results again, more carefully. For example, here are the first four columns for the header and first row.

GO.ID            Term Annotated        Significant   Expected 
GO:0003724     RNA helicase activity        17             0

ADD COMMENT • link 4.0 years ago James W. MacDonald 65k

0

Entering edit mode

    colnames(df$MF$tab)
 [1] "GO.ID"                  "Term"                   "Annotated"              "Significant"           
 [5] "Expected"               "Rank in Fisher.classic" "Fisher.elim"            "Fisher.classic"        
 [9] "ks.elim"                "ks.classic"

Term, Annotatedand Significant are 3 different columns. If I got it correctly, Term is the GOterm name, Annotated corresponds to the number of genes associated to this given GOterm, and Significant to the number of genes from my DESeq2 input that are within this Annotated gene-set.

So indeed yes, it makes no sense that there are no significant genes associated to the significantly enriched GOterms, which is why I am opening this thread...

ADD REPLY • link 4.0 years ago Al-a ▴ 10

0

Entering edit mode

The KS test is supposed to be testing for enrichment, but evidently not. Even the example in the vignette shows that (the first row in Table 2 shows a non-significant Fisher test and significant KS test for under-representation).

The author chocks that up to the KS test being less conservative, but my interpretation would be different?

ADD REPLY • link 4.0 years ago James W. MacDonald 65k