Question

Why do I get different GO term sizes using the same gene universe for 2 analyses with GOstats?

0

Entering edit mode

tlaguna ▴ 10

@tlaguna-12370

Last seen 7.2 years ago

IMB Mainz

Checking different results of hypergeometric tests using GOstats, I found that, surprisingly, using the same gene universe for 2 different gene enriched sets, there are different number of GO term sizes for the same GO term:

An example

Gene set enr: 1480

GOBPID	Pvalue	OddsRatio	ExpCount	Count	Size	Term
GO:0098609	0.0000012	2.2956084	38	64	149	cell-cell adhesion

Gene set enr: 68

GOBPID	Pvalue	OddsRatio	ExpCount	Count	Size	Term
GO:0098609	0.0005064	3.5353535	4	12	342	cell-cell adhesion

GENE UNIVERSE (same for both analyses): 5811

Why the "Size" (149 vs. 342) is different between both analyses?

Here it is the code:

GOmodes <- c("BP", "MF", "CC")
param_list <- list()
params <- NULL
for (i in seq_along(GOmodes)){
  params <- new("GOHyperGParams", geneIds=entrez_geneSet_1, universeGeneIds=entrezgeneUniverse,
                ontology=GOmodes[i], annotation="org.Mm.eg.db", pvalueCutoff=0.01, conditional=TRUE,
                testDirection="over")
  param_list[[i]] <- params
}

# run the test in a loop
GO.results.bp_1 <- lapply(param_list, hyperGTest)
GO.results.bp_1  

sapply(GO.results.bp_1, htmlReport, file=file.path(out_dir, "GO.results_1.html"),
       append=T, digits=7)

param_list <- list()
params <- NULL
for (i in seq_along(GOmodes)){
  params <- new("GOHyperGParams", geneIds=entrez_geneSet_2, universeGeneIds=entrezgeneUniverse,
                ontology=GOmodes[i], annotation="org.Mm.eg.db", pvalueCutoff=0.01, conditional=TRUE,
                testDirection="over")
  param_list[[i]] <- params
}

# run the test in a loop
GO.results.bp_2 <- lapply(param_list, hyperGTest)
GO.results.bp_2  

sapply(GO.results.bp_2, htmlReport, file=file.path(out_dir, "GO.results_2.html"),
       append=T, digits=7)

When I check the results in the html file or check them by

summary(GO.results.bp_1) 

summary(GO.results.bp_2)

I find the results commented before.

sessionInfo():

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] stats4    grid      parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Mm.eg.db_3.4.0     GOstats_2.40.0         graph_1.52.0           Category_2.40.0        Matrix_1.2-8          
 [6] AnnotationDbi_1.36.2   IRanges_2.8.1          S4Vectors_0.12.1       pcaMethods_1.66.0      Biobase_2.34.0        
[11] BiocGenerics_0.20.0    gplots_3.0.1           VennDiagram_1.6.17     futile.logger_1.4.3    biomaRt_2.30.0        
[16] rMQanalysis_0.3.1.9001 RColorBrewer_1.1-2     ggrepel_0.6.5          ggplot2_2.2.1          reshape2_1.4.2        
[21] plyr_1.8.4            

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9            lattice_0.20-34        GO.db_3.4.0            tidyr_0.6.1            zoo_1.7-14            
 [6] gtools_3.5.0           assertthat_0.1         digest_0.6.12          R6_2.2.0               futile.options_1.0.0  
[11] RSQLite_1.1-2          httr_1.2.1             lazyeval_0.2.0         data.table_1.10.4      annotate_1.52.1       
[16] gdata_2.17.0           splines_3.3.2          stringr_1.1.0          htmlwidgets_0.8        RCurl_1.95-4.8        
[21] munsell_0.4.3          base64enc_0.1-3        htmltools_0.3.5        tibble_1.2             XML_3.98-1.5          
[26] AnnotationForge_1.16.1 viridisLite_0.1.3      dplyr_0.5.0            bitops_1.0-6           RBGL_1.50.0           
[31] jsonlite_1.2           xtable_1.8-2           GSEABase_1.36.0        gtable_0.2.0           DBI_0.5-1             
[36] magrittr_1.5           scales_0.4.1           KernSmooth_2.23-15     stringi_1.1.2          genefilter_1.56.0     
[41] sp_1.2-4               splitstackshape_1.4.2  lambda.r_1.1.9         tools_3.3.2            logspline_2.1.9       
[46] purrr_0.2.2            survival_2.40-1        colorspace_1.3-2       caTools_1.17.1         memoise_1.0.0         
[51] plotly_4.5.6

gostats hypergeometric test • 1.5k views

ADD COMMENT • link updated 7.2 years ago by James W. MacDonald 65k • written 7.2 years ago by tlaguna ▴ 10

score 4 · Answer 1 · 2017-02-17

Please note that the GO ontology is a directed acyclic graph (DAG), which operates like a dendrogram, where lower terms are combined into higher terms. So if you have a parent term that has two child terms, then all the genes that are annotated to the two child terms are also by default annotated to the parent term. So if you have a (fake) term called 'cellular signaling' and it has two child terms (also fake!) called 'positive cellular signaling' and 'negative cellular signaling', then ALL the genes that are considered to be 'negative cellular signaling' genes are also by default considered to be 'cellular signaling' genes as well. The same is true of 'positive cellular signaling' genes.

If you do a hypergeometric test for one of the child terms, and it is significant, then you could argue that testing at the parent level and still including all the genes that gave rise to a significant result in a child term would not be ideal, because the parent term might end up being significant too, but only because of the contribution from the genes that caused the child term to be significant. In general we want more significance at the lower branches of the DAG, because those are more specific terms that may imply perturbed pathways. But the higher terms are more broad and not as easily interpreted. For example, having a significant 'biological process' term is just boring and uninterpretable.

This is the basis of the 'conditional' hypergeometric test, which is what you use when you set conditional = TRUE (which is what you did). If a child term is significant, then the genes that gave rise to that significant result are eliminated from the test of the parent term. So the number of genes that are annotated to a particular GO term are dependent on how many of those genes were eliminated at lower levels of the GO DAG. If you set conditional = FALSE, the category size remains fixed and will be the same for both tests.