Hello there,
I have been trying to retreive some GO:CC ids from a previous pathway enrichment analysis using the gProfileR package. The reason for this is that I would like to have the fraction of the genes found from the enrichment analysis per GO:CC term calculated by diving the genes found in the GO:CC term by tot genes annotated in the GO:CC term. However, I get a lot of GO_CC ids which do not have any matches in the org.Hs.egGO, or the org.Hs.egGO2EG bimap objects. These are the following:
gocc_notfound <- c("GO:0031974", "GO:0150005", "GO:0045293", "GO:0120114", "GO:1905348", "GO:1902555", "GO:1905354", "GO:0106068", "GO:1902493", "GO:0061695","GO:0043226", "GO:0043228", "GO:0044422", "GO:0043233", "GO:0044464","GO:0044424", "GO:0031248", "GO:0033202", "GO:0030880", "GO:0000428","GO:0043232", "GO:0044446", "GO:0044427", "GO:0098687", "GO:0070013", "GO:0044428", "GO:0030689", "GO:0071010", "GO:0090576", "GO:0097526",
"GO:0044451", "GO:0090568", "GO:1902562", "GO:0043189", "GO:0070775", "GO:0044452", "GO:0055029", "GO:0044454", "GO:0070603", "GO:0090545", "GO:0097346", "GO:0031010", "GO:0044665", "GO:0031414", "GO:0032806","GO:0098862", "GO:0044440", "GO:1903293", "GO:0099080", "GO:0099081", "GO:0098647", "GO:0097458", "GO:0098984", "GO:0044456", "GO:0090665", "GO:0044430", "GO:0099513", "GO:0097517", "GO:0044449", "GO:0036379", "GO:0030427", "GO:0044463", "GO:0120038", "GO:0033267", "GO:0044459", "GO:0098590", "GO:0044448", "GO:0070161", "GO:0044421", "GO:0044420",
"GO:1905360", "GO:1990351", "GO:0031975", "GO:0098805", "GO:0044425", "GO:0098552", "GO:0098589", "GO:0043230", "GO:0031984", "GO:0031967", "GO:0098588", "GO:0031968", "GO:0031300", "GO:0098796", "GO:0098803", "GO:0045271", "GO:0098533", "GO:0098802", "GO:1905961", "GO:0044444", "GO:0048770", "GO:0044433", "GO:0099501", "GO:0042579", "GO:0044438", "GO:0031903", "GO:0044439", "GO:0044437", "GO:0000323", "GO:0098852", "GO:0005766", "GO:0044429", "GO:0098798", "GO:0044455", "GO:0098573", "GO:0044432", "GO:0098827", "GO:0031211", "GO:0071256", "GO:0044431",
"GO:0098791", "GO:0098800", "GO:0044453", "GO:0031229", "GO:0098563", "GO:0048475", "GO:0036452", "GO:0120111", "GO:0031082", "GO:0031907", "GO:0031970", "GO:0030062", "GO:0045240", "GO:0045283", "GO:0045275", "GO:0045281", "GO:0045257", "GO:0000313", "GO:0044391", "GO:0000314", "GO:0000315", "GO:0031332", "GO:0031983", "GO:0070993", "GO:0044445","GO:0101002", "GO:0005775", "GO:0044215", "GO:0018995", "GO:0044217", "GO:0044815", "GO:0032155","GO:0043073", "GO:0099086", "GO:0042575", "GO:0044450", "GO:0097223", "GO:0044441", "GO:0044447", "GO:0036454", "GO:0034385", "GO:0044279", "GO:0032997", "GO:0098644", "GO:0098643", "GO:0000931", "GO:0070022", "GO:0044797", "GO:0034753", "GO:0061618", "GO:0070288")
mget (gocc_notfound, org.Hs.egGO2EG)
Am I doing something wrong? Thanks in advance!
Bruno
I tried this using two versions of R and packages here is the session info for both:
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.10 (Final)
Matrix products: default
BLAS: /env/export/v_info/q_soft/ig/r-3.5.2/el6-x86_64-generic/lib64/R/lib/libRblas.so
LAPACK: /env/export/v_info/q_soft/ig/r-3.5.2/el6-x86_64-generic/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] GO.db_3.7.0 gtools_3.8.1 org.Hs.eg.db_3.7.0
[4] plotrix_3.7-5 igraph_1.2.4.1 doParallel_1.0.14
[7] iterators_1.0.10 foreach_1.4.4 Hmisc_4.2-0
[10] Formula_1.2-3 survival_2.43-3 lattice_0.20-38
[13] ANF_1.4.0 gProfileR_0.6.7 openxlsx_4.1.0
[16] genefilter_1.64.0 gridExtra_2.3 plyr_1.8.4
[19] reactome.db_1.66.0 ComplexHeatmap_2.1.0 RColorBrewer_1.1-2
[22] gplots_3.0.1.1 ggplot2_3.2.0 AnnotationDbi_1.44.0
[25] IRanges_2.16.0 S4Vectors_0.20.1 Biobase_2.42.0
[28] BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 tools_3.5.2
[4] backports_1.1.3 R6_2.4.0 rpart_4.1-13
[7] KernSmooth_2.23-15 DBI_1.0.0 lazyeval_0.2.1
[10] colorspace_1.4-1 nnet_7.3-12 GetoptLong_0.1.7
[13] withr_2.1.2 tidyselect_0.2.5 bit_1.1-14
[16] compiler_3.5.2 htmlTable_1.13.1 caTools_1.17.1.1
[19] scales_1.0.0 checkmate_1.9.0 stringr_1.3.1
[22] digest_0.6.19 foreign_0.8-71 base64enc_0.1-3
[25] pkgconfig_2.0.2 htmltools_0.3.6 htmlwidgets_1.3
[28] rlang_0.3.4 GlobalOptions_0.1.0 rstudioapi_0.9.0
[31] RSQLite_2.1.1 shape_1.4.4 bindr_0.1.1
[34] acepack_1.4.1 dplyr_0.7.8 zip_1.0.0
[37] RCurl_1.95-4.11 magrittr_1.5 Matrix_1.2-15
[40] Rcpp_1.0.1 munsell_0.5.0 stringi_1.2.4
[43] MASS_7.3-51.1 blob_1.1.1 gdata_2.18.0
[46] crayon_1.3.4 splines_3.5.2 annotate_1.60.1
[49] circlize_0.4.6 knitr_1.21 pillar_1.3.1
[52] rjson_0.2.20 codetools_0.2-16 XML_3.98-1.16
[55] glue_1.3.0 latticeExtra_0.6-28 data.table_1.11.8
[58] BiocManager_1.30.4 png_0.1-7 gtable_0.2.0
[61] purrr_0.2.5 clue_0.3-57 assertthat_0.2.0
[64] xfun_0.4 xtable_1.8-4 tibble_2.1.3
[67] memoise_1.1.0 bindrcpp_0.2.2 cluster_2.0.7-1
sessionInfo of other R:
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] org.Hs.eg.db_3.5.0 AnnotationDbi_1.40.0 IRanges_2.12.0
[4] S4Vectors_0.16.0 Biobase_2.38.0 BiocGenerics_0.24.0
[7] BiocInstaller_1.28.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 digest_0.6.19 DBI_1.0.0 RSQLite_2.1.1
[5] blob_1.1.1 tools_3.4.4 bit64_0.9-7 bit_1.1-14
[9] compiler_3.4.4 pkgconfig_2.0.2 memoise_1.1.0
Two other things. GO, as with all annotations, is a moving target and changes relatively rapidly. Using versions of R from one or two years ago will ensure that you have the GO mappings that were extant at those times, not the current versions. There is no cost to keeping your R/Bioc installation up to date, aside from a bit of your time, so you should do so.
Second,
mget
is an old way of getting data, based on the time (like ten or more years ago) that we were using environments to store the annotation data. We don't do that any more, and the API has progressed. The current way to do what you want is to use eitherselect
ormapIds
. To emulate what I did above, we can doThere is a vignette for the AnnotationDbi package that you could read to get up to speed.
Hi James,
Thanks very much for the prompt answer. Well i did have a quick look at those ids and GO:0031974 has 172,696 annotations on amiGO, among which i also find human. For your second point, I am aware of this, but since I have a significant enrichment for GO:0031974, for example, I would like to get the size of that term (in terms of number of genes annotated to that specific GOterm) from the org.Hs.eg.db for a following analysis. I also thought that, even if i work with a non-top notch R version and packages (working on HPC so not able to update that) that is unlikely to account for so many mismatches.
Thanks again, and waiting for some additional enlightenment.
Bruno
Good point. I didn't look closely at the number of terms at the top. However, I just showed you something that you could have tried rather than asking me to elaborate.
Also, the error message you get when using the GO table isn't that enlightening, but it's a general message. What it is telling you is that there are no genes that are directly annotated to those terms (e.g., there are genes that are annotated to those GO terms, but by virtue of being annotated directly to progeny terms, and due to the DAG structure being inherited upwards). To show this we can dig around in the guts of the underlying DB:
So like I said, there are no direct GO -> EG mappings for that set of 161 GO terms, according to AmiGO, as of Oct 2018
All right! I got it now, I was a bit puzzled because I got those terms associated with genes in my GO enrichment analysis and because i could find gene entries in amiGO but in both cases they are not directly associated with them, but as you showed me, they are with progeny terms. Thanks a lot!