Question: Cannot find more than a 100 GO:CC ids on org.Hs.eg.db
0
4 weeks ago by
brunogiotti0 wrote:

Hello there,

I have been trying to retreive some GO:CC ids from a previous pathway enrichment analysis using the gProfileR package. The reason for this is that I would like to have the fraction of the genes found from the enrichment analysis per GO:CC term calculated by diving the genes found in the GO:CC term by tot genes annotated in the GO:CC term. However, I get a lot of GO_CC ids which do not have any matches in the org.Hs.egGO, or the org.Hs.egGO2EG bimap objects. These are the following:

gocc_notfound <- c("GO:0031974", "GO:0150005", "GO:0045293", "GO:0120114", "GO:1905348", "GO:1902555", "GO:1905354", "GO:0106068", "GO:1902493", "GO:0061695","GO:0043226", "GO:0043228", "GO:0044422", "GO:0043233", "GO:0044464","GO:0044424", "GO:0031248", "GO:0033202", "GO:0030880", "GO:0000428","GO:0043232", "GO:0044446", "GO:0044427", "GO:0098687", "GO:0070013", "GO:0044428", "GO:0030689", "GO:0071010", "GO:0090576", "GO:0097526",
"GO:0044451", "GO:0090568", "GO:1902562", "GO:0043189", "GO:0070775", "GO:0044452", "GO:0055029", "GO:0044454", "GO:0070603", "GO:0090545", "GO:0097346", "GO:0031010", "GO:0044665", "GO:0031414", "GO:0032806","GO:0098862", "GO:0044440", "GO:1903293", "GO:0099080", "GO:0099081", "GO:0098647", "GO:0097458", "GO:0098984", "GO:0044456", "GO:0090665", "GO:0044430", "GO:0099513", "GO:0097517", "GO:0044449", "GO:0036379", "GO:0030427", "GO:0044463", "GO:0120038", "GO:0033267", "GO:0044459", "GO:0098590", "GO:0044448", "GO:0070161", "GO:0044421", "GO:0044420",
"GO:1905360", "GO:1990351", "GO:0031975", "GO:0098805", "GO:0044425", "GO:0098552", "GO:0098589", "GO:0043230", "GO:0031984", "GO:0031967", "GO:0098588", "GO:0031968", "GO:0031300", "GO:0098796", "GO:0098803", "GO:0045271", "GO:0098533", "GO:0098802", "GO:1905961", "GO:0044444", "GO:0048770", "GO:0044433", "GO:0099501", "GO:0042579", "GO:0044438", "GO:0031903", "GO:0044439", "GO:0044437", "GO:0000323", "GO:0098852", "GO:0005766", "GO:0044429", "GO:0098798", "GO:0044455", "GO:0098573", "GO:0044432", "GO:0098827", "GO:0031211", "GO:0071256", "GO:0044431",
"GO:0098791", "GO:0098800", "GO:0044453", "GO:0031229", "GO:0098563", "GO:0048475", "GO:0036452", "GO:0120111", "GO:0031082", "GO:0031907", "GO:0031970", "GO:0030062", "GO:0045240", "GO:0045283", "GO:0045275", "GO:0045281", "GO:0045257", "GO:0000313", "GO:0044391", "GO:0000314", "GO:0000315", "GO:0031332", "GO:0031983", "GO:0070993", "GO:0044445","GO:0101002", "GO:0005775", "GO:0044215", "GO:0018995", "GO:0044217", "GO:0044815", "GO:0032155","GO:0043073", "GO:0099086", "GO:0042575", "GO:0044450", "GO:0097223", "GO:0044441", "GO:0044447", "GO:0036454", "GO:0034385", "GO:0044279", "GO:0032997", "GO:0098644", "GO:0098643", "GO:0000931", "GO:0070022", "GO:0044797", "GO:0034753", "GO:0061618", "GO:0070288")
mget (gocc_notfound, org.Hs.egGO2EG)



Am I doing something wrong? Thanks in advance!

Bruno

I tried this using two versions of R and packages here is the session info for both:

R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.10 (Final)

Matrix products: default
BLAS: /env/export/v_info/q_soft/ig/r-3.5.2/el6-x86_64-generic/lib64/R/lib/libRblas.so
LAPACK: /env/export/v_info/q_soft/ig/r-3.5.2/el6-x86_64-generic/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8        LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      parallel  stats4    stats     graphics  grDevices utils
[8] datasets  methods   base

other attached packages:
[1] GO.db_3.7.0          gtools_3.8.1         org.Hs.eg.db_3.7.0
[4] plotrix_3.7-5        igraph_1.2.4.1       doParallel_1.0.14
[7] iterators_1.0.10     foreach_1.4.4        Hmisc_4.2-0
[10] Formula_1.2-3        survival_2.43-3      lattice_0.20-38
[13] ANF_1.4.0            gProfileR_0.6.7      openxlsx_4.1.0
[16] genefilter_1.64.0    gridExtra_2.3        plyr_1.8.4
[19] reactome.db_1.66.0   ComplexHeatmap_2.1.0 RColorBrewer_1.1-2
[22] gplots_3.0.1.1       ggplot2_3.2.0        AnnotationDbi_1.44.0
[25] IRanges_2.16.0       S4Vectors_0.20.1     Biobase_2.42.0
[28] BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] bitops_1.0-6        bit64_0.9-7         tools_3.5.2
[4] backports_1.1.3     R6_2.4.0            rpart_4.1-13
[7] KernSmooth_2.23-15  DBI_1.0.0           lazyeval_0.2.1
[10] colorspace_1.4-1    nnet_7.3-12         GetoptLong_0.1.7
[13] withr_2.1.2         tidyselect_0.2.5    bit_1.1-14
[16] compiler_3.5.2      htmlTable_1.13.1    caTools_1.17.1.1
[19] scales_1.0.0        checkmate_1.9.0     stringr_1.3.1
[22] digest_0.6.19       foreign_0.8-71      base64enc_0.1-3
[25] pkgconfig_2.0.2     htmltools_0.3.6     htmlwidgets_1.3
[28] rlang_0.3.4         GlobalOptions_0.1.0 rstudioapi_0.9.0
[31] RSQLite_2.1.1       shape_1.4.4         bindr_0.1.1
[34] acepack_1.4.1       dplyr_0.7.8         zip_1.0.0
[37] RCurl_1.95-4.11     magrittr_1.5        Matrix_1.2-15
[40] Rcpp_1.0.1          munsell_0.5.0       stringi_1.2.4
[43] MASS_7.3-51.1       blob_1.1.1          gdata_2.18.0
[46] crayon_1.3.4        splines_3.5.2       annotate_1.60.1
[49] circlize_0.4.6      knitr_1.21          pillar_1.3.1
[52] rjson_0.2.20        codetools_0.2-16    XML_3.98-1.16
[55] glue_1.3.0          latticeExtra_0.6-28 data.table_1.11.8
[58] BiocManager_1.30.4  png_0.1-7           gtable_0.2.0
[61] purrr_0.2.5         clue_0.3-57         assertthat_0.2.0
[64] xfun_0.4            xtable_1.8-4        tibble_2.1.3
[67] memoise_1.1.0       bindrcpp_0.2.2      cluster_2.0.7-1


sessionInfo of other R:

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8        LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] org.Hs.eg.db_3.5.0   AnnotationDbi_1.40.0 IRanges_2.12.0
[4] S4Vectors_0.16.0     Biobase_2.38.0       BiocGenerics_0.24.0
[7] BiocInstaller_1.28.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1      digest_0.6.19   DBI_1.0.0       RSQLite_2.1.1
[5] blob_1.1.1      tools_3.4.4     bit64_0.9-7     bit_1.1-14
[9] compiler_3.4.4  pkgconfig_2.0.2 memoise_1.1.0

modified 4 weeks ago by James W. MacDonald50k • written 4 weeks ago by brunogiotti0
Answer: Cannot find more than a 100 GO:CC ids on org.Hs.eg.db
1
4 weeks ago by
United States
James W. MacDonald50k wrote:

Two problems. First, it appears that set of GO terms are (from a quick look at the AmiGO site, which is something you should have already tried) from rat, not human. As an example, GO:0031974

And second, when you do a GO analysis you aren't just using the direct mappings of a GO term to its associated gene. You are using that GO term and any progeny terms.

> zz <- mget(gocc_notfound, org.Rn.egGO2ALLEGS, ifnotfound = NA)
> table(sapply(zz, length))

1     4     5     6     7     8     9    12    13    14    16    17    18
6     1     1     2     3     1     1     1     1     1     4     2     1
19    20    22    23    24    26    27    29    30    31    32    33    36
1     3     2     3     2     1     1     1     1     1     2     1     2
37    41    44    45    46    48    49    51    53    55    58    80    81
1     2     2     1     2     3     1     3     1     1     1     1     1
91    93    94   107   110   115   118   125   131   133   138   139   142
1     1     1     2     1     1     1     1     1     3     1     1     1
158   167   179   180   183   195   222   231   232   235   242   275   296
1     1     1     1     1     2     1     1     1     1     1     1     1
302   313   316   319   328   340   351   383   384   392   414   439   506
1     1     1     1     1     1     1     1     1     1     1     1     1
528   529   533   549   555   610   649   696   702   716   768   817   858
1     1     1     1     1     1     1     1     1     1     1     1     1
898   974  1014  1022  1175  1203  1393  1433  1434  1499  1742  1806  1807
1     1     1     1     1     1     1     1     1     1     1     1     1
1976  2157  2171  2203  2244  2267  2766  2767  2880  2977  3737  4992  7482
2     1     1     1     1     1     1     1     1     1     1     1     1
7483  7837  7873  7914 12190 15140 15927 17001 25777 29875 37101
2     1     1     1     1     1     1     1     1     1     1

> sum(sapply(zz, function(x) allis.na(x))))
[1] 5
>


Two other things. GO, as with all annotations, is a moving target and changes relatively rapidly. Using versions of R from one or two years ago will ensure that you have the GO mappings that were extant at those times, not the current versions. There is no cost to keeping your R/Bioc installation up to date, aside from a bit of your time, so you should do so.

Second, mget is an old way of getting data, based on the time (like ten or more years ago) that we were using environments to store the annotation data. We don't do that any more, and the API has progressed. The current way to do what you want is to use either select or mapIds. To emulate what I did above, we can do


> zz2 <- mapIds(org.Rn.eg.db, gocc_notfound, "ENTREZID", "GOALL", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> table(sapply(zz2, length))

1     4     5     6     7     8     9    12    13    14    16    17    18
6     1     1     2     3     1     1     1     1     1     4     2     1
19    20    22    23    24    26    27    29    30    31    32    33    36
1     3     2     3     2     1     1     1     1     1     2     1     2
37    41    44    45    46    48    49    51    53    55    58    80    81
1     2     2     1     2     3     1     3     1     1     1     1     1
91    93    94   107   110   115   118   125   131   133   138   139   142
1     1     1     2     1     1     1     1     1     3     1     1     1
158   167   179   180   183   195   222   231   232   235   242   275   296
1     1     1     1     1     2     1     1     1     1     1     1     1
302   313   316   319   328   340   351   383   384   392   414   439   506
1     1     1     1     1     1     1     1     1     1     1     1     1
528   529   533   549   555   610   649   696   702   716   768   817   858
1     1     1     1     1     1     1     1     1     1     1     1     1
898   974  1014  1022  1175  1203  1393  1433  1434  1499  1742  1806  1807
1     1     1     1     1     1     1     1     1     1     1     1     1
1976  2157  2171  2203  2244  2267  2766  2767  2880  2977  3737  4992  7482
2     1     1     1     1     1     1     1     1     1     1     1     1
7483  7837  7873  7914 12190 15140 15927 17001 25777 29875 37101
2     1     1     1     1     1     1     1     1     1     1
> sum(sapply(zz2, function(x) allis.na(x))))
[1] 5



There is a vignette for the AnnotationDbi package that you could read to get up to speed.

Hi James,

Thanks very much for the prompt answer. Well i did have a quick look at those ids and GO:0031974 has 172,696 annotations on amiGO, among which i also find human. For your second point, I am aware of this, but since I have a significant enrichment for GO:0031974, for example, I would like to get the size of that term (in terms of number of genes annotated to that specific GOterm) from the org.Hs.eg.db for a following analysis. I also thought that, even if i work with a non-top notch R version and packages (working on HPC so not able to update that) that is unlikely to account for so many mismatches.

Thanks again, and waiting for some additional enlightenment.

Bruno

1

Good point. I didn't look closely at the number of terms at the top. However, I just showed you something that you could have tried rather than asking me to elaborate.

> zz <- mapIds(org.Hs.eg.db, gocc_notfound, "ENTREZID","GOALL", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> table(sapply(zz, length))

1     3     4     5     6     9    10    11    13    15    16    17    18
4     3     1     2     1     7     3     1     1     2     1     3     2
19    22    25    27    28    29    30    31    36    38    40    43    44
1     3     2     1     1     1     2     3     5     1     2     1     1
47    48    49    55    59    60    64    70    72    78    89    92    99
1     2     1     2     1     1     1     1     1     1     1     1     1
103   104   109   122   125   127   144   146   155   161   162   168   169
1     1     1     1     1     1     2     1     1     3     2     1     1
170   175   180   185   195   214   222   224   265   268   273   277   281
1     3     1     1     1     1     1     1     1     1     1     1     1
282   343   355   381   391   396   397   421   429   440   485   504   519
1     1     1     1     1     1     1     1     1     1     1     1     1
602   619   643   668   683   803   809   894  1072  1160  1161  1229  1260
1     1     1     1     1     1     1     1     1     1     1     1     1
1390  1396  1443  1462  1747  1759  1762  1774  1824  1831  2243  2282  2295
1     1     1     1     1     2     1     1     2     1     1     1     1
2324  2327  2917  4130  4808  6849  6859  7982  8446  9991 16603 17271 17825
1     1     1     1     1     1     1     1     3     1     1     1     1
28076 31748 39049
1     1     1
> sum(sapply(zz, function(x) allis.na(x))))
[1] 3



Also, the error message you get when using the GO table isn't that enlightening, but it's a general message. What it is telling you is that there are no genes that are directly annotated to those terms (e.g., there are genes that are annotated to those GO terms, but by virtue of being annotated directly to progeny terms, and due to the DAG structure being inherited upwards). To show this we can dig around in the guts of the underlying DB:

## get the connection to the database
> con <- org.Hs.eg_dbconn()
## need DBI for this part
> library(DBI)
> dbListTables(con)
[1] "accessions"            "alias"                 "chrlengths"
[4] "chromosome_locations"  "chromosomes"           "cytogenetic_locations"
[7] "ec"                    "ensembl"               "ensembl2ncbi"
[10] "ensembl_prot"          "ensembl_trans"         "gene_info"
[13] "genes"                 "go"                    "go_all"
[16] "go_bp"                 "go_bp_all"             "go_cc"
[19] "go_cc_all"             "go_mf"                 "go_mf_all"
[28] "pfam"                  "prosite"               "pubmed"
[31] "refseq"                "sqlite_stat1"          "sqlite_stat4"
[34] "ucsc"                  "unigene"               "uniprot"

## let's look at what is in the go_cc_all and go_cc tables
> dbGetQuery(con, "select * from go_cc_all limit 5;")
_id      go_id evidence
1    1 GO:0005575      HDA
2    1 GO:0005575      IDA
3    1 GO:0005575      TAS
4    1 GO:0005576      HDA
5    1 GO:0005576      IDA
> dbGetQuery(con, "select * from go_cc limit 5;")
_id      go_id evidence
1    1 GO:0005576      HDA
2    1 GO:0005576      IDA
3    1 GO:0005576      TAS
4    1 GO:0005615      HDA
5    1 GO:0031093      TAS
## so identical structure. Let's query the go_cc table, which is direct GO -> EGID mappings
> z.sql <- dbGetQuery(con, paste("select * from go_cc where go_id in ('", paste(gocc_notfound, collapse = "','"), "');"))
> z.sql
[1] _id      go_id    evidence
<0 rows> (or 0-length row.names)
## or in a more R-like way
> all.dat <- dbGetQuery(con, "select go_id from go_cc;")
> sum(gocc_notfound %in% all.dat[,1])
[1] 0



So like I said, there are no direct GO -> EG mappings for that set of 161 GO terms, according to AmiGO, as of Oct 2018

> dbGetQuery(con, "select * from map_metadata where map_name='GO';")
map_name   source_name
1       GO Gene Ontology
source_url source_date
1 ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/  2018-Oct10