Updating org.Sc.sgd.db, and integration with topGO
Entering edit mode
Last seen 5 months ago
United States

How does the maintenance of org.Sc.sgd.db work? Is there a timeline for updating? I believe that the current s.serevisiae S288C_R64 annotation set is 3.1.


but the org.Sc.sgd.db is 2.1.

Somewhat related, I have been trying to test out a couple of the functional analysis suites -- EnrichmentBrowser and GeneTonic among them -- and have run into the same error with using org.Sc.sgd.db. The same error occurs in pcaExplorer::pca2go.

I believe each has topGO as a dependency, and I get the same error in topGO. So, my suspicion is that topGO somewhere requires SYMBOL, which is not a column of the s.cer org package:

> columns(org.Sc.sgd.db)
[10] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "INTERPRO"     "ONTOLOGY"     "ONTOLOGYALL"  "ORF"          "PATH"        
[19] "PFAM"         "PMID"         "REFSEQ"       "SGD"          "SMART"        "UNIPROT"

I haven't found a method in the topGO arguments to correctly re-direct whatever it is that SYMBOL is used for, and I have found that SYMBOL is actually hardcoded into some of these packages built on top of topGO (pcaExplorer, for example).

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pcaExplorer_2.20.1          topGO_2.46.0                SparseM_1.81                GO.db_3.14.0               
 [5] graph_1.72.0                here_1.0.1                  org.Sc.sgd.db_3.14.0        AnnotationDbi_1.56.2       
 [9] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.7                 purrr_0.3.4                
[13] readr_2.1.2                 tidyr_1.2.0                 tibble_3.1.6                ggplot2_3.3.5              
[17] tidyverse_1.3.1             DESeq2_1.34.0               SummarizedExperiment_1.24.0 Biobase_2.54.0             
[21] MatrixGenerics_1.6.0        matrixStats_0.61.0          GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
[25] IRanges_2.28.0              S4Vectors_0.32.3            BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
  [1] GOstats_2.60.0         readxl_1.3.1           backports_1.4.1        BiocFileCache_2.2.1    NMF_0.23.0            
  [6] plyr_1.8.6             igraph_1.2.11          lazyeval_0.2.2         GSEABase_1.56.0        shinydashboard_0.7.2  
 [11] splines_4.1.2          BiocParallel_1.28.3    crosstalk_1.2.0        gridBase_0.4-7         digest_0.6.29         
 [16] foreach_1.5.2          htmltools_0.5.2        viridis_0.6.2          fansi_1.0.2            magrittr_2.0.2        
 [21] memoise_2.0.1          cluster_2.1.2          doParallel_1.0.17      limma_3.50.0           tzdb_0.2.0            
 [26] Biostrings_2.62.0      annotate_1.72.0        modelr_0.1.8           prettyunits_1.1.1      colorspace_2.0-2      
 [31] blob_1.2.2             rvest_1.0.2            rappdirs_0.3.3         ggrepel_0.9.1          haven_2.4.3           
 [36] xfun_0.29              crayon_1.4.2           RCurl_1.98-1.5         jsonlite_1.7.3         genefilter_1.76.0     
 [41] survival_3.2-13        iterators_1.0.14       glue_1.6.1             registry_0.5-1         gtable_0.3.0          
 [46] zlibbioc_1.40.0        XVector_0.34.0         webshot_0.5.2          DelayedArray_0.20.0    Rgraphviz_2.38.0      
 [51] scales_1.1.1           pheatmap_1.0.12        DBI_1.1.2              rngtools_1.5.2         Rcpp_1.0.8            
 [56] viridisLite_0.4.0      xtable_1.8-4           progress_1.2.2         bit_4.0.4              DT_0.20               
 [61] AnnotationForge_1.36.0 htmlwidgets_1.5.4      httr_1.4.2             threejs_0.3.3          shinyAce_0.4.1        
 [66] RColorBrewer_1.1-2     ellipsis_0.3.2         pkgconfig_2.0.3        XML_3.99-0.8           dbplyr_2.1.1          
 [71] locfit_1.5-9.4         utf8_1.2.2             tidyselect_1.1.1       rlang_1.0.1            reshape2_1.4.4        
 [76] later_1.3.0            munsell_0.5.0          cellranger_1.1.0       tools_4.1.2            cachem_1.0.6          
 [81] cli_3.1.1              generics_0.1.2         RSQLite_2.2.9          broom_0.7.12           shinyBS_0.61          
 [86] evaluate_0.14          fastmap_1.1.0          heatmaply_1.3.0        knitr_1.37             bit64_4.0.5           
 [91] fs_1.5.2               dendextend_1.15.2      KEGGREST_1.34.0        RBGL_1.70.0            mime_0.12             
 [96] xml2_1.3.3             biomaRt_2.50.3         compiler_4.1.2         rstudioapi_0.13        plotly_4.10.0         
[101] filelock_1.0.2         curl_4.3.2             png_0.1-7              reprex_2.0.1           geneplotter_1.72.0    
[106] stringi_1.7.6          lattice_0.20-45        Matrix_1.4-0           vctrs_0.3.8            pillar_1.7.0          
[111] lifecycle_1.0.1        data.table_1.14.2      bitops_1.0-7           seriation_1.3.1        httpuv_1.6.5          
[116] R6_2.5.1               TSP_1.1-11             promises_1.2.0.1       gridExtra_2.3          codetools_0.2-18      
[121] assertthat_0.2.1       Category_2.60.0        pkgmaker_0.32.2        rprojroot_2.0.2        withr_2.4.3           
[126] GenomeInfoDbData_1.2.7 parallel_4.1.2         hms_1.1.1              grid_4.1.2             rmarkdown_2.11        
[131] shiny_1.7.1            lubridate_1.8.0        base64enc_0.1-3
GeneTonic org.Sc.sgd.db pcaExplorer topGO EnrichmentBrowser • 2.1k views
Entering edit mode
Last seen 2 days ago
United States

All annotation packages are updated for each release. It's a complicated, convoluted process.

What are you using to infer the 'annotation' for the current org.Sc.sgd.db? I believe what you are referring to is the genome build, which is orthogonal to the annotation. In other words, the ORF or GO term or SGD ID for a yeast gene has nothing to do with the underlying position of the gene in the genome. There isn't usually a build or a release for annotation data, mainly because it all comes from disparate sources. The OrgDb contains data from SGD, NCBI, EBI/EMBL, PFAM, GO, etc, all of whom have their own processes for updating their information and different timelines for releasing the updates. We just grab the latest data from each source and use that to build the OrgDb. At release, it's up to date, but gets stale over the six months until the next release.

Anyway, most of the OrgDb packages have the HGNC symbol, but SGD doesn't have those. Instead, they have Systematic Names (what we call 'ORF', for historical reasons that I don't remember at present). Any package that is meant to do GO analyses should be able to accommodate that fact, but I don't know anything about the packages you mention, so I cannot speak to that point. Well, I do know something about topGO, but I don't know how the other packages interact with it to do whatever they are doing. To use topGO directly you just need a 'gene2go' object, like

> gn2go <- mapIds(org.Sc.sgd.db, keys(org.Sc.sgd.db), "GOALL", "ORF", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
## filter out NA mappings
 gn2go <- gn2go[!sapply(gn2go, function(x) all(is.na(x)))]

> lapply(gn2go[1:5], "[", 1:5)
[1] "GO:0000462" "GO:0003674" "GO:0005575" "GO:0006139" "GO:0006364"

[1] "GO:0003674" "GO:0005575" "GO:0005622" "GO:0005634" "GO:0006139"

[1] "GO:0000245" "GO:0000245" "GO:0000348" "GO:0000375" "GO:0000375"

[1] "GO:0000171" "GO:0000171" "GO:0000172" "GO:0000172" "GO:0003674"

[1] "GO:0003674" "GO:0005575" "GO:0005622" "GO:0005634" "GO:0006139"

And maybe you can pass that into pcaExplorer::pca2go?

Alternatively, GOstats just works.

## get 200 random IDs
> fakeids <- keys(org.Sc.sgd.db)[sample(1:length(keys(org.Sc.sgd.db)), 200)]
## make a GOHyperGParams object
> p <- new("GOHyperGParams", geneIds = fakeids, universeGeneIds = keys(org.Sc.sgd.db), ontology = "BP", annotation = "org.Sc.sgd.db")
## do the test
> hyp <- hyperGTest(p)
## output the top terms, contingent upon the term having at least 10 genes
> summary(hyp, categorySize = 10)
      GOBPID      Pvalue OddsRatio  ExpCount Count Size
1 GO:0007076 0.002312953 14.573758 0.2806035     3   11
2 GO:0006271 0.005182110  6.501042 0.7142635     4   28
3 GO:0022616 0.007505879  5.775926 0.7907917     4   31
1                   mitotic chromosome condensation
2 DNA strand elongation involved in DNA replication
3                             DNA strand elongation
Entering edit mode

As James correctly pointed out, it can be that the column names across different species are just not the same, but this is due to the different sources one needs to pick all the information from. Having the whole wealth of information in an annotation package is quite a convenience, and I guess that a change in the naming scheme for the different identifiers might break some existing code?

The get_annotation_orgdb function is nothing but a wrapper to mapIds - you mentioned the erroneous behavior in this issue. For that, you can just replace the call with the explicit lines James suggested you in his reply.

From my pcaExplorer side, I'll try to have it flagged somewhere. It is not topGO requesting a SYMBOL column, but rather "our task" to provide the right kind of identifiers, and tell the functions to handle them as expected.

Entering edit mode

Thank you very much. I really appreciate the help.

the link to SGD above shows that the S288C_R64 annotation set 3.1 were released 2021-04, which I take to mean that they should be in the bioconductor release from October?

In any case, I am more or less happy to know that they do get updated, in some way shape or fashion, with each release, if there are updates to be had.

Entering edit mode

No, the link you provided shows the genome releases. As I already explained, that is not the same thing as an 'annotation' release.

Entering edit mode

Understood, though I think my point stands. The org.Sc.sgd.db package currently has data from 2019, which is genome version 2.1.

> org.Sc.sgd.db
OrgDb object:
| Db type: OrgDb
| Supporting package: AnnotationDbi
| ORGANISM: Saccharomyces cerevisiae
| SPECIES: Yeast
| YGSOURCENAME: Yeast Genome
| YGSOURCEURL: http://sgd-archive.yeastgenome.org
| YGSOURCEDATE: 2019-Oct25

This is where I am misunderstanding -- if the org package tracks with the bioconductor updates, then it should have genome version 3.1 and the corresponding annotations, since those were released by SGD in April of 2021, no?

Here is a better way of asking this: it seems like what you're saying is that there are no new annotations in the 3-1 genome build annotation set, and so no reason to update the org package. Is that what I am to understand?

I don't think that is the case. Here is an example of a ID in 3-1 that is not in 2-1 (admittedly, I could maybe find a more exciting example. there are a number):

> grep YNCL0014C ~/Downloads/saccharomyces_cerevisiae_R64-3-1_20210421.gff 
chrXII  SGD rRNA_gene   455414  455571  .   -   .   ID=YNCL0014C;Name=YNCL0014C;gene=RDN58-1;Alias=RDN58-1,5.8S%20ribosomal%20RNA,RDN58;Ontology_term=GO:0002181,GO:0003735,GO:0022625,SO:0000704;Note=5.8S%20ribosomal%20RNA%20%285.8S%20rRNA%29%3B%20component%20of%20the%20large%20%2860S%29%20ribosomal%20subunit%3B%20encoded%20in%20the%20rDNA%20repeat%20%28RDN1%29%20as%20part%20of%20the%2035S%20primary%20transcript%3B%203'%20end%20formation%20involves%20processing%20by%20the%20exosome%20complex%20while%205'%20end%20is%20generated%20by%20Kem1p%20and%20Rat1p;display=5.8S%20ribosomal%20RNA%20%285.8S%20rRNA%29;dbxref=SGD:S000006488;curie=SGD:S000006488
chrXII  SGD noncoding_exon  455414  455571  .   -   .   Parent=YNCL0014C_rRNA;Name=YNCL0014C_noncoding_exon
chrXII  SGD rRNA    455414  455571  .   -   .   ID=YNCL0014C_rRNA;Name=YNCL0014C_rRNA;Parent=YNCL0014C

> grep YNCL0014C ~/Downloads/S288C_reference_genome_R64-2-1_20150113/saccharomyces_cerevisiae_R64-2-1_20150113.gff
Entering edit mode

The OrgDb packages are meant to only provide annotation data, not positional data. There are positional data in these packages, but that's because we originally only had the OrgDb packages but have since introduced the TxDb packages that are meant to provide positional data. It is difficult to change these things; some of the existing codebase still references positional data using the OrgDb packages, and to remove the positional data requires someone (lately me) to go through all the code used to build the packages, as well as the code meant to interact with the packages and remove that code without negatively affecting all the other code that is still supposed to work. But since the positional data are now in the TxDb packages, it's not updated in the OrgDb.

Changing things like that is a non-trivial task. It took me 80 hours for both the Fall 2020 and Spring 2021 releases to make similarly trivial appearing changes. So, while expunging the positional data from the OrgDb packages is a goal, it's not clear when (or if) that will happen.

Which brings me to another point. We generate these annotation packages by first downloading data from various sources (for yeast, mostly sgd-archive.yeastgenome.org), parsing, and then loading the data into SQLite databases. For any of this to work, several things have to be true. First, the location of the data has to remain relatively constant. We use bash scripts to get the data, and they are semi-sophisticated, but if things change dramatically at the FTP site, then the scripts stop working. Second, the format of the data has to remain constant. The scripts used to parse the data make assumptions, based on the previous versions of the files. If things change the assumptions are no longer valid. Third, the data has to be updated. In Fall 2020 I spent 80 hours updating the code to parse the GO data because the GO people stopped providing the data in SQL dump format, and only provided updated data in their OWL format.

For the yeast OrgDb, we primarily use the SGD_features.tab file, which has this line:

$ grep S000006488 SGD_features.tab
S000006488      rRNA_gene               RDN58-1 RDN58-1 RDN58   chromosome 12           12      455571  455414  C               2011-02-03      2000-05-19      5.8S ribosomal RNA (5.8S rRNA); component of the large (60S) ribosomal subunit; encoded in the rDNA repeat (RDN1) as part of the 35S primary transcript; 3' end formation involves processing by the exosome complex while 5' end is generated by Kem1p and Rat1p

Where RDN58-1 is in the position we parse as the 'ORF'. So this happens:

> select(org.Sc.sgd.db, "YNCL0014C", "GOALL", "ORF")
Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'ORF'.


> select(org.Sc.sgd.db, "RDN58-1", "ALIAS", "ORF")
'select()' returned 1:1 mapping between keys and columns
      ORF        SGD ALIAS
1 RDN58-1 S000006488 RDN58

The SGD_features.tab file apparently hasn't been updated since 2019, and in the interim it appears they have changed the ORF to YNCL0014C. It may be that the SGD_features.tab is no longer being updated and we may need to use a different file for these data.

Entering edit mode

I can appreciate how complicated this is -- very frustrating, in fact, and its either all of our faults or none of our faults. It seems like a dumb system to keep so many different annotation sets for model organisms, in particular. We should play more nicely with one another and just agree on a single source, set of unique IDs, and update schedule. We could choose a neutral location for it -- Switzerland, or Turkmenistan, which also has declared its political neutrality, come to mind.

But, more to the point: I do appreciate how much time this has to take to create and maintain. Like everyone else on here I suspect, I very much like bioconductor software, particularly the 'core' software. I am a believer. I'd also like to be able to convince others, especially in the lab where I work, that they should be fully embracing these resources. But, it becomes difficult to make that argument with issues like these. That isn't a criticism at all -- it really is meant as a thank you for those 80 hours spent on this annotation set (and all the hours maintaining other packages here), and an expressed concern that without more regular updates/maintenance (now specifically back to org.Sc.sgd.db. I have no complaints about the update schedule generally) and maybe stronger object/class definitions/templates/whatever, there is a risk that these resources aren't as widely utilized as they should be.

Separating the name mapping objects from the coordinates seems like a great idea. For what it is worth, that gets a big +1 from me. I think a required set of keytypes for all of the 'name mapping' objects is worthwhile, so that any package built on one of the Org packages, for example, can expect that a key SYMBOL exists.

I have been finding issues reports regarding this SYMBOL issue for yeast all over the place -- seems like this has been going on for some time. Here is another example, and I have noticed that this exact question has been asked a number of times on this board also (if I had noticed that before, I probably wouldn't have posted, so apologies for the duplication):



Login before adding your answer.

Traffic: 823 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6