AnnotationHub's NCBI OrgDB have older EGSOURCEDATE than org.Xy.eg.db?
Jenny Drnevich ★ 2.0k
@jenny-drnevich-2812
United States

Hello,

I ran into a puzzling situation with AnnotationHub when trying to retrieve updated annotations for rat. NCBI released a new gene model set for rat at the end of July (the beginning of August by the time it propagated through their FTP server), which we used for a recent RNA-Seq experiment. Bioconductor's org.Rn.eg.db package was built back in March/April 2016, so it is missing ~700 new genes. I tried using AnnotationHub to get updated annotations. However, even though snapshotDate() is 2016-08-15, which should be just after the updated annotations were released, the OrgDb retrieved for rat has an older EGSOURCEDATE (2015-Aug11) than org.Rn.eg.db does (2016-Mar14). I checked mouse and it has the same problem. Why are the OrgDbs in AnnotationHub not current?

Thanks,

Jenny

> library(AnnotationHub)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport,
    clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply,
    parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call, duplicated,
    eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
    pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit

> library(org.Rn.eg.db)
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To
    cite Bioconductor, see 'citation("Biobase")', and for packages
    'citation("pkgname")'.

Attaching package: ‘Biobase’

The following object is masked from ‘package:AnnotationHub’:

    cache

Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

> library(org.Mm.eg.db)

> 
> 
> ah = AnnotationHub()
snapshotDate(): 2016-08-15
> 
> #See what they have for Rattus norvegicus, from NCBI and OrgDB
> 
> query(ah, c("OrgDB", "NCBI", "Rattus norvegicus"))
AnnotationHub with 1 record
# snapshotDate(): 2016-08-15 
# names(): AH49585
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Rattus norvegicus
# $rdataclass: OrgDb
# $title: org.Rn.eg.db.sqlite
# $description: NCBI gene ID based annotations about Rattus norvegicus
# $taxonomyid: 10116
# $genome: NCBI genomes
# $sourcetype: NCBI/ensembl
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl.org/pub/current_fasta
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: NCBI, Gene, Annotation 
# retrieve record with 'object[["AH49585"]]' 
> 
> 
> ah[["AH49585"]]
loading from cache ‘C:/Users/drnevich/Documents/AppData/.AnnotationHub/56315’
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: RAT_DB
| ORGANISM: Rattus norvegicus
| SPECIES: Rat
| EGSOURCEDATE: 2015-Aug11
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 10116
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 20150808
| GOEGSOURCEDATE: 2015-Aug11
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Rattus norvegicus)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/rn6
| GPSOURCEDATE: 2014-Aug1
| ENSOURCEDATE: 2015-Jul16
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Thu Aug 20 15:37:19 2015

Please see: help('select') for usage information
> 
> 
> #compare EGSOURCEDATE with org.Rn.eg.db:
> 
> org.Rn.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: RAT_DB
| ORGANISM: Rattus norvegicus
| SPECIES: Rat
| EGSOURCEDATE: 2016-Mar14
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 10116
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 20160305
| GOEGSOURCEDATE: 2016-Mar14
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Rattus norvegicus)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/rn6
| GPSOURCEDATE: 2014-Aug1
| ENSOURCEDATE: 2016-Mar9
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Wed Mar 23 15:52:15 2016

Please see: help('select') for usage information
> 
> 
> #Try mouse:
> 
> query(ah, c("OrgDB", "NCBI", "Mus musculus"))
AnnotationHub with 1 record
# snapshotDate(): 2016-08-15 
# names(): AH49583
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Mus musculus
# $rdataclass: OrgDb
# $title: org.Mm.eg.db.sqlite
# $description: NCBI gene ID based annotations about Mus musculus
# $taxonomyid: 10090
# $genome: NCBI genomes
# $sourcetype: NCBI/ensembl
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl.org/pub/current_fasta
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: NCBI, Gene, Annotation 
# retrieve record with 'object[["AH49583"]]' 
> 
> 
> ah[["AH49583"]]
loading from cache ‘C:/Users/drnevich/Documents/AppData/.AnnotationHub/56313’
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: MOUSE_DB
| ORGANISM: Mus musculus
| SPECIES: Mouse
| EGSOURCEDATE: 2015-Aug11
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 10090
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 20150808
| GOEGSOURCEDATE: 2015-Aug11
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Mus musculus)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10
| GPSOURCEDATE: 2012-Mar8
| ENSOURCEDATE: 2015-Jul16
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Thu Aug 20 15:49:03 2015

Please see: help('select') for usage information
> 
> #compare EGSOURCEDATE with org.Mm.eg.db:
> 
> org.Mm.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: MOUSE_DB
| ORGANISM: Mus musculus
| SPECIES: Mouse
| EGSOURCEDATE: 2016-Mar14
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 10090
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 20160305
| GOEGSOURCEDATE: 2016-Mar14
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Mus musculus)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10
| GPSOURCEDATE: 2012-Mar8
| ENSOURCEDATE: 2016-Mar9
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Wed Mar 23 15:59:16 2016

Please see: help('select') for usage information
> 
> 
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] org.Mm.eg.db_3.3.0   org.Rn.eg.db_3.3.0   AnnotationDbi_1.34.4 IRanges_2.6.1       
[5] S4Vectors_0.10.3     Biobase_2.32.0       AnnotationHub_2.4.2  BiocGenerics_0.18.0 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7                   digest_0.6.10                
 [3] mime_0.5                      R6_2.1.3                     
 [5] xtable_1.8-2                  DBI_0.5                      
 [7] RSQLite_1.0.0                 BiocInstaller_1.22.3         
 [9] httr_1.2.1                    curl_1.2                     
[11] tools_3.3.1                   shiny_0.13.2                 
[13] httpuv_1.3.3                  htmltools_0.3.5              
[15] interactiveDisplayBase_1.10.3
@valerie-obenchain-4275
United States

Hi Jenny,

Unfortunately, the OrgDbs in AnnotationHub were not updated for the Spring 2016 release. This was due to staffing constraints and an unclear path forward for handling these objects in the hub. All OrgDbs will be updated for the Fall 2016 release next month, but they will stay on a 6-month cycle. Updating the OrgDb packages as each new gene model becomes available isn't feasible for us right now. The AnnotationForge::makeOrgPackage* functions are available for those who want to forge a package from the most current data.
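
For readers who want to try that route, a minimal sketch of forging a rat OrgDb from current NCBI data might look like the following. It assumes AnnotationForge::makeOrgPackageFromNCBI() is the function of interest; the version string, maintainer/author address, and output directory are placeholders, and `run_forge` is a hypothetical guard added here because the real call downloads large files from NCBI and can run for a long time.

```r
# Arguments for AnnotationForge::makeOrgPackageFromNCBI().
# version, maintainer, author and outputDir are placeholder values.
forge_args <- list(
  version    = "0.1",
  maintainer = "Your Name <you@example.org>",
  author     = "Your Name <you@example.org>",
  outputDir  = ".",
  tax_id     = "10116",      # Rattus norvegicus (the TAXID printed in the question)
  genus      = "Rattus",
  species    = "norvegicus"
)

# Hypothetical guard: set to TRUE only when you really want to download and build.
run_forge <- FALSE

if (run_forge && requireNamespace("AnnotationForge", quietly = TRUE)) {
  do.call(AnnotationForge::makeOrgPackageFromNCBI, forge_args)
}
```

Keeping the arguments in a list makes it easy to record exactly what was used to build a given package, which matters when the source data has no version number of its own.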

We've also had a problem with multiple OrgDbs in the hub with the same title.  As you know, the OrgDbs differ from the TxDbs or BSgenome annotations in that they aren't tied to a particular release or genome build - they just represent the most current data at the time. Because the titles don't include a date or genome, one OrgDb can't be distinguished from the next version other than checking the 'rdatadateadded' metadata field (exposed in devel):

library(AnnotationHub)
hub <- AnnotationHub()

# date on which the first record's data was added to the hub
mcols(hub[1])$rdatadateadded
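
As a rough illustration of how that field could be used to tell same-titled records apart, here is a sketch in base R. The `meta` table is a toy stand-in for the hub metadata: its IDs and dates are made up for illustration, and `newest_for_title()` is a hypothetical helper, not part of AnnotationHub. It also assumes the dates can be converted with as.Date().

```r
# Toy stand-in for the hub metadata; IDs and dates below are made up.
meta <- data.frame(
  id             = c("AH_old", "AH_new", "AH_other"),
  title          = c("org.Rn.eg.db.sqlite", "org.Rn.eg.db.sqlite",
                     "org.Mm.eg.db.sqlite"),
  rdatadateadded = c("2015-08-26", "2016-10-11", "2015-08-26"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one title, return the id of the most recently added record.
newest_for_title <- function(meta, title) {
  hits  <- meta[meta$title == title, , drop = FALSE]
  added <- as.numeric(as.Date(hits$rdatadateadded))
  hits$id[which.max(added)]
}

newest_for_title(meta, "org.Rn.eg.db.sqlite")
```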

In the next release, only the OrgDbs associated with the Bioconductor version being used will be exposed to the user. The plan is for AnnotationHub to host both the 'standard' organisms available in our repo

  http://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb

as well as a large number (>1000) of other 'non-standard' organisms.

Sorry this has been less than satisfying; we hope to have this cleaned up after the next release.

Valerie

PS: The snapshotDate in AnnotationHub represents the last time a change was made to any metadata in the database. It's not an indicator that all pre-built resources have been regenerated. For example, the snapshotDate changes when new Ensembl FASTA files are added, but that doesn't mean any other resources have changed.
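
To make the distinction concrete, the dates quoted in the question can be compared directly. This is only a sketch in base R: `parse_egsourcedate()` is a hypothetical helper, and parsing the month abbreviation with `%b` assumes an English (or C) locale. With the packages loaded, the EGSOURCEDATE string should be retrievable from the name/value table returned by metadata() on the OrgDb object.

```r
# EGSOURCEDATE strings look like "2015-Aug11"; %b assumes English month names.
parse_egsourcedate <- function(x) as.Date(x, format = "%Y-%b%d")

hub_date      <- parse_egsourcedate("2015-Aug11")  # AH49585 retrieved from the hub
pkg_date      <- parse_egsourcedate("2016-Mar14")  # org.Rn.eg.db 3.3.0
snapshot_date <- as.Date("2016-08-15")             # snapshotDate() from the question

# Is the hub copy older than the installed package?
hub_date < pkg_date

# How far the snapshot date runs ahead of the actual source data.
snapshot_date - hub_date
```

The snapshot date is later than both source dates, which is consistent with it tracking metadata changes rather than the age of any one resource.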

Hi Valerie,

Thanks for the explanation, that makes more sense. NCBI is a bit of a mess with all the gene annotations. They did actually start versioning their gene set annotations (see, for example, ftp://ftp.ncbi.nlm.nih.gov/genomes/Rattus_norvegicus/README_CURRENT_RELEASE), but you are right that the files NCBI provides on gene don't include the version number in their names. There are also many different locations for the genome and gene models:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Rattus_norvegicus

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000001895.5_Rnor_6.0/

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Rattus_norvegicus/all_assembly_versions/GCF_000001895.5_Rnor_6.0

It's all very confusing. From what location/files do you pull the information for making org.Xx.eg.db packages and the OrgDb objects?

Thanks,

Jenny

The standard organism OrgDbs in our repo

  http://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb

are built from data downloaded from multiple sources (UCSC, NCBI, Ensembl, etc.). The other, non-standard organism OrgDbs in AnnotationHub are made with makeOrgPackageFromNCBI(), which downloads from

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/

ftp://ftp.geneontology.org/pub/go/godata

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping

Valerie

 

 
