OrgDb: why does a supposedly unique key (ID) have multiple entries?
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Wageningen University, Wageningen, the …

When retrieving annotation information from an AnnotationHub OrgDb object, I noticed something odd that I think is wrong, possibly a bug.

 

I use AnnotationHub to retrieve an OrgDb object for Chinese hamster (org.Cg.eg.db). I expect the primary key (identifier) to be an ENTREZID, and I expect each key to be unique and to consist of a single ENTREZID. However, I noticed that in some cases this is NOT so: some keys comprise multiple ENTREZIDs. AFAIK this should not happen, so I think it is a bug. If so, where does it go wrong: at NCBI (where the data is obtained from), or when the data is formatted and 'loaded' into AnnotationHub? Or am I wrong, and is this behavior expected and OK?

 

Thanks,

Guido

 

> library(AnnotationHub)
> hub = AnnotationHub()
snapshotDate(): 2016-01-14
> org.Cg.eg.db <- hub[["AH48061"]]
>
> head(keys(org.Cg.eg.db))
[1] "3979178" "3979179" "3979180" "3979181" "3979182" "3979183"
>
>#^^ is as I expected; a key is just a single ENTREZID
>
> tail(keys(org.Cg.eg.db))
[1] "100750843; 100758862" "100752919; 100758152" "100764275; 103163203" "100763702; 100767971" "100757471; 103160154" "100767133; 103158540"
>
># ^^ mmm, this is what I did NOT expect; a key consists of multiple ENTREZIDs separated by a semi-colon.
>
>
> # To better illustrate my point:
>
> keys(org.Cg.eg.db)[1:25] # First 25 keys
 [1] "3979178"   "3979179"   "3979180"   "3979181"   "3979182"   "3979183"   "3979184"   "3979185"   "3979186"   "3979187"   "3979188"   "3979189"   "3979190"   "100682525" "100682526" "100682527"
[17] "100682528" "100682529" "100682530" "100682531" "100682532" "100682533" "100682534" "100682535" "100682536"
>
>
> keys(org.Cg.eg.db)[29403:29447] # Last 45 keys
 [1] "100820697"                                                                                        
 [2] "100820698"                                                                                        
 [3] ""                                                                                                 
 [4] "100689295; 100768092"                                                                             
 [5] "100689088; 100689091"                                                                             
 [6] "100750560; 100752605; 100754396; 100757199; 100767612; 100768756; 100774182; 103163321; 103163325"
 [7] "100761562; 103163324"                                                                             
 [8] "100751148; 100751440; 100752010; 100760308; 100762049; 100769330; 100772735"                      
 [9] "100750550; 100759151; 100761758"                                                                  
[10] "100753203; 100754104; 100773314"                                                                  
[11] "100757772; 100758065; 100758350"                                                                  
[12] "100766145; 100769996"                                                                             
[13] "100750854; 100751726; 100753804; 100769039; 100773020; 100775043; 103163323"                      
[14] "100760221; 103162924"                                                                             
[15] "100755989; 100757294"                                                                             
[16] "100761794; 100762084; 100763468"                                                                  
[17] "100758301; 103161091"                                                                             
[18] "103159947; 103163218"                                                                             
[19] "100763290; 100766431; 100766720"                                                                  
[20] "100768462; 100769330"                                                                             
[21] "100763177; 100769659"                                                                             
[22] "100689287; 100689444"                                                                             
[23] "100752156; 103162765"                                                                             
[24] "100766671; 100767745"                                                                             
[25] "100760868; 103158544"                                                                             
[26] "100762888; 100765599"                                                                             
[27] "100765872; 100766167"                                                                             
[28] "100757644; 100772835"                                                                             
[29] "100773694; 103162813"                                                                             
[30] "100771740; 103161637"                                                                             
[31] "100752547; 103161507"                                                                             
[32] "100763573; 100766345"                                                                             
[33] "100763788; 100769775"                                                                             
[34] "100757206; 103162614"                                                                             
[35] "100756590; 100768608"                                                                             
[36] "100762689; 103162797"                                                                             
[37] "100762749; 100765281; 100771772"                                                                  
[38] "100754758; 100759980"                                                                             
[39] "100758844; 100765796"                                                                             
[40] "100750843; 100758862"                                                                             
[41] "100752919; 100758152"                                                                             
[42] "100764275; 103163203"                                                                             
[43] "100763702; 100767971"                                                                             
[44] "100757471; 103160154"                                                                             
[45] "100767133; 103158540"                                                                             
>
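
The check described above can be sketched in base R. This is a minimal sketch, not part of the original post: the vector `k` holds a few keys copied from the output above as a stand-in for `keys(org.Cg.eg.db)`, and the names `is_empty`, `is_compound` and `ids` are made up for illustration.

```r
# Stand-in for keys(org.Cg.eg.db): a few values copied from the output above
k <- c("100820697", "100820698", "",
       "100689295; 100768092",
       "100750550; 100759151; 100761758")

# Flag keys that are empty or that hold more than one ID
is_empty    <- !nzchar(k)
is_compound <- grepl(";", k, fixed = TRUE)

# Split the compound keys into individual, whitespace-trimmed IDs
ids <- unique(trimws(unlist(strsplit(k[is_compound], ";", fixed = TRUE))))

# Report how many keys are affected
cat(sum(is_compound), "compound and", sum(is_empty), "empty keys\n")
```

Running the same two tests on the full key vector would show how many keys are affected, and `ids` would list the individual ENTREZIDs hidden inside them.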

 


Thanks for reporting this. We're looking into it.

Valerie

@valerie-obenchain-4275
United States

Hi,

This has been fixed. I've regenerated the org.Cricetulus_griseus.eg.sqlite and pushed the new version to S3. You'll need to use removeCache() and try the download again to get the updated version.

> hub = AnnotationHub()
snapshotDate(): 2016-01-25

> org = hub[["AH48061"]]

> org
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Cricetidae griseus
| SPECIES: Cricetidae griseus
| CENTRALID: GID
| Taxonomy ID: 10029
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information
> head(keys(org))
[1] "3979178" "3979179" "3979180" "3979181" "3979182" "3979183"
> tail(keys(org))
[1] "100820681" "100820687" "100820690" "100820692" "100820697" "100820698"

Let me know if you run into other problems.

Thanks.

Valerie


Thanks Valerie!

One small remark: I noticed that by fixing this problem you changed both ORGANISM and SPECIES names into Cricetidae griseus. Let me first say that I am not an expert on this, but should these not be Chinese Hamster and Cricetulus griseus, respectively?

Thanks,

Guido

 

 


Chinese hamster is the common name and we don't have a metadata field for that. The ORGANISM and SPECIES fields of the OrgDb packages are the combination of genus and species. This is consistent with the other OrgDb resources, e.g.,

> query(hub, "Acanthisitta")[[1]]
downloading from ‘https://annotationhub.bioconductor.org/fetch/54542’
retrieving 1 resource
  |======================================================================| 100%
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Acanthisitta chloris
| SPECIES: Acanthisitta chloris
| CENTRALID: GID
| Taxonomy ID: 57068
| Db type: OrgDb
| Supporting package: AnnotationDbi


Valerie


OK, got it... I don't want to be a smarty pants, but IMHO it then still should be Cricetulus griseus (and not Cricetidae griseus). :)

See e.g. here and here.

Family = Cricetidae
Genus = Cricetulus
Species = C. griseus
Binomial name = Cricetulus griseus


Ah, I missed the Cricetidae vs Cricetulus. Yes, I agree. I will re-generate with the correct genus.

Thanks.

Valerie


OK, this should be fixed now.

Valerie


Thanks again!


Hi Valerie,

Sorry to revive this old thread, but since it is dealing with the OrgDb for Chinese Hamster (in which I have an interest) I decided to post here...

In a recent reply from James (C: problem with makeOrgPackageFromNCBI when making an annotation package) I noticed that two OrgDbs for Chinese hamster currently seem to be present in AnnotationHub, which surprised me:

> hub <- AnnotationHub()
updating metadata: retrieving 1 resource                                                                                                                              
  |======================================================================| 100%
snapshotDate(): 2016-07-20
> query(hub, c("OrgDb","Cricetulus griseus"))
AnnotationHub with 2 records
# snapshotDate(): 2016-07-20
# $dataprovider: NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Cricetulus griseus
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH12820"]]'

            title                           
  AH12820 | org.Cricetulus_griseus.eg.sqlite
  AH48061 | org.Cricetulus_griseus.eg.sqlite
>

I did a little investigating, and it turned out that the content of both objects is the same; that is, they contain the same gene IDs. However, object AH12820 is different (wrong) in the sense that the binomial name is not correct (first part: Cricetidae vs. the [correct] Cricetulus). In other words, I believe object AH12820 is redundant and should be removed from the hub.

As a side note: is it somehow possible to get info on which data sources were used when the object was built? In other words, how up to date are these annotation objects?

Thanks,

Guido

> org.Cg.AH48061 <- hub[["AH48061"]]
> org.Cg.AH48061
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Cricetulus griseus
| SPECIES: Cricetulus griseus
| CENTRALID: GID
| Taxonomy ID: 10029
| Db type: OrgDb
| Supporting package: AnnotationDbi

> org.Cg.AH12820 <- hub[["AH12820"]]
> org.Cg.AH12820
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Cricetidae griseus
| SPECIES: Cricetidae griseus
| CENTRALID: GID
| Taxonomy ID: 10029
| Db type: OrgDb
| Supporting package: AnnotationDbi

> length(keys(org.Cg.AH48061))
[1] 29404
> length(keys(org.Cg.AH12820))
[1] 29404

> select(org.Cg.AH48061, head(keys(org.Cg.AH48061)), "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
      GID SYMBOL
1 3979178   ND4L
2 3979179    ND4
3 3979180    ND5
4 3979181    ND6
5 3979182   CYTB
6 3979183    ND1
> select(org.Cg.AH12820, head(keys(org.Cg.AH12820)), "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
      GID SYMBOL
1 3979178   ND4L
2 3979179    ND4
3 3979180    ND5
4 3979181    ND6
5 3979182   CYTB
6 3979183    ND1
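
The comparison above looks at the number of keys and the first six symbols only. A fuller comparison could be sketched as follows. This is a minimal sketch under stated assumptions: the data frames `a` and `b` are hypothetical stand-ins for the `select()` results of the two objects (filled with values from the output above), and `same_keys`, `only_in_a`, `only_in_b` and `mismatched` are illustrative names.

```r
# Hypothetical stand-ins for select(org.Cg.AH48061, ...) and select(org.Cg.AH12820, ...)
a <- data.frame(GID    = c("3979178", "3979179", "3979180"),
                SYMBOL = c("ND4L", "ND4", "ND5"),
                stringsAsFactors = FALSE)
b <- a

# Do both objects hold exactly the same set of keys?
same_keys <- setequal(a$GID, b$GID)
only_in_a <- setdiff(a$GID, b$GID)
only_in_b <- setdiff(b$GID, a$GID)

# For shared keys, list rows where the symbols disagree
m <- merge(a, b, by = "GID", suffixes = c(".a", ".b"))
mismatched <- m[m$SYMBOL.a != m$SYMBOL.b, ]
```

An empty `mismatched` together with `same_keys` being `TRUE` would support the claim that the two objects agree for the compared column; other columns would need the same check.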

 

> z <- query(hub, c("OrgDb","Cricetulus griseus"))


> mcols(z)[c("sourcetype","sourceurl")]
DataFrame with 2 rows and 2 columns
           sourcetype
          <character>
AH12820 NCBI/blast2GO
AH48061  NCBI/UniProt
                                                                                                                                                   sourceurl
                                                                                                                                                 <character>
AH12820                                                                                       ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, http://www.blast2go.de/
AH48061 ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz
>

Hi Guido,

As Jim pointed out, one of the OrgDbs is from blast2GO and the other from UniProt. blast2GO was replaced with UniProt, which was thought to be more comprehensive. While the packages may be very similar, or even the same except for the 'Cricetidae' issue, they do come from different sources ('sourceurl' in Jim's post).

As for how old they are, unfortunately the current mcols() don't expose that information. The field you want is 'rdatadateadded'. Here's how you get it:

> library(DBI)

Substitute the path to your local .AnnotationHub directory:

> conn <- dbConnect(RSQLite::SQLite(), "/home/vobencha/.AnnotationHub/annotationhub.sqlite3")

> dbGetQuery(conn, "SELECT rdatadateadded FROM resources WHERE ah_id IN ('AH12820', 'AH48061')")
  rdatadateadded
1     2014-07-09
2     2015-07-27

I'll add 'rdatadateadded' to mcols() sometime in the next week. I'm not going to change the species name or remove the blast2GO resource. It's somewhat historical now but non-harmful.


Valerie


Aha, I got it, there is more to it than I naively thought... Thanks to both of you; very helpful!

Please allow me to ask one more question: is it correct to assume that the content of the AnnotationHub objects is updated twice a year, when a new version of BioC is released, like the 'regular' OrgDb annotation packages (e.g. org.Hs.eg.db)?

Thanks,

Guido


In general, resources in AnnotationHub are updated as the newest version becomes available, e.g., fasta, gtf etc. In the specific case of the OrgDbs in the hub, no, these have not been updated at each release. The plan is to get there but time and manpower are limiting.

Valerie
