When retrieving annotation information from an AnnotationHub OrgDb object I noticed something odd, which I think is wrong and possibly a bug.
I use AnnotationHub to retrieve an OrgDb object for Chinese Hamster (org.Cg.eg.db). I expect the primary key (identifier) to be an ENTREZID, and I expect these to be unique, with each key consisting of a single ENTREZID. However, I noticed that in some cases this is NOT so: some keys consist of multiple ENTREZIDs. AFAIK this should not happen, and is therefore a bug. If so, where does it go wrong? At the NCBI (where the data is obtained from), or is the data wrongly formatted when 'loaded' into AnnotationHub? Or am I wrong, and is this behavior expected and OK?
Thanks,
Guido
> library(AnnotationHub)
> hub = AnnotationHub()
snapshotDate(): 2016-01-14
> org.Cg.eg.db <- hub[["AH48061"]]
>
> head(keys(org.Cg.eg.db))
[1] "3979178" "3979179" "3979180" "3979181" "3979182" "3979183"
>
> #^^ is as I expected; a key is just a single ENTREZID
>
> tail(keys(org.Cg.eg.db))
[1] "100750843; 100758862" "100752919; 100758152" "100764275; 103163203" "100763702; 100767971" "100757471; 103160154" "100767133; 103158540"
>
> # ^^ mmm, this is what I did NOT expect; a key consists of multiple ENTREZIDs separated by a semi-colon.
>
>
> # To better illustrate my point:
>
> keys(org.Cg.eg.db)[1:25] # First 25 keys
 [1] "3979178" "3979179" "3979180" "3979181" "3979182" "3979183" "3979184" "3979185" "3979186" "3979187" "3979188" "3979189" "3979190" "100682525" "100682526" "100682527"
[17] "100682528" "100682529" "100682530" "100682531" "100682532" "100682533" "100682534" "100682535" "100682536"
>
>
> keys(org.Cg.eg.db)[29403:29447] # Last 45 keys
 [1] "100820697"
 [2] "100820698"
 [3] ""
 [4] "100689295; 100768092"
 [5] "100689088; 100689091"
 [6] "100750560; 100752605; 100754396; 100757199; 100767612; 100768756; 100774182; 103163321; 103163325"
 [7] "100761562; 103163324"
 [8] "100751148; 100751440; 100752010; 100760308; 100762049; 100769330; 100772735"
 [9] "100750550; 100759151; 100761758"
[10] "100753203; 100754104; 100773314"
[11] "100757772; 100758065; 100758350"
[12] "100766145; 100769996"
[13] "100750854; 100751726; 100753804; 100769039; 100773020; 100775043; 103163323"
[14] "100760221; 103162924"
[15] "100755989; 100757294"
[16] "100761794; 100762084; 100763468"
[17] "100758301; 103161091"
[18] "103159947; 103163218"
[19] "100763290; 100766431; 100766720"
[20] "100768462; 100769330"
[21] "100763177; 100769659"
[22] "100689287; 100689444"
[23] "100752156; 103162765"
[24] "100766671; 100767745"
[25] "100760868; 103158544"
[26] "100762888; 100765599"
[27] "100765872; 100766167"
[28] "100757644; 100772835"
[29] "100773694; 103162813"
[30] "100771740; 103161637"
[31] "100752547; 103161507"
[32] "100763573; 100766345"
[33] "100763788; 100769775"
[34] "100757206; 103162614"
[35] "100756590; 100768608"
[36] "100762689; 103162797"
[37] "100762749; 100765281; 100771772"
[38] "100754758; 100759980"
[39] "100758844; 100765796"
[40] "100750843; 100758862"
[41] "100752919; 100758152"
[42] "100764275; 103163203"
[43] "100763702; 100767971"
[44] "100757471; 103160154"
[45] "100767133; 103158540"
>
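For anyone wanting to check how many keys are affected, here is a minimal sketch in base R. It runs on a handful of keys copied from the output above so that it works without AnnotationHub; with the real object one would presumably start from k <- keys(org.Cg.eg.db) instead. The variable names (k, multi_keys, split_ids, n_empty) are only illustrative, and the "; " separator is assumed from what the output shows.

```r
# A few keys copied from the session output (not the full key set).
# With the real object: k <- keys(org.Cg.eg.db)
k <- c("3979178", "100820698", "", "100689295; 100768092",
       "100750550; 100759151; 100761758")

# Keys that hold more than one ENTREZID contain a semicolon
is_multi <- grepl(";", k, fixed = TRUE)
multi_keys <- k[is_multi]

# Split each compound key into its individual ENTREZIDs
# (assumes the "; " separator seen in the output)
split_ids <- strsplit(multi_keys, "; ", fixed = TRUE)

# The empty key ("") seen in the output is counted separately
n_empty <- sum(!nzchar(k))

print(multi_keys)          # the compound keys
print(lengths(split_ids))  # number of ENTREZIDs per compound key
print(n_empty)             # number of empty keys
```

This only identifies the odd keys; it does not say which of the ENTREZIDs in a compound key is the "right" one.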
Thanks for reporting this. We're looking into it.
Valerie