Why is there no UniProt information for some organisms in AnnotationHub?

Hi Marc and Others,

I am trying to use the wonderful 'AnnotationHub' package to retrieve some annotation data, but I ran into a tricky problem: there is no UniProt information for some organisms in AnnotationHub, as shown below:

For Homo sapiens:

> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2016-08-15

> query(hub, c("OrgDb","Homo sapiens"))
AnnotationHub with 1 record
# snapshotDate(): 2016-08-15 
# names(): AH49582
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Homo sapiens
# $rdataclass: OrgDb
# $title: org.Hs.eg.db.sqlite
# $description: NCBI gene ID based annotations about Homo sapiens
# $taxonomyid: 9606
# $genome: NCBI genomes
# $sourcetype: NCBI/ensembl
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl.org/pub/current_fasta
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: NCBI, Gene, Annotation 
# retrieve record with 'object[["AH49582"]]' 
> human<-hub[["AH49582"]]
loading from cache '/Users/RCPA/Documents/AppData/.AnnotationHub/56312'

> keytypes(human)
 [8] "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"          "MAP"         
[15] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"     
[22] "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"      "UNIPROT"     

For Solanum lycopersicum:

> query(hub, c("OrgDb","Solanum lycopersicum"))
AnnotationHub with 2 records
# snapshotDate(): 2016-08-15 
# $dataprovider: NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Solanum lycopersicum
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, tags, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH13359"]]' 

  AH13359 | org.Solanum_lycopersicum.eg.sqlite
  AH48047 | org.Solanum_lycopersicum.eg.sqlite

> tomato<-hub[["AH48047"]]
loading from cache '/Users/RCPA/Documents/AppData/.AnnotationHub/54353'

> keytypes(tomato)
 [1] "ACCNUM"      "ALIAS"       "ENTREZID"    "EVIDENCE"    "EVIDENCEALL" "GENENAME"    "GID"         "GO"         
 [9] "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"      "UNIGENE"


As you can see, there is unexpectedly no "UNIPROT" keytype for tomato! I think UniProt identifiers are among the most basic information for any organism, so they should be included.
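
The same comparison can be made in code rather than by eye. This is only a minimal sketch: the two vectors are copied from the keytypes() output shown in this post (the human listing starts at element [8], so that vector is partial), and the variable names are made up for illustration.

```r
# Keytypes copied from the output in this post. The human listing is partial
# because the post only shows elements [8] onwards.
human_keytypes <- c("EVIDENCE", "EVIDENCEALL", "GENENAME", "GO", "GOALL",
                    "IPI", "MAP", "OMIM", "ONTOLOGY", "ONTOLOGYALL", "PATH",
                    "PFAM", "PMID", "PROSITE", "REFSEQ", "SYMBOL", "UCSCKG",
                    "UNIGENE", "UNIPROT")
tomato_keytypes <- c("ACCNUM", "ALIAS", "ENTREZID", "EVIDENCE", "EVIDENCEALL",
                     "GENENAME", "GID", "GO", "GOALL", "ONTOLOGY",
                     "ONTOLOGYALL", "PMID", "REFSEQ", "SYMBOL", "UNIGENE")

# Is the UNIPROT keytype available for tomato?
has_uniprot <- "UNIPROT" %in% tomato_keytypes

# Which of the listed human keytypes are absent from the tomato OrgDb?
missing_in_tomato <- setdiff(human_keytypes, tomato_keytypes)

print(has_uniprot)
print(missing_in_tomato)
```

With the live objects, keytypes(human) and keytypes(tomato) can be passed to setdiff() directly instead of the hand-copied vectors.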

Could you suggest a workaround, or a way to add the "UNIPROT" information?

My sessionInfo():

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomeInfoDb_1.8.3    clusterProfiler_3.0.4 DOSE_2.10.7           org.Hs.eg.db_3.3.0    sqldf_0.4-10         
 [6] RSQLite_1.0.0         DBI_0.4-1             gsubfn_0.6-6          proto_0.3-10          AnnotationDbi_1.34.4 
[11] IRanges_2.6.1         S4Vectors_0.10.2      Biobase_2.32.0        AnnotationHub_2.4.2   BiocGenerics_0.18.0  

loaded via a namespace (and not attached):
 [1] qvalue_2.4.2                  shinyjs_0.6                   reshape2_1.4.1               
 [4] lattice_0.20-33               splines_3.3.1                 tcltk_3.3.1                  
 [7] colorspace_1.2-6              miniUI_0.1.1                  htmltools_0.3.5              
[10] chron_2.3-47                  interactiveDisplayBase_1.10.3 XML_3.98-1.4                 
[13] topGO_2.24.0                  matrixStats_0.50.2            plyr_1.8.4                   
[16] stringr_1.0.0                 munsell_0.4.3                 GOSemSim_1.30.3              
[19] gtable_0.2.0                  SparseM_1.7                   httpuv_1.3.3                 
[22] BiocInstaller_1.22.3          curl_1.1                      GSEABase_1.34.0              
[25] Rcpp_0.12.6                   xtable_1.8-2                  scales_0.4.0                 
[28] DO.db_2.9                     graph_1.50.0                  annotate_1.50.0              
[31] mime_0.5                      ggplot2_2.1.0                 digest_0.6.10                
[34] stringi_1.1.1                 shiny_0.13.2                  grid_3.3.1                   
[37] tools_3.3.1                   magrittr_1.5                  tibble_1.1                   
[40] GO.db_3.3.0                   tidyr_0.5.1                   rsconnect_0.4.3              
[43] assertthat_0.1                httr_1.2.1                    R6_2.1.2                     
[46] igraph_1.0.1 

Thanks a lot for helping ^_^





Hi Shisheng,

The human OrgDb was made with a different set of scripts than the two tomato OrgDbs. The human OrgDb is one of the 'standard' organisms we host in our repo:


Raw data for the standard OrgDb packages are pulled from many different sources and are the most comprehensive. As a convenience, we also provide OrgDbs for 'non-standard' organisms in AnnotationHub made with AnnotationForge::makeOrgPackageFromNCBI(); these are less comprehensive and pull data primarily from NCBI.

makeOrgPackageFromNCBI() does download a file from UniProt, but it is only used to create the altGO data table. I'm not sure why the UniProt identifiers weren't included and exposed as a keytype. We may consider adding them in the future, but it won't happen before the release.

In the meantime, you can use the UniProt.ws package.

hub <- AnnotationHub()
tomato <- query(hub, c("OrgDb","Solanum lycopersicum"))

> mcols(tomato)[, c("sourcetype", "rdatadateadded")]
DataFrame with 2 rows and 2 columns
           sourcetype rdatadateadded
          <character>    <character>
AH13359 NCBI/blast2GO     2014-07-09
AH48047  NCBI/UniProt     2015-07-27

We'll use the more current NCBI/UniProt resource. First get the gene ids from the OrgDb you're working with:

entrezid <- keys(tomato[[2]])

Create a UniProt.ws object (with the tax id if you have it):

> lookupUniprotSpeciesFromTaxId(4081)
[1] "Solanum lycopersicum"
up <- UniProt.ws(taxId=4081)

Decide which columns you want back:

> head(columns(up))
[1] "3D"                  "AARHUS/GHENT-2DPAGE" "AGD"              
[4] "ALLERGOME"           "ARACHNOSERVER"       "BIOCYC"

Call select():

> res <- select(up, keys=entrezid, keytype="ENTREZ_GENE", columns="UNIPROTKB")
Getting mapping data for 543501 ... and ACC
'select()' returned 1:many mapping between keys and columns
> dim(res)
[1] 30934     2
> head(res)
1      543501    O48645
2      543502    O82119
3      543502    Q8LRN7
4      543502    Q8LRN8
5      543506    K4B9Y9
6      543506    Q9ZWP2
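
Because select() reports a 1:many mapping, it may help to regroup the result before using it. The following is a minimal base R sketch, not part of the UniProt.ws API: it rebuilds only the six rows printed by head(res) above, and the column names ENTREZ_GENE and UNIPROTKB are an assumption taken from the keytype and columns arguments (the header row is not shown in the output).

```r
# Rebuild the six rows printed by head(res); column names are assumed.
res <- data.frame(
  ENTREZ_GENE = c("543501", "543502", "543502", "543502", "543506", "543506"),
  UNIPROTKB   = c("O48645", "O82119", "Q8LRN7", "Q8LRN8", "K4B9Y9", "Q9ZWP2"),
  stringsAsFactors = FALSE
)

# Group the UniProt accessions by Entrez gene ID (one list element per gene).
by_gene <- split(res$UNIPROTKB, res$ENTREZ_GENE)

# Count how many accessions each gene maps to.
n_per_gene <- lengths(by_gene)

# Reverse lookup: a named vector from UniProt accession to Entrez gene ID,
# useful when the starting point is a list of UniProt IDs.
uniprot_to_entrez <- setNames(res$ENTREZ_GENE, res$UNIPROTKB)

print(n_per_gene)
print(uniprot_to_entrez[["Q8LRN7"]])
```

The reverse lookup assumes each accession appears only once in the table, which holds for these six rows but should be checked on the full result (for example with anyDuplicated(res$UNIPROTKB)).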



Hi Valerie Obenchain, I am facing a similar problem. I want to do a GO enrichment analysis (ORA and/or GSEA) on a proteomics dataset, so I have UniProt IDs. I was thinking of using clusterProfiler, as it looks like an easy and effective package for this analysis. Unfortunately, clusterProfiler only supports the OrgDb packages of model organisms. I am trying to import Solanum lycopersicum from AnnotationHub, but it does not seem to work, and it does not have UNIPROT IDs yet. Could you suggest anything? Any kind of solution is much appreciated. Thanks, Alberto.


Alberto, you might get more assistance by creating a new post for this. The original post is very old, and it may not have as many followers as a new post would attract. If you do, I suggest selecting clusterprofiler and annotationhub under Post Tags, along with any others you think useful.

