Yes, you can build an OrgDb yourself for Tetrahymena thermophila using the function makeOrgPackageFromNCBI() from the package AnnotationForge. You only need the taxonomy ID of T. thermophila, which is apparently 312017. Please note that this OrgDb contains only the annotation information available at NCBI (for Tetrahymena thermophila SB210).
# Download files from NCBI and create an annotation database named "org.Tthermophila.eg.db".
# You will need ~25 GB of disk space for this.
# Run time was ~7 hr on my computer; leave the R session untouched during that time.
# You can ignore the warning about removing the file './org.Tthermophila.eg.sqlite'.
> # set working dir to a location with sufficient HDD space
> setwd("D:\\my\\favorite\\directory")
> # load required libraries.
> library(AnnotationForge)
> library(AnnotationDbi)
> library(GenomeInfoDb)
> # the step below takes a long time!
> makeOrgPackageFromNCBI(version = "0.0.1", author = "First Last Name <email@address.com>", maintainer = "First Last Name <email@address.com>", outputDir = ".", tax_id = "312017", genus = "Tetrahymena", species = "thermophila")
# Next, install the generated org.Tthermophila.eg.db package for use in R.
> install.packages(pkgs="./org.Tthermophila.eg.db", repos=NULL, type="source")
* installing *source* package 'org.Tthermophila.eg.db' ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (org.Tthermophila.eg.db)
#check
> library(org.Tthermophila.eg.db)
> org.Tthermophila.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Tetrahymena thermophila
| SPECIES: Tetrahymena thermophila
| CENTRALID: GID
| Taxonomy ID: 312017
| Db type: OrgDb
| Supporting package: AnnotationDbi
> columns(org.Tthermophila.eg.db)
[1] "ACCNUM" "ALIAS" "ENTREZID" "EVIDENCE" "EVIDENCEALL"
[6] "GENENAME" "GID" "GO" "GOALL" "ONTOLOGY"
[11] "ONTOLOGYALL" "PMID" "REFSEQ" "SYMBOL"
> keytypes(org.Tthermophila.eg.db)
[1] "ACCNUM" "ALIAS" "ENTREZID" "EVIDENCE" "EVIDENCEALL"
[6] "GENENAME" "GID" "GO" "GOALL" "ONTOLOGY"
[11] "ONTOLOGYALL" "PMID" "REFSEQ" "SYMBOL"
> head(keys(org.Tthermophila.eg.db))
[1] "7822955" "7822974" "7823109" "7823219" "7823307" "7823613"
> #Check: number of keys (genes) indeed corresponds to # genes listed @ NCBI [=26997]
> length(keys(org.Tthermophila.eg.db))
[1] 26997
> mykeys <- keys(org.Tthermophila.eg.db)[1:25]
> anno.result <- select(org.Tthermophila.eg.db, keys=mykeys, columns=c("ENTREZID","SYMBOL","GENENAME","ALIAS","GO"),keytype="ENTREZID")
'select()' returned 1:many mapping between keys and columns
> head(anno.result)
ENTREZID SYMBOL GENENAME ALIAS GO
1 7822955 TTHERM_00136120 60S ribosomal protein L6 TTHERM_00136120 GO:0022625
2 7822955 TTHERM_00136120 60S ribosomal protein L6 TTHERM_00136120 GO:0003735
3 7822955 TTHERM_00136120 60S ribosomal protein L6 TTHERM_00136120 GO:0002181
4 7822955 TTHERM_00136120 60S ribosomal protein L6 TTHERM_00136120 GO:0000027
5 7822974 TTHERM_00134940 60S ribosomal protein L23a TTHERM_00134940 GO:0022625
6 7822974 TTHERM_00134940 60S ribosomal protein L23a TTHERM_00134940 GO:0019843
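Because of the 1:many mapping, each gene appears once per GO term. If one row per gene is more convenient, the GO IDs can be collapsed. Below is a minimal base-R sketch of one way to do that. The data frame anno.demo simply re-types the six rows shown above so the example runs without the OrgDb package installed; in practice you would pass anno.result instead. The names anno.demo and go.per.gene are only illustrative.

```r
# Re-typed from the head(anno.result) output above, so that this sketch
# runs without org.Tthermophila.eg.db being installed.
anno.demo <- data.frame(
  ENTREZID = c(rep("7822955", 4), rep("7822974", 2)),
  SYMBOL   = c(rep("TTHERM_00136120", 4), rep("TTHERM_00134940", 2)),
  GO       = c("GO:0022625", "GO:0003735", "GO:0002181", "GO:0000027",
               "GO:0022625", "GO:0019843"),
  stringsAsFactors = FALSE
)

# One row per gene, with its unique GO IDs joined by ";".
go.per.gene <- aggregate(GO ~ ENTREZID + SYMBOL, data = anno.demo,
                         FUN = function(x) paste(unique(x), collapse = ";"))
go.per.gene
```

Note that the formula interface of aggregate() drops rows containing NA by default, so genes without any GO annotation would disappear from the collapsed table.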
> sessionInfo()
R version 3.4.0 Patched (2017-05-10 r72670)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
attached base packages:
[1] stats4 parallel stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] org.Tthermophila.eg.db_0.0.1 GenomeInfoDb_1.12.0
[3] AnnotationForge_1.18.0 AnnotationDbi_1.38.0
[5] IRanges_2.10.2 S4Vectors_0.14.2
[7] Biobase_2.36.2 AnnotationHub_2.8.1
[9] BiocGenerics_0.22.0
Thank you so much for your response!
I also needed to load library(biomaRt); other than that, everything works exactly as you said.
When I run
makeOrgPackageFromNCBI(version="0.0.1", author = "First Last Name <email@address.com>", maintainer = "First Last Name <email@address.com>", ".", tax_id = "312017", genus = "Tetrahymena", species= "thermophila")
I get the following output and error:
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
Error in result_create(conn@ptr, statement) :
no such table: main.gene2pubmed
Could you give some advice?
Many thanks!
Well, the line of code above (still) works for me on the current version of R/BioC (although the run time was longer than before, now ~20 hr). If needed, I can send you the OrgDb package I made.
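One thing that may be worth checking before re-running is whether the five files named in the download message actually arrived. I do not know that this is the cause of the error, so treat it only as a diagnostic. Below is a minimal base-R sketch, assuming the files were saved to the working directory (as far as I know the default for NCBIFilesDir). check_ncbi_files() is a hypothetical helper written for this post, not part of AnnotationForge.

```r
# Hypothetical helper (not part of AnnotationForge): report whether each
# NCBI file named in the download message is present, and how large it is.
check_ncbi_files <- function(dir = ".") {
  files <- c("gene2pubmed.gz", "gene2accession.gz", "gene2refseq.gz",
             "gene_info.gz", "gene2go.gz")
  paths <- file.path(dir, files)
  data.frame(file       = files,
             exists     = file.exists(paths),
             size_bytes = file.size(paths),  # NA when the file is missing
             stringsAsFactors = FALSE)
}

# Assumption: the files were downloaded to the current working directory.
check_ncbi_files(".")
```

A missing file or one with a size of (nearly) zero would suggest the download did not complete.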