OrgDb for Tetrahymena thermophila
@linyingzhang-13125
Last seen 7.5 years ago

I am trying to use clusterProfiler, but there is no OrgDb object available for Tetrahymena thermophila. I only found an Inparanoid8Db object through AnnotationHub. Can I build an OrgDb for Tetrahymena thermophila? If so, how?

annotation clustering • 1.6k views
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Last seen 22 hours ago
Wageningen University, Wageningen, the …

Yes, you can build an OrgDb for Tetrahymena thermophila yourself using the function makeOrgPackageFromNCBI() from the package AnnotationForge. You only need the taxonomy ID of T. thermophila, which apparently is 312017. Please note that this OrgDb contains (only) the annotation information available at NCBI (for Tetrahymena thermophila SB210).

 

# download files from NCBI and create an annotation database named "org.Tthermophila.eg.db"
# you will need ~25GB disk space for this.
# the function ran for ~7 hr on my computer; leave the R session untouched during that time.
# you can ignore the warning about removing the file './org.Tthermophila.eg.sqlite'

> # set working dir to a location with sufficient HDD space
> setwd("D:\\my\\favorite\\directory")

> # load required libraries.
> library(AnnotationForge)
> library(AnnotationDbi)
> library(GenomeInfoDb)

> # the step below takes a long time!
> makeOrgPackageFromNCBI(version = "0.0.1", author = "First Last Name <email@address.com>", maintainer = "First Last Name <email@address.com>", outputDir = ".", tax_id = "312017", genus = "Tetrahymena", species = "thermophila")

#Next install the generated org.Tthermophila.eg.db for use in R.
> install.packages(pkgs="./org.Tthermophila.eg.db", repos=NULL, type="source")
* installing *source* package 'org.Tthermophila.eg.db' ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (org.Tthermophila.eg.db)

#check
> library(org.Tthermophila.eg.db)
> org.Tthermophila.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Tetrahymena thermophila
| SPECIES: Tetrahymena thermophila
| CENTRALID: GID
| Taxonomy ID: 312017
| Db type: OrgDb
| Supporting package: AnnotationDbi

> columns(org.Tthermophila.eg.db)
 [1] "ACCNUM"      "ALIAS"       "ENTREZID"    "EVIDENCE"    "EVIDENCEALL"
 [6] "GENENAME"    "GID"         "GO"          "GOALL"       "ONTOLOGY"   
[11] "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"     
> keytypes(org.Tthermophila.eg.db)
 [1] "ACCNUM"      "ALIAS"       "ENTREZID"    "EVIDENCE"    "EVIDENCEALL"
 [6] "GENENAME"    "GID"         "GO"          "GOALL"       "ONTOLOGY"   
[11] "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"     
> head(keys(org.Tthermophila.eg.db))
[1] "7822955" "7822974" "7823109" "7823219" "7823307" "7823613"

> # Check: the number of keys (genes) indeed matches the number of genes listed at NCBI [= 26997]
> length(keys(org.Tthermophila.eg.db))
[1] 26997

> mykeys <- keys(org.Tthermophila.eg.db)[1:25]
> anno.result <- select(org.Tthermophila.eg.db, keys=mykeys, columns=c("ENTREZID","SYMBOL","GENENAME","ALIAS","GO"),keytype="ENTREZID")
'select()' returned 1:many mapping between keys and columns
> head(anno.result)
  ENTREZID          SYMBOL                   GENENAME           ALIAS         GO
1  7822955 TTHERM_00136120   60S ribosomal protein L6 TTHERM_00136120 GO:0022625
2  7822955 TTHERM_00136120   60S ribosomal protein L6 TTHERM_00136120 GO:0003735
3  7822955 TTHERM_00136120   60S ribosomal protein L6 TTHERM_00136120 GO:0002181
4  7822955 TTHERM_00136120   60S ribosomal protein L6 TTHERM_00136120 GO:0000027
5  7822974 TTHERM_00134940 60S ribosomal protein L23a TTHERM_00134940 GO:0022625
6  7822974 TTHERM_00134940 60S ribosomal protein L23a TTHERM_00134940 GO:0019843
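
Because select() reports a 1:many mapping here (one row per gene/GO pair), it can help to collapse the result before using it further. Below is a minimal base-R sketch. The `anno` data frame is a stand-in, hand-copied from the six rows printed above rather than the full `anno.result`, and the names `go_by_gene` and `term2gene` are only illustrative.

```r
# Stand-in for anno.result: the six rows printed above (illustration only).
anno <- data.frame(
  ENTREZID = c("7822955", "7822955", "7822955", "7822955", "7822974", "7822974"),
  SYMBOL   = c(rep("TTHERM_00136120", 4), rep("TTHERM_00134940", 2)),
  GO       = c("GO:0022625", "GO:0003735", "GO:0002181", "GO:0000027",
               "GO:0022625", "GO:0019843"),
  stringsAsFactors = FALSE
)

# One list element per gene, holding all GO IDs annotated to that gene.
go_by_gene <- split(anno$GO, anno$ENTREZID)

# Two-column term/gene table with duplicate pairs removed.
term2gene <- unique(anno[, c("GO", "ENTREZID")])

print(lengths(go_by_gene))
print(term2gene)
```

The two-column term/gene layout is the shape that clusterProfiler's enricher() accepts through its TERM2GENE argument, should a custom mapping be preferred over passing the OrgDb directly.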

 

> sessionInfo()
R version 3.4.0 Patched (2017-05-10 r72670)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1


attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
[1] org.Tthermophila.eg.db_0.0.1 GenomeInfoDb_1.12.0         
[3] AnnotationForge_1.18.0       AnnotationDbi_1.38.0        
[5] IRanges_2.10.2               S4Vectors_0.14.2            
[7] Biobase_2.36.2               AnnotationHub_2.8.1         
[9] BiocGenerics_0.22.0         
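
Since the original goal was to use clusterProfiler, here is a hedged sketch of how the freshly installed OrgDb might be passed to enrichGO(). The gene vector is hypothetical (it simply reuses three of the example keys shown above, far too few for a meaningful enrichment), and the block only runs the analysis if both packages are actually installed.

```r
# Hypothetical input: Entrez Gene IDs of interest (here just three of the
# example keys shown above; a real analysis would use a proper gene list).
genes_of_interest <- c("7822955", "7822974", "7823109")

# Only attempt the enrichment when both packages are available.
if (requireNamespace("clusterProfiler", quietly = TRUE) &&
    requireNamespace("org.Tthermophila.eg.db", quietly = TRUE)) {
  library(org.Tthermophila.eg.db)
  ego <- clusterProfiler::enrichGO(
    gene    = genes_of_interest,
    OrgDb   = org.Tthermophila.eg.db,
    keyType = "ENTREZID",   # the IDs above are Entrez Gene IDs
    ont     = "BP"          # assumption: Biological Process is the ontology of interest
  )
  print(head(as.data.frame(ego)))
} else {
  message("clusterProfiler and/or org.Tthermophila.eg.db not installed; skipping.")
}
```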


Thank you so much for your response!

I also needed to load library(biomaRt); other than that, everything works exactly as you said.
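
To catch a missing dependency such as biomaRt before starting a run that takes hours, the required packages can be checked up front. This is a small sketch: the package list is the three packages loaded in the answer plus biomaRt as reported here, and the variable names are illustrative.

```r
# Packages loaded in the answer above, plus biomaRt as reported in this comment.
needed <- c("AnnotationForge", "AnnotationDbi", "GenomeInfoDb", "biomaRt")

# TRUE for each package whose namespace can be loaded.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  # On current Bioconductor these could be installed with, for example,
  # BiocManager::install(missing_pkgs)
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```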


When I run

makeOrgPackageFromNCBI(version="0.0.1", author = "First Last Name <email@address.com>", maintainer = "First Last Name <email@address.com>", ".", tax_id = "312017", genus = "Tetrahymena", species= "thermophila")

I get the following output:

starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
Error in result_create(conn@ptr, statement) : 
  no such table: main.gene2pubmed

 

Could you give some advice?

Many thanks!

 


Well, the above line of code (still) works for me on the current version of R/BioC (although the run time was longer than before, now ~20 hrs). If needed, I can send you the OrgDb package I made.

> # set working dir to a location with sufficient HDD space
> # needed >35GB
> setwd("D:\\my\\favorite\\directory")
>
> # load required libraries.
> library(AnnotationForge)
> library(AnnotationDbi)
> library(GenomeInfoDb)

> # create OrgDb package.
> # Note: takes a very long time (~20 hrs)
> makeOrgPackageFromNCBI(version="0.0.1", author = "First Last Name <email@address.com>", maintainer = "First Last Name <email@address.com>", ".", tax_id = "312017", genus = "Tetrahymena", species= "thermophila")

If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated
  with ensembl IDs.
making the OrgDb package ...
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
table metadata filled

'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
'select()' returned many:1 mapping between keys and columns
Populating go_all table:
go_all table filled
Creating package in ./org.Tthermophila.eg.db
Now deleting temporary database file
complete!
[1] "org.Tthermophila.eg.sqlite"
There were 50 or more warnings (use warnings() to see the first 50)
>
> sessionInfo()
R version 3.5.1 Patched (2018-08-13 r75130)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] GenomeInfoDb_1.16.0    AnnotationForge_1.22.2 AnnotationDbi_1.42.1  
[4] IRanges_2.14.11        S4Vectors_0.18.3       Biobase_2.40.0        
[7] BiocGenerics_0.26.0    BiocInstaller_1.30.0  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18           magrittr_1.5           hms_0.4.2             
 [4] progress_1.2.0         bit_1.1-14             R6_2.2.2              
 [7] rlang_0.2.2            httr_1.3.1             stringr_1.3.1         
[10] blob_1.1.1             tools_3.5.1            DBI_1.0.0             
[13] assertthat_0.2.0       bit64_0.9-7            digest_0.6.17         
[16] crayon_1.3.4           GenomeInfoDbData_1.1.0 bitops_1.0-6          
[19] RCurl_1.95-4.11        biomaRt_2.36.1         memoise_1.1.0         
[22] RSQLite_2.1.1          stringi_1.2.4          GO.db_3.6.0           
[25] compiler_3.5.1         prettyunits_1.0.2      XML_3.98-1.16         
[28] pkgconfig_2.0.2       
>
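
For the "no such table: main.gene2pubmed" error reported above, one thing that may be worth checking (a guess, not a confirmed diagnosis) is whether the NCBI downloads and the cache database are actually complete, since the log says the cache is assembled in the NCBIFilesDir directory. The sketch below only inspects files. It assumes the current directory was used as NCBIFilesDir, that the downloads keep the names printed in the log, and that the cache is called "NCBI.sqlite"; all three are assumptions to adjust for your setup.

```r
# Assumption: the working directory served as NCBIFilesDir.
ncbi_dir <- "."

# Assumed file names: the cache database plus the five downloads listed in the log.
expected <- c("NCBI.sqlite", "gene2pubmed.gz", "gene2accession.gz",
              "gene2refseq.gz", "gene_info.gz", "gene2go.gz")
paths <- file.path(ncbi_dir, expected)

# Report which files exist and how large they are (size is NA if absent).
status <- data.frame(
  file   = expected,
  exists = file.exists(paths),
  size   = file.size(paths),
  stringsAsFactors = FALSE
)
print(status)

# Files that are absent or empty are candidates for an interrupted download.
suspect <- status$file[!status$exists | is.na(status$size) | status$size == 0]
if (length(suspect) > 0) {
  message("Missing or empty: ", paste(suspect, collapse = ", "))
}
```

If something looks truncated, removing the affected files and re-running in a directory with enough free space lets the function download and rebuild them from scratch.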

 
