how to create an OrgDb package?
Haibo Liu ▴ 20
@haibol2017-23658
United States

Dear Bioconductor community,

I am trying to build an OrgDb package for a custom genome from a GTF annotation file. As a test, I started with the human genome, using makeOrgPackage from the AnnotationForge package. The first few lines of the input file “GRCh38.gene.info.txt” are shown below. The resulting package installs successfully, but it cannot be queried.

Any thoughts, comments, or solutions would be much appreciated.

Haibo

library("AnnotationForge")
library("AnnotationDbi")

gene_information <- read.delim("GRCh38.gene.info.txt", header = FALSE)

head(gene_information)
               V1              V2          V3
1 ENSG00000243485 ENST00000473358 MIR1302-2HG
2 ENSG00000243485 ENST00000469289 MIR1302-2HG
3 ENSG00000237613 ENST00000417324 FAM138A
4 ENSG00000237613 ENST00000461467 FAM138A
5 ENSG00000186092 ENST00000641515 OR4F5
6 ENSG00000186092 ENST00000335137 OR4F5
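
A three-column table like the one above (gene ID, transcript ID, gene name) can be derived from the GTF itself. The following is a minimal base-R sketch, assuming Ensembl/GENCODE-style attributes (`gene_id "..."; transcript_id "..."; gene_name "...";`) in the ninth column. The helper name `gtf_attr` is made up for illustration, and the two example records use placeholder coordinates.

```r
# Hypothetical helper: pull one attribute value out of GTF column 9.
# Returns NA where the key is absent.
gtf_attr <- function(attributes, key) {
  pattern <- paste0('(^|;)[[:space:]]*', key, ' "([^"]*)"')
  hit <- regmatches(attributes, regexec(pattern, attributes))
  vapply(hit, function(m) if (length(m) == 3L) m[3L] else NA_character_,
         character(1))
}

# Two illustrative transcript records; coordinates are placeholders.
gtf_lines <- c(
  'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "ENSG00000243485"; transcript_id "ENST00000473358"; gene_name "MIR1302-2HG";',
  'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "ENSG00000243485"; transcript_id "ENST00000469289"; gene_name "MIR1302-2HG";'
)

# Split each record on tabs and keep the attribute column.
fields <- strsplit(gtf_lines, "\t", fixed = TRUE)
attrs  <- vapply(fields, `[`, character(1), 9L)

# Same column layout (V1, V2, V3) as the table read with read.delim().
gene_information <- unique(data.frame(
  V1 = gtf_attr(attrs, "gene_id"),
  V2 = gtf_attr(attrs, "transcript_id"),
  V3 = gtf_attr(attrs, "gene_name"),
  stringsAsFactors = FALSE
))
```

For a real file, the lines would come from `readLines()` on the GTF after dropping comment lines that start with `#`. Records without a `gene_name` attribute yield `NA` in `V3`.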

fSym <- unique(gene_information[, c(1,3)])
colnames(fSym) <- c("GID", "SYMBOL")

ensembl_trans <- unique(gene_information[, c(1:2)])
colnames(ensembl_trans) <- c("GID", "ENSEMBLTRANS")

ensembl <- unique(gene_information[, c(1,1)])
colnames(ensembl) <- c("GID", "ENSEMBL")

#tmpdir <- tempdir()
tmpdir <- "test2"
if (!dir.exists(tmpdir))
{
    dir.create(tmpdir)
}

makeOrgPackage(gene_info = fSym, 
               ensembl_trans = ensembl_trans,
               ensembl = ensembl,
               version = "0.1",
               maintainer = "Some One <so@someplace.org>",
               author = "Some One <so@someplace.org>",
               outputDir = tmpdir,
               tax_id= "9606",
               genus= "Homo",
               species= "sapiens",
               goTable=NULL)

install.packages(file.path(tmpdir, "org.Hsapiens.eg.db"), 
                 type = "source", repos=NULL)


library("org.Hsapiens.eg.db")
AnnotationDbi::select(org.Hsapiens.eg.db, keys = "ENSG00000243485", columns = "SYMBOL", keytype = "ENSEMBL")

## Error message :
Error in names(ans) <- unlist(make.name.tree(x, recursive, what.names),  : 
  attempt to set an attribute on NULL

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices
[6] utils     datasets  methods   base

other attached packages:
[1] RSQLite_2.2.7          org.Hsapiens.eg.db_0.1
[3] AnnotationForge_1.34.1 AnnotationDbi_1.54.1
[5] IRanges_2.26.0         S4Vectors_0.30.0
[7] Biobase_2.52.0         BiocGenerics_0.38.0

loaded via a namespace (and not attached):
 [1] KEGGREST_1.32.0        tidyselect_1.1.1
 [3] xfun_0.25              purrr_0.3.4
 [5] colorspace_2.0-2       vctrs_0.3.8
 [7] generics_0.1.0         htmltools_0.5.1.1
 [9] yaml_2.2.1             XML_3.99-0.8
[11] utf8_1.2.2             blob_1.2.2
[13] rlang_0.4.11           pillar_1.6.3
[15] glue_1.4.2             DBI_1.1.1
[17] bit64_4.0.5            GenomeInfoDbData_1.2.6
[19] lifecycle_1.0.1        zlibbioc_1.38.0
[21] Biostrings_2.60.2      munsell_0.5.0
[23] gtable_0.3.0           memoise_2.0.0
[25] evaluate_0.14          knitr_1.36
[27] fastmap_1.1.0          GenomeInfoDb_1.28.4
[29] fansi_0.5.0            Rcpp_1.0.7
[31] scales_1.1.1           BiocManager_1.30.16
[33] cachem_1.0.5           XVector_0.32.0
[35] bit_4.0.4              ggplot2_3.3.5
[37] png_0.1-7              digest_0.6.27
[39] dplyr_1.0.7            cowplot_1.1.1
[41] grid_4.1.0             tools_4.1.0
[43] bitops_1.0-7           magrittr_2.0.1
[45] RCurl_1.98-1.3         tibble_3.1.3
[47] crayon_1.4.1           pkgconfig_2.0.3
[49] ellipsis_0.3.2         rstudioapi_0.13
[51] assertthat_0.2.1       rmarkdown_2.11
[53] httr_1.4.2             R6_2.5.1
[55] compiler_4.1.0

@james-w-macdonald-5106
United States

It's a bug. The release branch is frozen, so I'll patch the devel branch and it will be part of the new release. To use the fixed package you will need to use devel until the release.

It will take 24-48 hours to propagate (or if you want to hang with the cool kids, you can use BiocManager::install("jmacdon/AnnotationDbi") after an hour or so to get the fix). You are looking for version 1.55.2.

> select(org.Hsapiens.eg.db, keys = "ENSG00000243485", columns = "SYMBOL", keytype = "ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
          ENSEMBL      SYMBOL
1 ENSG00000243485 MIR1302-2HG
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] org.Hsapiens.eg.db_0.1 AnnotationDbi_1.55.2   IRanges_2.26.0        
[4] S4Vectors_0.30.2       Biobase_2.52.0         BiocGenerics_0.38.0   
[7] BiocManager_1.30.16   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7             XVector_0.32.0         compiler_4.1.0        
 [4] GenomeInfoDb_1.28.4    zlibbioc_1.38.0        prettyunits_1.1.1     
 [7] bitops_1.0-7           remotes_2.4.1          tools_4.1.0           
[10] pkgbuild_1.2.0         bit_4.0.4              RSQLite_2.2.8         
[13] memoise_2.0.0          pkgconfig_2.0.3        png_0.1-7             
[16] rlang_0.4.11           DBI_1.1.1              cli_3.0.1             
[19] rstudioapi_0.13        curl_4.3.2             fastmap_1.1.0         
[22] GenomeInfoDbData_1.2.6 withr_2.4.2            httr_1.4.2            
[25] Biostrings_2.60.2      vctrs_0.3.8            rprojroot_2.0.2       
[28] bit64_4.0.5            R6_2.5.1               processx_3.5.2        
[31] callr_3.7.0            blob_1.2.2             ps_1.6.0              
[34] KEGGREST_1.32.0        RCurl_1.98-1.5         cachem_1.0.6          
[37] crayon_1.4.1

I cheated and installed in a release R/Bioconductor. I wouldn't recommend that, and nobody will provide support if you do so. Put a different way, if you have a problem and we see that you have mixed'n'matched package versions, the first response will be to ask you to run 'BiocManager::valid()' and follow its advice, which will undo the mixing. So you should either wait for the release next week, or install R-4.1.2 and Bioc-devel.
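
To check whether an installation already carries the fix, the installed AnnotationDbi version can be compared against 1.55.2. A small base-R sketch follows; the helper name `has_select_fix` is made up for illustration, and the cutoff is simply the version quoted in this thread.

```r
# Hypothetical helper: TRUE if `installed` is at or above the fixed version.
has_select_fix <- function(installed, fixed = "1.55.2") {
  package_version(installed) >= package_version(fixed)
}

has_select_fix("1.54.1")  # version from the original sessionInfo()
has_select_fix("1.55.2")  # version after the patch

# On a real installation (requires AnnotationDbi and BiocManager):
# has_select_fix(as.character(packageVersion("AnnotationDbi")))
# BiocManager::valid()  # reports packages that do not match the Bioconductor version in use
```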


Thank you so much, James, for the quick response and the fix. I will wait for the release next week.

Haibo


Hi Bioconductor team and Haibo,

I am also trying to generate an OrgDb object for pig using the same script shared above.

library(AnnotationForge)
gene_information <- read.table("all.genes.unique.stable.geneID.transcriptID.gene.name.mart_export.txt",sep="\t",fill = TRUE,header = FALSE)
# > dim(gene_information)
# [1] 63041     3

# > head(gene_information)
# V1                 V2  V3
# 2 ENSSSCG00000018060 ENSSSCT00000019655    
# 3 ENSSSCG00000018061 ENSSSCT00000019656    
# 4 ENSSSCG00000018062 ENSSSCT00000019657    
# 5 ENSSSCG00000018063 ENSSSCT00000019658    
# 6 ENSSSCG00000018064 ENSSSCT00000019659    
# 7 ENSSSCG00000018065 ENSSSCT00000019660 ND1

fSym <- unique(gene_information[, c(1,3)])
colnames(fSym) <- c("GID", "SYMBOL")

ensembl_trans <- unique(gene_information[, c(1:2)])
colnames(ensembl_trans) <- c("GID", "ENSEMBLTRANS")

ensembl <- unique(gene_information[, c(1,1)])
colnames(ensembl) <- c("GID", "ENSEMBL")

tmpdir <- "test1"
if (!dir.exists(tmpdir))
{
  dir.create(tmpdir)
}


makeOrgPackage(gene_info = fSym, 
               ensembl_trans = ensembl_trans,
               ensembl = ensembl,
               version = "0.1",
               maintainer = "Some One so@someplace.org",
               author = "Some One so@someplace.org",
               outputDir = tmpdir,
               tax_id= "9823",
               genus= "Sus",
               species= "scrofa",
               goTable=NULL,
               Description= "A org.Sscrofa.eg.db")

When I include the "Description" argument, it reports the error below, even though the first column of each table is 'GID':

Error in .makeOrgPackage(data, version = version, maintainer = maintainer,  : 
  The 1st column must always be the gene ID 'GID'
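
One thing worth checking here, under the assumption (not confirmed in this thread) that makeOrgPackage treats every extra named argument as an annotation table: `Description` is a character string, not a data frame with a `GID` column, so it would fail the first-column check. The sketch below is a hypothetical pre-flight check in base R; `bad_org_tables` is not part of AnnotationForge.

```r
# Hypothetical pre-flight check: return the names of arguments that are not
# data frames whose first column is named "GID".
bad_org_tables <- function(...) {
  tables <- list(...)
  ok <- vapply(tables, function(x) {
    is.data.frame(x) && ncol(x) >= 1L && identical(colnames(x)[1L], "GID")
  }, logical(1))
  names(tables)[!ok]
}

# One-row stand-in for the symbol table built above.
fSym <- data.frame(GID = "ENSSSCG00000018065", SYMBOL = "ND1")

bad_org_tables(gene_info = fSym)                                       # nothing flagged
bad_org_tables(gene_info = fSym, Description = "A org.Sscrofa.eg.db")  # flags "Description"
```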

When I omit the "Description" argument:

makeOrgPackage(gene_info = fSym, 
               ensembl_trans = ensembl_trans,
               ensembl = ensembl,
               version = "0.1",
               maintainer = "Some One so@someplace.org",
               author = "Some One so@someplace.org",
               outputDir = tmpdir,
               tax_id= "9823",
               genus= "Sus",
               species= "scrofa",
               goTable=NULL)
install.packages(file.path(tmpdir, "org.Sscrofa.eg.db"), 
                 type = "source", repos=NULL)

it complains about the DESCRIPTION file. Can you give me some suggestions?

Installing package into ‘/work/abg/pyang19/R/x86_64-pc-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)
* installing *source* package ‘org.Sscrofa.eg.db’ ...
** using staged installation
Error : Invalid DESCRIPTION file

Malformed maintainer field.

See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.

ERROR: installing package DESCRIPTION failed for package ‘org.Sscrofa.eg.db’
* removing ‘/work/abg/pyang19/R/x86_64-pc-linux-gnu-library/4.0/org.Sscrofa.eg.db’
Warning message:
In install.packages(file.path(tmpdir, "org.Sscrofa.eg.db"), type = "source",  :
  installation of package ‘test3/org.Sscrofa.eg.db’ had non-zero exit status

Thanks in advance!

> sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.9 (Maipo)

Matrix products: default
BLAS/LAPACK: /opt/rit/spack-app/linux-rhel7-x86_64/gcc-7.3.0/openblas-0.3.10-cbsvgq24lsaf6xz65lbdlb6jh4b7qsaa/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] AnnotationForge_1.32.0                   
 [2] org.Ss.eg.db_3.12.0                      
 [3] TxDb.Sscrofa.UCSC.susScr11.refGene_3.12.0
 [4] GenomicFeatures_1.42.2                   
 [5] AnnotationDbi_1.52.0                     
 [6] BSgenome.Sscrofa.UCSC.susScr11_1.4.2     
 [7] ArchR_1.0.1                              
 [8] magrittr_2.0.1                           
 [9] rhdf5_2.34.0                             
[10] Matrix_1.4-0                             
[11] data.table_1.14.0                        
[12] SummarizedExperiment_1.20.0              
[13] Biobase_2.50.0                           
[14] MatrixGenerics_1.2.1                     
[15] matrixStats_0.58.0                       
[16] ggplot2_3.3.5                            
[17] BSgenome_1.58.0                          
[18] rtracklayer_1.50.0                       
[19] Biostrings_2.58.0                        
[20] XVector_0.30.0                           
[21] GenomicRanges_1.42.0                     
[22] GenomeInfoDb_1.26.7                      
[23] IRanges_2.24.1                           
[24] S4Vectors_0.28.1                         
[25] BiocGenerics_0.36.1                      
[26] SeuratObject_4.0.4                       
[27] Seurat_4.1.0                             

loaded via a namespace (and not attached):
  [1] BiocFileCache_1.14.0     plyr_1.8.6               igraph_1.2.6            
  [4] lazyeval_0.2.2           splines_4.0.3            BiocParallel_1.24.1     
  [7] listenv_0.8.0            scattermore_0.7          digest_0.6.27           
 [10] htmltools_0.5.1.1        fansi_0.4.2              memoise_2.0.0           
 [13] tensor_1.5               cluster_2.1.1            ROCR_1.0-11             
 [16] globals_0.14.0           askpass_1.1              spatstat.sparse_2.0-0   
 [19] prettyunits_1.1.1        colorspace_2.0-0         rappdirs_0.3.3          
 [22] blob_1.2.1               ggrepel_0.9.1            dplyr_1.0.5             
 [25] crayon_1.4.1             RCurl_1.98-1.3           jsonlite_1.7.2          
 [28] spatstat.data_2.0-0      survival_3.2-10          zoo_1.8-9               
 [31] glue_1.4.2               polyclip_1.10-0          gtable_0.3.0            
 [34] zlibbioc_1.36.0          leiden_0.3.7             DelayedArray_0.16.3     
 [37] Rhdf5lib_1.12.1          future.apply_1.7.0       abind_1.4-5             
 [40] scales_1.1.1             DBI_1.1.1                miniUI_0.1.1.1          
 [43] Rcpp_1.0.7               progress_1.2.2           viridisLite_0.3.0       
 [46] xtable_1.8-4             reticulate_1.18          spatstat.core_1.65-5    
 [49] bit_4.0.4                htmlwidgets_1.5.3        httr_1.4.2              
 [52] RColorBrewer_1.1-2       ellipsis_0.3.1           ica_1.0-2               
 [55] pkgconfig_2.0.3          XML_3.99-0.6             dbplyr_2.1.0            
 [58] uwot_0.1.10              deldir_0.2-10            utf8_1.2.1              
 [61] tidyselect_1.1.0         rlang_0.4.10             reshape2_1.4.4          
 [64] later_1.1.0.1            munsell_0.5.0            tools_4.0.3             
 [67] cachem_1.0.4             generics_0.1.0           RSQLite_2.2.4           
 [70] ggridges_0.5.3           stringr_1.4.0            fastmap_1.1.0           
 [73] goftest_1.2-2            bit64_4.0.5              fitdistrplus_1.1-3      
 [76] purrr_0.3.4              RANN_2.6.1               pbapply_1.4-3           
 [79] future_1.21.0            nlme_3.1-152             mime_0.10               
 [82] xml2_1.3.2               biomaRt_2.46.3           compiler_4.0.3          
 [85] rstudioapi_0.13          curl_4.3                 plotly_4.10.0           
 [88] png_0.1-7                spatstat.utils_2.1-0     tibble_3.1.0            
 [91] stringi_1.5.3            lattice_0.20-41          vctrs_0.3.6             
 [94] pillar_1.5.1             lifecycle_1.0.0          rhdf5filters_1.2.1      
 [97] spatstat.geom_1.65-5     lmtest_0.9-38            RcppAnnoy_0.0.18        
[100] cowplot_1.1.1            bitops_1.0-6             irlba_2.3.3             
[103] httpuv_1.5.5             patchwork_1.1.1          R6_2.5.0                
[106] promises_1.2.0.1         KernSmooth_2.23-18       gridExtra_2.3           
[109] parallelly_1.24.0        codetools_0.2-18         MASS_7.3-53.1           
[112] assertthat_0.2.1         openssl_1.4.3            withr_2.4.1             
[115] GenomicAlignments_1.26.0 sctransform_0.3.3        Rsamtools_2.6.0         
[118] GenomeInfoDbData_1.2.4   hms_1.0.0                mgcv_1.8-34             
[121] grid_4.0.3               rpart_4.1-15             tidyr_1.1.3             
[124] Cairo_1.5-12.2           Rtsne_0.15               shiny_1.6.0    

Regarding this:

Error : Invalid DESCRIPTION file

Malformed maintainer field.

I have also run into this error; see the thread "problem with makeOrgPackageFromNCBI when making an annotation package".

Basically you have to put the email address between < and >. You forgot to include these. Thus it should be:

maintainer = "Some One <so@someplace.org>",
author = "Some One <so@someplace.org>",
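
A quick way to catch this before building the package is to test the string for the 'Name <email>' shape. This is a rough sketch in base R: the regular expression and the helper name `looks_like_maintainer` are illustrative assumptions, not the exact validation R performs at install time.

```r
# Rough check for the 'Name <email>' shape; illustrative only.
looks_like_maintainer <- function(x) {
  grepl("^[^<>]+ <[^<>@[:space:]]+@[^<>@[:space:]]+>$", x)
}

looks_like_maintainer("Some One so@someplace.org")    # no angle brackets
looks_like_maintainer("Some One <so@someplace.org>")  # bracketed address
```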

Having said this, are you aware that for each Ensembl release these databases are made available for use in Bioconductor through AnnotationHub by Johannes Rainer? So there may be no need for you to do what you are doing. For more on this, see e.g. the posts "ensembldb EnsDb databases for Ensembl release 101 added to AnnotationHub" and "EnsDb.Rnorvegicus for Rnor6".
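
A minimal sketch of looking these databases up, assuming the AnnotationHub package is installed and a network connection is available. The wrapper name `find_ensdb` is hypothetical; `AnnotationHub()` and `query()` are the package's own functions. Note that the records found this way are EnsDb objects rather than OrgDb packages.

```r
# Sketch only: needs the AnnotationHub package and a network connection.
# `find_ensdb` is a hypothetical helper name.
find_ensdb <- function(species) {
  ah <- AnnotationHub::AnnotationHub()
  AnnotationHub::query(ah, c("EnsDb", species))
}

# hits <- find_ensdb("Sus scrofa")
# hits                          # lists matching records with their AH identifiers
# edb <- hits[[length(hits)]]   # retrieves one record; pick the release you need
```

The calls are left commented out because they download metadata on first use.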

ADD REPLY
0
Entering edit mode

Hey Guido, a million thanks!! It's working!! Penny
