Question

makeOrgPackageFromGAF() error with gene ID "GID"

Anna • 0

I am trying to create a TxDb for a non-model organism. The genome is annotated on NCBI, and the associated GO files, etc., are available there. So I am trying to use the makeTxDbFromGAF() function. No matter what I try, I keep getting the same error: the first column must be the gene ID "GID".


library("AnnotationForge")
library("BaseSet")
library("dplyr")

gaf<-getGAF("gene_ontology.gaf")
gaf_data <- as.data.frame(gaf)

head(gaf_data)


colnames(gaf_data)[colnames(gaf_data) == "DB_Object_ID"] <- "GID"

gaf_data <- gaf_data %>% select(GID, everything())

head(gaf_data)


makeOrgPackageFromGAF <- function(gaf_data, output_path) {
  makeOrgPackage(
    gene_info = gaf_data,
    organism = "Haemorhous mexicanus",
    version = "0.1",
    maintainer = "Anna Perez-Umphrey <aperezumphrey@gmail.com>",
    author = "Anna Perez-Umphrey <aperezumphrey@gmail.com>",
    outputDir = output_path,
    tax_id = "30427"
  )
}

output_path <- "C:/Users/apere/Desktop/GO"
makeOrgPackageFromGAF(gaf_data, output_path)

#Error in .makeOrgPackage(data, version = version, maintainer = maintainer,  : 
#  The 1st column must always be the gene ID 'GID'

sessionInfo()

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BaseSet_0.9.0          dplyr_1.1.3            readr_2.1.5            GenomicFeatures_1.54.4 GenomicRanges_1.54.1  
 [6] GenomeInfoDb_1.38.5    AnnotationForge_1.44.0 AnnotationDbi_1.64.1   IRanges_2.36.0         S4Vectors_0.40.2      
[11] Biobase_2.62.0         BiocGenerics_0.48.1   

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                   bitops_1.0-7                biomaRt_2.58.2              rlang_1.1.1                
 [5] magrittr_2.0.3              matrixStats_1.2.0           compiler_4.3.1              RSQLite_2.3.4              
 [9] mgcv_1.8-42                 png_0.1-8                   vctrs_0.6.4                 stringr_1.5.1              
[13] pkgconfig_2.0.3             crayon_1.5.3                fastmap_1.1.1               dbplyr_2.5.0               
[17] XVector_0.42.0              utf8_1.2.4                  Rsamtools_2.18.0            tzdb_0.4.0                 
[21] nloptr_2.1.1                bit_4.0.5                   zlibbioc_1.48.0             cachem_1.0.8               
[25] progress_1.2.3              blob_1.2.4                  DelayedArray_0.28.0         BiocParallel_1.36.0        
[29] parallel_4.3.1              prettyunits_1.2.0           R6_2.5.1                    stringi_1.8.4              
[33] rtracklayer_1.62.0          pkgload_1.4.0               boot_1.3-28.1               lubridate_1.9.3            
[37] numDeriv_2016.8-1.1         estimability_1.5.1          Rcpp_1.0.11                 SummarizedExperiment_1.32.0
[41] Matrix_1.6-5                splines_4.3.1               timechange_0.3.0            glmmTMB_1.1.9              
[45] tidyselect_1.2.1            rstudioapi_0.16.0           abind_1.4-5                 yaml_2.3.7                 
[49] TMB_1.9.14                  codetools_0.2-19            curl_5.2.0                  lattice_0.21-8             
[53] tibble_3.2.1                withr_3.0.0                 KEGGREST_1.42.0             coda_0.19-4.1              
[57] BiocFileCache_2.10.2        xml2_1.3.6                  Biostrings_2.70.1           pillar_1.9.0               
[61] BiocManager_1.30.23         filelock_1.0.3              MatrixGenerics_1.14.0       generics_0.1.3             
[65] vroom_1.6.5                 RCurl_1.98-1.14             hms_1.1.3                   minqa_1.2.7                
[69] xtable_1.8-4                glue_1.6.2                  emmeans_1.10.3              tools_4.3.1                
[73] BiocIO_1.12.0               lme4_1.1-35.5               GenomicAlignments_1.38.2    mvtnorm_1.2-5              
[77] XML_3.99-0.17               grid_4.3.1                  nlme_3.1-162                GenomeInfoDbData_1.2.11    
[81] restfulr_0.0.15             cli_3.6.1                   rappdirs_0.3.3              fansi_1.0.5                
[85] S4Arrays_1.2.0              digest_0.6.33               SparseArray_1.2.3           rjson_0.2.21               
[89] memoise_2.0.1               lifecycle_1.0.4             httr_1.4.7                  GO.db_3.18.0               
[93] bit64_4.0.5                 MASS_7.3-60

AnnotationHub AnnotationDbi AnnotationForge • 391 views

Posted 5 weeks ago • updated 4 weeks ago • Anna

Are you trying to make a TxDb or an OrgDb? You say the former, but then appear to be trying to make the latter.

James W. MacDonald • 4 weeks ago

score 0 · Answer 1 · 2024-08-06

Assuming that you want an OrgDb and not a TxDb, it's simple enough to make your own. Note that the following requires the download of several large files from NCBI, so you can either set options(timeout = 1e6) or just go to https://ftp.ncbi.nlm.nih.gov/gene/DATA/ and download all the gene2xxx.gz files as well as gene_info.gz. That's what I normally do because having R download serially is slower than downloading in parallel. But ymmv.
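If you would rather script the download than fetch the files by hand, something along these lines might work. It is only a sketch: the file list is limited to the five files named in the build log further down, and `fetch_ncbi_files()` and `dest_dir` are hypothetical names made up for illustration, not part of AnnotationForge.

```r
# Base URL from the answer; file names taken from the build log below.
ncbi_base  <- "https://ftp.ncbi.nlm.nih.gov/gene/DATA/"
ncbi_files <- c("gene2pubmed.gz", "gene2accession.gz", "gene2refseq.gz",
                "gene_info.gz", "gene2go.gz")
ncbi_urls  <- paste0(ncbi_base, ncbi_files)

# Hypothetical helper: download any file that is not already in dest_dir.
fetch_ncbi_files <- function(dest_dir = ".") {
  # Raise the download timeout for these large files, then restore it on exit.
  old <- options(timeout = 1e6)
  on.exit(options(old), add = TRUE)
  dest <- file.path(dest_dir, ncbi_files)
  for (i in seq_along(ncbi_urls)) {
    if (!file.exists(dest[i])) {
      download.file(ncbi_urls[i], dest[i], mode = "wb")  # binary mode matters on Windows
    }
  }
  invisible(dest)
}
```

Calling `fetch_ncbi_files(".")` would still download serially, so a browser or a parallel downloader is likely to be faster.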

Anyway, you can then use AnnotationForge to build the package. Note that there are two steps to that process: first all the data are dumped into an omnibus SQLite database called NCBI.sqlite, and then that database is queried to get data for your species. I did the first step yesterday (it takes a while), so now I can just use the NCBI.sqlite db directly by setting rebuildCache = FALSE. You will not be able to do that on your first run! So don't change the rebuildCache argument.

To be clear! The rebuildCache argument is asking if you want to re-generate the NCBI.sqlite database. If you have one, and it's not too old, it's way faster to just use it again. But if you have just downloaded all the files, you won't have that database, so you have to generate it the first time, in which case you keep rebuildCache = TRUE.
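The rule above could be sketched roughly as follows. This assumes NCBI.sqlite sits in the same directory as the downloaded files and that 30 days counts as "not too old"; both are guesses. `cache_is_fresh()` and `build_orgdb()` are hypothetical helpers, and the version, maintainer and author values are the placeholders used in the transcript below.

```r
# Hypothetical helper: TRUE if the cache file exists and is recent enough.
cache_is_fresh <- function(path, max_age_days = 30) {
  if (!file.exists(path)) return(FALSE)
  age_days <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "days"))
  age_days <= max_age_days
}

# Hypothetical wrapper: rebuild NCBI.sqlite only when it is missing or stale.
build_orgdb <- function(ncbi_dir = ".", out_dir = ".") {
  rebuild <- !cache_is_fresh(file.path(ncbi_dir, "NCBI.sqlite"))
  AnnotationForge::makeOrgPackageFromNCBI(
    version      = "0.0.1",
    maintainer   = "me <me@mine.org>",
    author       = "me",
    outputDir    = out_dir,
    tax_id       = "30427",
    genus        = "Haemorhous",
    species      = "mexicanus",
    NCBIFilesDir = ncbi_dir,
    rebuildCache = rebuild
  )
}
```

On a first run there is no NCBI.sqlite, so `rebuild` comes out TRUE and the full database is generated.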

> library(AnnotationForge)
## change to the correct dir if you already have the files!
> makeOrgPackageFromNCBI(version = "0.0.1", maintainer = "me <me@mine.org>", author = "me", outputDir = ".", tax_id = "30427", genus = "Haemorhous", species = "mexicanus", NCBIFilesDir = ".", rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
making the OrgDb package ...
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating chromosomes table:
chromosomes table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
table metadata filled

'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_bp table:
go_bp table filled
Populating go_cc table:
go_cc table filled
Populating go_mf table:
go_mf table filled
'select()' returned many:1 mapping between keys and columns
Populating go_bp_all table:
go_bp_all table filled
Populating go_cc_all table:
go_cc_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled
Creating package in ./org.Hmexicanus.eg.db 
Now deleting temporary database file
complete!
[1] "org.Hmexicanus.eg.sqlite"
Warning messages:
1: In dir.create(tdir) : 'TEMPANNOTPACKAGEDIRFORFILTERING' already exists
2: In file.remove(dbFileName) :
  cannot remove file './org.Hmexicanus.eg.sqlite', reason 'Permission denied'

## You can ignore those warnings.
## Next we need to install the package. Since you are on Windows, you need to specify the type argument as well.

> install.packages("org.Hmexicanus.eg.db/", repos = NULL, type = "source")
Installing package into 'C:/Users/jmacdon/AppData/Local/R/win-library/4.4'
(as 'lib' is unspecified)
* installing *source* package 'org.Hmexicanus.eg.db' ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Warning messages:
1: package 'IRanges' was built under R version 4.4.1 
2: package 'S4Vectors' was built under R version 4.4.1 
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (org.Hmexicanus.eg.db)

## Load the package
> library(org.Hmexicanus.eg.db)

## how many NCBI gene IDs do we have?
> length(keys(org.Hmexicanus.eg.db))
[1] 19966

## Seems legit
## we can also use the show() method to get info

> org.Hmexicanus.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Haemorhous mexicanus
| SPECIES: Haemorhous mexicanus
| CENTRALID: GID
| Taxonomy ID: 30427
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information