Hi,
I'm trying to create an annotation database for Agaricus bisporus from NCBI data with AnnotationForge, but I get an error and a few warnings:
Error in makeOrgDbFromDataFrames(data, tax_id, genus, species, dbFileName, :
'goTable' GO Ids must be formatted like 'GO:XXXXXXX'
In addition: Warning messages:
1: RSQLite::dbGetPreparedQuery() is deprecated, please switch to DBI::dbGetQuery(params = bind.data).
2: Named parameters not used in query: genes
3: Named parameters not used in query: name, value
How do I work around the deprecated RSQLite::dbGetPreparedQuery() function? The full script is given below along with sessionInfo(). Furthermore, when I open the gene2go file the GO IDs seem fine, so I'm not sure why the goTable check rejects them. Does anybody have an idea why the GO IDs are not recognized? (I have pasted the top rows from the gene2go file that AnnotationForge obtained from NCBI at the bottom of this post.)
My script is:
> library(AnnotationDbi)
> library(GenomeInfoDb)
> library(biomaRt)
> library(survival)
> library(UniProt.ws)
Loading required package: RCurl
Loading required package: bitops
> library(knitr)
> library(DBI)
> library(mclust)
> makeOrgPackageFromNCBI(version = "0.1",
+ author = "my name",
+ maintainer = "email.com",
+ outputDir = ".",
+ tax_id = "936046",
+ genus = "Agaricus",
+ species = "bisporus")
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 5 data files
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated with
ensembl IDs.
making the OrgDb package ...
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating chromosomes table:
chromosomes table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
table metadata filled
Error in makeOrgDbFromDataFrames(data, tax_id, genus, species, dbFileName, :
'goTable' GO Ids must be formatted like 'GO:XXXXXXX'
In addition: Warning messages:
1: RSQLite::dbGetPreparedQuery() is deprecated, please switch to DBI::dbGetQuery(params = bind.data).
2: Named parameters not used in query: genes
3: Named parameters not used in query: name, value
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_Ireland.1252 LC_CTYPE=English_Ireland.1252
[3] LC_MONETARY=English_Ireland.1252 LC_NUMERIC=C
[5] LC_TIME=English_Ireland.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] mclust_5.2.2 DBI_0.5-1 knitr_1.15.1
[4] UniProt.ws_2.14.0 RCurl_1.95-4.8 bitops_1.0-6
[7] survival_2.40-1 biomaRt_2.30.0 GenomeInfoDb_1.10.3
[10] AnnotationHub_2.6.4 AnnotationForge_1.16.0 AnnotationDbi_1.36.2
[13] IRanges_2.8.1 S4Vectors_0.12.1 Biobase_2.34.0
[16] BiocGenerics_0.20.0 RSQLite_1.1-2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 splines_3.3.2
[3] lattice_0.20-34 xtable_1.8-2
[5] R6_2.2.0 httr_1.2.1
[7] tools_3.3.2 grid_3.3.2
[9] htmltools_0.3.5 yaml_2.1.14
[11] digest_0.6.12 interactiveDisplayBase_1.12.0
[13] Matrix_1.2-8 shiny_1.0.0
[15] memoise_1.0.0 mime_0.5
[17] BiocInstaller_1.24.0 XML_3.98-1.5
[19] httpuv_1.3.3
An example of the gene2go file obtained from NCBI is:
#tax_id | GeneID | GO_ID | Evidence | Qualifier | GO_term | PubMed | Category |
3702 | 814629 | GO:0005634 | ISM | - | nucleus | - | Component |
3702 | 814629 | GO:0008150 | ND | - | biological_process | - | Process |
3702 | 814630 | GO:0003677 | IEA | - | DNA binding | - | Function |
3702 | 814630 | GO:0003700 | ISS | - | transcription factor activity, sequence-specific DNA binding | 11118137 | Function |
3702 | 814630 | GO:0005634 | IEA | - | nucleus | - | Component |
3702 | 814630 | GO:0005634 | ISM | - | nucleus | - | Component |
3702 | 814630 | GO:0006351 | IEA | - | transcription, DNA-templated | - | Process |
While you did show some rows from gene2go, you should note that the taxonomic ID for those rows (the first column) is 3702, which is Arabidopsis thaliana, not Agaricus bisporus. There are no rows in the gene2go file that have 936046 in the first column, hence no data parsed out for your GO table.
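If you want to check this yourself, counting rows per taxonomy ID is enough. Below is a minimal sketch in base R that uses three of the sample rows from the question as inline text. It assumes the real gene2go file is tab-delimited (the pipes above are just how the rows were pasted), and the helper name `count_tax_rows` is made up for this example.

```r
# Three sample rows copied from the question, written with tabs as separators.
gene2go_txt <- paste(
  "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory",
  "3702\t814629\tGO:0005634\tISM\t-\tnucleus\t-\tComponent",
  "3702\t814629\tGO:0008150\tND\t-\tbiological_process\t-\tProcess",
  "3702\t814630\tGO:0003677\tIEA\t-\tDNA binding\t-\tFunction",
  sep = "\n"
)

# For the real download you would read the gzipped file instead, e.g.
# read.delim(gzfile("gene2go.gz"), ...); that path is an assumption.
gene2go <- read.delim(text = gene2go_txt, header = TRUE, quote = "",
                      comment.char = "", check.names = FALSE,
                      stringsAsFactors = FALSE)
names(gene2go)[1] <- "tax_id"  # drop the leading '#' from the header

# Hypothetical helper: number of rows whose first column matches a tax ID.
count_tax_rows <- function(tbl, tax_id) {
  sum(as.character(tbl$tax_id) == as.character(tax_id))
}

count_tax_rows(gene2go, "3702")    # rows for Arabidopsis thaliana
count_tax_rows(gene2go, "936046")  # rows for the tax ID used in the question
```

Running the same count on the full file is what tells you whether there is anything for makeOrgPackageFromNCBI to extract for your species.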
OK, thanks. I did not see that. Any idea why it obtained Arabidopsis thaliana GO IDs and not Agaricus bisporus ones? I'll try to see if I can source the GO IDs somewhere else and use makeOrgPackage(). Thanks again for your help.
The gene2go file that is downloaded is a generic file that contains Entrez Gene ID -> GO ID mappings for all the species that NCBI has currently annotated. It just so happens that A. thaliana is at the top of the file. The function makeOrgPackageFromNCBI downloads all these generic files, then extracts the data that are specific to whatever species you are interested in, and uses those data to build the orgDb package.
In the case of GO mappings, there are no mappings for your species in gene2go. So the function then queries blast2go, and gets all the mappings they have. It so happens that there are 42 (or 44? I forget) mappings for your species in blast2go, but unfortunately there aren't any Entrez Gene IDs associated with those GO terms, so they get dropped as well. In the end, there aren't any Entrez Gene -> GO mappings that makeOrgPackageFromNCBI can find, so you end up with an orgDb package that has everything but the GO table.
OK, thanks for the information. I really appreciate it. I have found GO annotation for Agaricus bisporus on the JGI website for that species. I've downloaded it and will attempt to construct a database using that.
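For the JGI route, here is a rough sketch of getting a GO table into the shape makeOrgPackage() checks for: columns named GID, GO and EVIDENCE, with GO IDs formatted like 'GO:XXXXXXX'. The input rows, gene IDs and column names (`gene_id`, `go_id`) are invented placeholders, since the actual JGI file layout isn't shown in this thread, and the "IEA" evidence code is an assumption too.

```r
# Placeholder rows standing in for a parsed JGI annotation file; the gene IDs
# and column layout are invented for illustration only.
jgi_go <- data.frame(
  gene_id = c("gene_0001", "gene_0001", "gene_0002", "gene_0003"),
  go_id   = c("GO:0005634", "GO:0003677", "0006351", NA),
  stringsAsFactors = FALSE
)

# Rename to the GID / GO / EVIDENCE columns expected for the GO data frame.
go_table <- data.frame(
  GID      = jgi_go$gene_id,
  GO       = jgi_go$go_id,
  EVIDENCE = "IEA",  # assumed evidence code for computational annotation
  stringsAsFactors = FALSE
)

# Keep only rows whose GO ID looks like 'GO:' followed by seven digits,
# and drop duplicated rows.
well_formed <- !is.na(go_table$GO) & grepl("^GO:[0-9]{7}$", go_table$GO)
go_table <- unique(go_table[well_formed, ])

# With AnnotationForge installed, the table could then be passed along these
# lines (gene_info_df is a hypothetical data frame whose first column is GID):
# AnnotationForge::makeOrgPackage(gene_info = gene_info_df, go = go_table,
#   version = "0.1", maintainer = "Name <name@example.org>", author = "Name",
#   outputDir = ".", tax_id = "936046", genus = "Agaricus",
#   species = "bisporus", goTable = "go")
```

Whether you drop malformed IDs, as here, or repair them (for example by prepending "GO:") depends on what the JGI file actually contains, so it is worth looking at the rejected rows before discarding them.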