Hi Marc and Others,
I am learning to use makeOrgPackageFromNCBI() to make organism packages, but I keep running into problems during the process. I would really appreciate some suggestions, thank you!
Please see the three problems detailed below (that may be a lot, but I really hope to get some hints from you, thank you again):
1> I ran these functions in R version 3.3.1 on Windows 7.
2> I have downloaded the files needed by the function: gene2pubmed.gz, gene2accession.gz, gene2refseq.gz, gene_info.gz, gene2go.gz, NCBI.sqlite, idmapping_selected.tab.gz. The code is shown below:
a. The first error: 'error in statement: no such table: altGO_date'
library(AnnotationForge)
library(AnnotationDbi)
library(GenomeInfoDb)
library(biomaRt)
makeOrgPackageFromNCBI(
version="0.1",
maintainer="Guido Hooiveld <guido.hooiveld@wur.nl>",
author="Guido Hooiveld <guido.hooiveld@wur.nl>",
outputDir=".",
tax_id='10029',
genus="Cricetulus",
species="griseus",
NCBIFilesDir = ".",
rebuildCache=F)
preparing data from NCBI ...
starting download for 5 data files
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: no such table: altGO_date
b. The second error: when I set rebuildCache=T, I get 'Error in file(description = tmp, open = "r") : object 'tmp' not found'.
makeOrgPackageFromNCBI(
version="0.1",
maintainer="Guido Hooiveld <guido.hooiveld@wur.nl>",
author="Guido Hooiveld <guido.hooiveld@wur.nl>",
outputDir=".",
tax_id='10029',
genus="Cricetulus",
species="griseus")
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 5 data files
getting data for gene2pubmed.gz
Error in file(description = tmp, open = "r") : object 'tmp' not found
c. The third error: when I tried another organism, a different problem occurred: 'Error in FUN(X[[i]], ...) : ?Please use 'available.species' to see viable species names or tax Ids'.
makeOrgPackageFromNCBI(version = "0.0.1",
author = "me",
maintainer = "me <me@mine.org>",
outputDir = ".",
tax_id = '7227',
genus = "Drosophila",
species = 'Drosophila melanogaster',
NCBIFilesDir = ".",
rebuildCache=F)
or
makeOrgPackageFromNCBI(version = "0.0.1",
author = "me",
maintainer = "me <me@mine.org>",
outputDir = ".",
tax_id = '7227',
genus = "Drosophila",
species = 'melanogaster',
NCBIFilesDir = ".",
rebuildCache=F)
Both of them show the same problem:
preparing data from NCBI ...
starting download for 6 data files
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene2unigene
extracting data for our organism from : gene2unigene
getting all data for our organism from : gene2unigene
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Loading required package: httr
Attaching package: 'httr'
The following object is masked from 'package:Biobase':
    content
Loading required package: RCurl
Loading required package: bitops
Error in FUN(X[[i]], ...) :
?Please use 'available.species' to see viable species names or tax Ids
Here is my session information:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.28.0 GenomeInfoDb_1.8.3 AnnotationForge_1.14.2 AnnotationDbi_1.34.4
[5] IRanges_2.6.1 S4Vectors_0.10.2 Biobase_2.32.0 BiocGenerics_0.18.0
loaded via a namespace (and not attached):
[1] rsconnect_0.4.3 DBI_0.4-1 tools_3.3.1 RCurl_1.95-4.8 RSQLite_1.0.0 bitops_1.0-6
[7] XML_3.98-1.4
Looking forward to your response~~
Thanks,
Shisheng
If you want to respond, use the ADD COMMENT button and type in the box that comes up. If you use the Add your answer box, it looks like you are answering your own question, which you are not doing.
As Marc pointed out, you can simply use the OrgDb on AnnotationHub.
As to your error, the help page for that function says
genus: Single string indicating the genus.
species: Single string indicating the species.
And the species in this situation is "melanogaster", not "Drosophila melanogaster", which is the genus and species.
Hi James,
Thank you for the pointer and the answer. This is my first time using the Bioconductor support site; I will keep that in mind next time!
As for my question: first, I know I could find the Drosophila OrgDb through 'AnnotationHub'; I just want to try the function 'makeOrgPackageFromNCBI' and learn whether it works on my computer ^_^;
Second, I have read the help page for that function and tried using only 'melanogaster' for 'species', but it did not help; the same problem still occurred. What's more, I checked 'available.species':
> spec <- available.species()
> spec[which(as.numeric(spec$taxon)==7227),]
taxon species
10836 7227 Drosophila melanogaster
As you can see, it lists the 'species' as 'Drosophila melanogaster'. I even tried 'Drosophila_melanogaster' and 'Drosophilamelanogaster', but the problem is always there;
Third, 'Drosophila melanogaster' is just an example, not my target organism. As described above, I posted three problems from trying different examples (none of them is my study organism) on my computer; I just want to learn this awesome function for my future research.
Therefore, I really need your help to fix the three problems with the function 'makeOrgPackageFromNCBI' on my computer. Please do not advise me to give up on the function...
Thank you quite a lot^_^
Shisheng
Well, you don't say what this mysterious species is, but if I assume it's Cricetulus griseus, then
OR, as Marc already pointed out, there are literally (yes, literally!) thousands of species in the AnnotationHub, this one being represented twice.
Hi James,
That is strange. So it works on your computer but not on mine. Could you help me check whether the 'makeOrgPackageFromNCBI' function can build something for my study organism? It is 'Mycoplasma hyopneumoniae 168-L', a very rare species:
> spec <- available.species()
> spec[which(as.numeric(spec$taxon)==1116211),]
taxon species
1031039 1116211 Mycoplasma hyopneumoniae 168-L
I have checked for it in the AnnotationHub package, and it showed no records:
library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2016-07-20
> query(hub, c("OrgDb","Mycoplasma hyopneumoniae 168-L"))
AnnotationHub with 0 records
# snapshotDate(): 2016-07-20
Many thanks,
Shisheng
You won't be able to build an OrgDb package for a species that isn't in NCBI's databases:
Hi James,
For the AnnotationHub package, how do I choose the object? For example:
> query(hub, c("OrgDb","Solanum lycopersicum"))
AnnotationHub with 2 records
# snapshotDate(): 2016-07-20
# $dataprovider: NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Solanum lycopersicum
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, tags, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH13359"]]'
title
AH13359 | org.Solanum_lycopersicum.eg.sqlite
AH48047 | org.Solanum_lycopersicum.eg.sqlite
It shows me two results, "AH13359" and "AH48047". Does this mean "Solanum lycopersicum" has two sub-taxonomies? How can I tell these two apart, and which one should I choose?
Thanks a lot,
Shisheng
Hi James,
Sorry to come back to this rather old thread, but I got stuck when I tried to reproduce your run of the makeOrgPackageFromNCBI() function. See below. FYI: I tried to build an OrgDb for Chinese hamster myself rather than use the AnnotationHub because I would like to see the differences between a 'fresh' OrgDb and a slightly dated version (triggered by C: Org.db: why a supposed unique key (ID) has multiple entries?).
I got this error:
First of all: I don't know why 'Ailuropoda' is returned, since I didn't provide this argument..??
Anyway, I checked for CH using available.species() and I found two entries having the same TaxID:
After viewing the source code of GenomeInfoDb I noticed the error message is triggered by calling the internal function .getTaxonomyId(). When I ran this function manually I noticed that it goes wrong because an "NA" is returned, which then results in the error message being printed.
As far as I can understand, this is caused by the fact that the taxID 10029 matches 2 descriptors/synonyms (but maybe I am completely wrong!)
Any suggestions on how to get it working? :)
Thanks,
Guido
Separate post because of the character limit on the other post.
Update: using a fresh R session the above-mentioned error persists, even when running makeOrgPackageFromNCBI() without explicitly specifying genus and species. Also, the strange error on 'Ailuropoda melanoleuca' is still returned...??
I think on Windows this regular expression https://github.com/Bioconductor-mirror/AnnotationForge/blob/master/R/NCBI_ftp.R#L1498 doesn't strip the '\r' from the end-of-line, so the species has a trailing '\r' -- hence the odd formatting of the 'unknown species' error. Maybe there are other issues, and traceback() after the error would help.
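To illustrate the suspected line-ending problem, here is a minimal sketch in base R. It does not reproduce the AnnotationForge code; the `listing` string and the species names in it are made-up stand-ins for a directory listing received with Windows-style (CRLF) line endings, and the `"\r?\n"` pattern is only one possible way to handle both conventions, not necessarily the fix applied in the package.

```r
# Hypothetical listing with CRLF line endings (illustrative values only).
listing <- "ailuropoda_melanoleuca\r\ncricetulus_griseus\r\n"

# Splitting on "\n" alone leaves a trailing carriage return on every entry.
unix_split <- strsplit(listing, "\n")[[1]]

# Splitting on an optional "\r" followed by "\n" copes with both CRLF and LF.
portable_split <- strsplit(listing, "\r?\n")[[1]]

# Exact matching fails on the "\r"-suffixed names and works on the clean ones.
found_unix <- "cricetulus_griseus" %in% unix_split          # FALSE
found_portable <- "cricetulus_griseus" %in% portable_split  # TRUE

print(unix_split)      # escapes show the stray "\r" at the end of each name
print(portable_split)
```

A stray "\r" at the end of a species name would also explain oddly formatted error text, since printing a carriage return moves the cursor back to the start of the line.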
I deleted all files and the cache, and started a new session. I copied the last code from James (except for name & email), but a few hours later I still got the same error.... I now also noticed that the OP reported the same error. Apparently this is specific to the Windows platform...
Below also the output of traceback().
Another update; yes!, it (almost) worked...
Triggered by Martin's comment I downloaded the source code of AnnotationForge, and modified line 1498 slightly by adding "\r":
listing <- strsplit(listing, "\r\n")[[1]]
[In addition, I noticed that in line 1492 (here) the ENSEMBL database release is hard-coded/set to be version 80. Since the current version is v85 (see here at the bottom of FTP page), I changed that to 85 (getFastaSpeciesDirs <- function(release=85){), but I don't think this caused the error I experienced. Nevertheless, it may be good to have this set automatically to the latest version by using ftp://ftp.ensembl.org/pub/current_mysql?]
I then installed from source, and reran makeOrgPackageFromNCBI(). Building the OrgDb works fine now. :) However, installing it did not work yet... (any suggestions on that?) "Error : Invalid DESCRIPTION file. Malformed maintainer field."
So, the failure to build the OrgDb on a Windows machine seems to be solved by adding \r. Whether this change has any impact on Linux I don't know....
Here is the error:
Which seems pretty self explanatory, and a quick look at the argument positions
Should have allowed you to self-diagnose.
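The failing call is not shown here, so the following is only a sketch under an assumption: that the "Malformed maintainer field" error came from a value other than a "Name <email>" string ending up in the maintainer slot, for instance through unnamed arguments passed in the wrong order. The helper `is_valid_maintainer()` is hypothetical (it is not part of AnnotationForge) and uses a deliberately loose pattern to check the shape before the package is built.

```r
# Hypothetical helper (not an AnnotationForge function): TRUE when a string
# has the "Name <user@host>" shape expected in a DESCRIPTION Maintainer field.
is_valid_maintainer <- function(x) {
  grepl("^[^<>]+ <[^<>@ ]+@[^<>@ ]+>$", x)
}

# Naming every argument keeps each value in its intended slot,
# whatever order the arguments are written in.
args <- list(
  version    = "0.1",
  maintainer = "Guido Hooiveld <guido.hooiveld@wur.nl>",
  author     = "Guido Hooiveld <guido.hooiveld@wur.nl>",
  outputDir  = "."
)

ok_named   <- is_valid_maintainer(args$maintainer)  # TRUE
ok_swapped <- is_valid_maintainer(args$outputDir)   # FALSE: "." is not a maintainer
```

Checking `ok_named` before the build catches a misplaced value early, rather than only when the generated package fails to install.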
Thanks James, quite obvious indeed...
All working now:
@ Martin: I noticed you had already fixed the 'end-of-line' issue; would it also be an idea to change the ENSEMBL ftp address, so the latest release will always be used? See my comment above C: problem with makeOrgPackageFromNCBI when making an annotation package.