Question: problem with makeOrgPackageFromNCBI when making an annotation package
7 months ago by
wssdandan2009 wrote:

Hi Marc and Others,

I am learning to use makeOrgPackageFromNCBI() to make organism packages, but I keep running into problems. I would really appreciate some suggestions. Thank you!

Please see the three problems detailed below (it may be a lot, but I really hope to get some hints from you):

1> I ran these functions in R version 3.3.1 on Windows 7.

2> I have downloaded the files needed by the function (gene2pubmed.gz, gene2accession.gz, gene2refseq.gz, gene_info.gz, gene2go.gz, NCBI.sqlite, idmapping_selected.tab.gz). The code is shown below:

a. The first error: 'error in statement: no such table: altGO_date'.

library(AnnotationForge)
library(AnnotationDbi)
library(GenomeInfoDb)
library(biomaRt)

makeOrgPackageFromNCBI(
             version="0.1",
             maintainer="Guido Hooiveld <guido.hooiveld@wur.nl>",
             author="Guido Hooiveld <guido.hooiveld@wur.nl>",
             outputDir=".",
             tax_id='10029',
             genus="Cricetulus",
             species="griseus",
             NCBIFilesDir = ".",
             rebuildCache=F)

preparing data from NCBI ...
starting download for 5 data files
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Error in sqliteSendQuery(con, statement, bind.data) : 
  error in statement: no such table: altGO_date

 

b. The second error: when I set rebuildCache=T, I get 'Error in file(description = tmp, open = "r") : object 'tmp' not found'.

makeOrgPackageFromNCBI(
              version="0.1",
              maintainer="Guido Hooiveld <guido.hooiveld@wur.nl>",
              author="Guido Hooiveld <guido.hooiveld@wur.nl>",
              outputDir=".",
              tax_id='10029',
              genus="Cricetulus",
              species="griseus")
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 5 data files
getting data for gene2pubmed.gz
Error in file(description = tmp, open = "r") : object 'tmp' not found

c. The third error: when I tried another organism, a different problem occurred: 'Error in FUN(X[[i]], ...) : ?Please use 'available.species' to see viable species names or tax Ids'.

makeOrgPackageFromNCBI(version = "0.0.1", 
                                   author = "me", 
                                   maintainer = "me <me@mine.org>", 
                                   outputDir = ".", 
                                   tax_id = '7227', 
                                   genus = "Drosophila", 
                                   species = 'Drosophila melanogaster', 
                                   NCBIFilesDir = ".",
                                   rebuildCache=F)

or

makeOrgPackageFromNCBI(version = "0.0.1", 
                                   author = "me", 
                                   maintainer = "me <me@mine.org>", 
                                   outputDir = ".", 
                                   tax_id = '7227', 
                                   genus = "Drosophila", 
                                   species = 'melanogaster', 
                                   NCBIFilesDir = ".",
                                   rebuildCache=F)

Both of them show the same problem:

preparing data from NCBI ...
starting download for 6 data files
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene2unigene
extracting data for our organism from : gene2unigene
getting all data for our organism from : gene2unigene
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Loading required package: httr

Attaching package: ‘httr’
The following object is masked from ‘package:Biobase’:

    content

Loading required package: RCurl
Loading required package: bitops
Error in FUN(X[[i]], ...) : 
?Please use 'available.species' to see viable species names or tax Ids

 

Here is my session information:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.28.0         GenomeInfoDb_1.8.3     AnnotationForge_1.14.2 AnnotationDbi_1.34.4  
[5] IRanges_2.6.1          S4Vectors_0.10.2       Biobase_2.32.0         BiocGenerics_0.18.0   

loaded via a namespace (and not attached):
[1] rsconnect_0.4.3 DBI_0.4-1       tools_3.3.1     RCurl_1.95-4.8  RSQLite_1.0.0   bitops_1.0-6   
[7] XML_3.98-1.4   

 

Looking forward to your response~~

 

Thanks,

Shisheng

modified 10 weeks ago by theobroma2210 • written 7 months ago by wssdandan2009
7 months ago by
United States
James W. MacDonald wrote:

First, note that the author and maintainer should by all rights be you, not Guido Hooiveld, nor the fake person that I often use in examples (me, and me@mine.org).

Second, note that you have an aberrant Bioconductor installation. Those are not the correct packages for the R version you are using, and before we go any further you need to do

source("https://www.bioconductor.org/biocLite.R")
biocLite(ask = FALSE)

That should correct the issue. If the problems persist, please let us know.

written 7 months ago by James W. MacDonald
7 months ago by
wssdandan2009 wrote:

Hi James,

Thank you very much for your response. I did as you suggested, running RStudio as administrator. However, the same problem still occurred:

 

>source("https://bioconductor.org/biocLite.R")
Bioconductor version 3.3 (BiocInstaller 1.22.3), ?biocLite for help

 

>biocLite(ask = FALSE) # I also tried biocLite("AnnotationForge", ask = FALSE) to reinstall the package, but it didn't work.
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.3 (BiocInstaller 1.22.3), R 3.3.1 (2016-06-21).

 

>makeOrgPackageFromNCBI(version = "0.0.1",
                                    author = "wss",
                                    maintainer = "wss <wssdandan2009@outlook.com>",
                                    outputDir = ".",
                                    tax_id = '7227',
                                    genus = "Drosophila",
                                    species = "Drosophila melanogaster",
                                    NCBIFilesDir = ".",
                                    rebuildCache=F)

preparing data from NCBI ...
starting download for 6 data files
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene2unigene
extracting data for our organism from : gene2unigene
getting all data for our organism from : gene2unigene
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Loading required package: httr

Attaching package: ‘httr’

The following object is masked from ‘package:Biobase’:

    content

Loading required package: RCurl
Loading required package: bitops
Error in FUN(X[[i]], ...) :
?Please use 'available.species' to see viable species names or tax Ids

 

Shisheng

written 7 months ago by wssdandan2009

If you want to respond, use the ADD COMMENT button and type in the box that comes up. If you use the Add your answer box, it looks like you are answering your own question, which you are not doing.

As Marc pointed out, you can simply use the OrgDb on AnnotationHub.

> library(AnnotationHub)

> hub <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%
snapshotDate(): 2016-07-20
> query(hub, c("OrgDb","Drosophila melanogaster"))
AnnotationHub with 1 record
# snapshotDate(): 2016-07-20
# names(): AH49581
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Drosophila melanogaster
# $rdataclass: OrgDb
# $title: org.Dm.eg.db.sqlite
# $description: NCBI gene ID based annotations about Drosophila melanogaster
# $taxonomyid: 7227
# $genome: NCBI genomes
# $sourcetype: NCBI/ensembl
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl.org/p...
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: NCBI, Gene, Annotation
# retrieve record with 'object[["AH49581"]]'
> dm <- hub[["AH49581"]]
downloading from 'https://annotationhub.bioconductor.org/fetch/56311'
retrieving 1 resource
  |======================================================================| 100%

> dm
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: FLY_DB
| ORGANISM: Drosophila melanogaster
| SPECIES: Fly
| EGSOURCEDATE: 2015-Aug11
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 7227
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 20150808
| GOEGSOURCEDATE: 2015-Aug11
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Drosophila melanogaster)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/dm6
| GPSOURCEDATE: 2014-Dec12
| FBSOURCEDATE: -Jan08
| FBSOURCENAME: Flybase
| FBSOURCEURL: ftp://ftp.flybase.net/releases/current/precomputed_files/genes/
| ENSOURCEDATE: 2015-Jul16
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta

Please see: help('select') for usage information

As to your error, the help page for that function says

   genus: Single string indicating the genus.

  species: Single string indicating the species.

And the species in this situation is "melanogaster", not "Drosophila melanogaster", which is the genus and species.
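To illustrate the point about the two arguments, here is a minimal sketch in base R that splits a full binomial name into the separate genus and species strings the help page asks for. The helper name split_binomial is hypothetical and is not part of AnnotationForge.

```r
# Hypothetical helper (not part of AnnotationForge): split a binomial
# name into the separate 'genus' and 'species' strings.
split_binomial <- function(name) {
  parts <- strsplit(trimws(name), "[[:space:]]+")[[1]]
  stopifnot(length(parts) >= 2)
  # Everything after the first word is treated as the species epithet.
  list(genus = parts[1], species = paste(parts[-1], collapse = " "))
}

org <- split_binomial("Drosophila melanogaster")
org$genus    # "Drosophila"
org$species  # "melanogaster"
```

The two elements can then be passed as genus = org$genus and species = org$species.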

 

written 7 months ago by James W. MacDonald

Hi James,

Thank you for the pointer and the answer. This is my first time using the Bioconductor support site, so I will keep that in mind next time!

As for my question: first, I know I can get the Drosophila OrgDb from 'AnnotationHub'. I just want to try out and learn the function 'makeOrgPackageFromNCBI' to see whether it works on my computer ^_^

Second, I have read the help page for that function and tried using only 'melanogaster' for 'species', but it did not help; the same problem still occurred. What's more, I checked 'available.species':

> spec <- available.species()
> spec[which(as.numeric(spec$taxon)==7227),]
      taxon                 species
10836  7227 Drosophila melanogaster

As you can see, it reports the species as 'Drosophila melanogaster'. I even tried 'Drosophila_melanogaster' and 'Drosophilamelanogaster', but the problem is always there.

Third, 'Drosophila melanogaster' is just an example; it is not the organism I am interested in. As shown above, I posted three problems from trying different examples on my computer (none of them is my study organism). I just want to learn this awesome function for my future research.

Therefore, I really need your help to fix the three problems with 'makeOrgPackageFromNCBI' on my computer. Please do not advise me to give up on the function...

Thank you quite a lot^_^

 

Shisheng

written 7 months ago by wssdandan2009

Well, you don't say what this mysterious species is, but if I assume it's Cricetulus griseus, then

> makeOrgPackageFromNCBI("0.0.1", "me@mine.org", "me",tax_id="10029", genus="Cricetulus",species="griseus")
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 5 data files
getting data for gene2pubmed.gz
rebuilding the cache
Loading required package: RCurl
Loading required package: bitops
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Loading required package: biomaRt
Loading required package: httr

Attaching package:  httr

The following object is masked from  package:Biobase :

    content

Please be patient while we work out which organisms can be annotated
  with ensembl IDs.
making the OrgDb package ...
Loading required package: RSQLite
Loading required package: DBI
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating chromosomes table:
chromosomes table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
table metadata filled
Loading required package: GO.db

'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
'select()' returned many:1 mapping between keys and columns
Populating go_all table:
go_all table filled
Creating package in /misc/jmacdon/org.Cgriseus.eg.db
Now deleting temporary database file
complete!
[1] "org.Cgriseus.eg.sqlite"
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
[9] LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] GO.db_3.3.0            RSQLite_1.0.0          DBI_0.4-1            
[4] httr_1.2.1             biomaRt_2.28.0         RCurl_1.95-4.8       
[7] bitops_1.0-6           AnnotationForge_1.14.2 AnnotationDbi_1.34.4 
[10] IRanges_2.6.1          S4Vectors_0.10.2       Biobase_2.32.0       
[13] BiocGenerics_0.18.0   

loaded via a namespace (and not attached):
[1] XML_3.98-1.4       GenomeInfoDb_1.8.3 R6_2.1.2           tools_3.3.0      
[5] compiler_3.3.0   
> library(AnnotationHub)

Attaching package:  AnnotationHub

The following object is masked from  package:Biobase :

    cache

OR, as Marc already pointed out, there are literally (yes, literally!) thousands of species in the AnnotationHub, this one being represented twice.

> hub <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%
snapshotDate(): 2016-07-20
> grep(hub, c("OrgDb","Cricetulus griseus"))
Error in as.character.default(pattern) :
  no method for coercing this S4 class to a vector
> query(hub, c("OrgDb","Cricetulus griseus"))
AnnotationHub with 2 records
# snapshotDate(): 2016-07-20
# $dataprovider: NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Cricetulus griseus
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH12820"]]'

            title                          
  AH12820 | org.Cricetulus_griseus.eg.sqlite
  AH48061 | org.Cricetulus_griseus.eg.sqlite
>
written 7 months ago by James W. MacDonald

Hi James,

How strange. So it works on your computer but not on mine. Could you help me check whether 'makeOrgPackageFromNCBI' can build anything for the organism I actually study, 'Mycoplasma hyopneumoniae 168-L', a very rare species?

> spec <- available.species()
> spec[which(as.numeric(spec$taxon)==1116211),]
          taxon                        species
1031039 1116211 Mycoplasma hyopneumoniae 168-L

I have checked the AnnotationHub package, and it shows no records:

> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2016-07-20
> query(hub, c("OrgDb","Mycoplasma hyopneumoniae 168-L"))
AnnotationHub with 0 records
# snapshotDate(): 2016-07-20

Many thanks,

 

Shisheng

written 7 months ago by wssdandan2009

You won't be able to build an OrgDb package for a species that isn't in NCBI's databases:

zcat gene2accession.gz | cut -f 1 | grep -w 1116211 | wc -l
0
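
For readers without zcat (for example on Windows), roughly the same check can be sketched in base R. The helper name count_tax_id and the toy file below are illustrative assumptions; the function reads in chunks because the real gene2accession.gz is large.

```r
# Hypothetical helper: count the rows of a gzipped, tab-delimited NCBI file
# whose first column equals a given tax ID (pass the ID as a string).
count_tax_id <- function(path, tax_id, chunk_size = 100000L) {
  tax_id <- as.character(tax_id)
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    first_field <- sub("\t.*$", "", lines)  # keep only the first column
    n <- n + sum(first_field == tax_id)
  }
  n
}

# Toy stand-in for gene2accession.gz (made-up rows, for illustration only).
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wt")
writeLines(c("#tax_id\tGeneID", "10029\t1", "7227\t2", "10029\t3"), con)
close(con)

count_tax_id(tmp, "10029")    # 2 rows in the toy file
count_tax_id(tmp, "1116211")  # 0 rows, as in the zcat check above
```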

 

 

written 7 months ago by James W. MacDonald

Hi James,

For the AnnotationHub package, how do I choose which record to use? For example:

> query(hub, c("OrgDb","Solanum lycopersicum"))
AnnotationHub with 2 records
# snapshotDate(): 2016-07-20 
# $dataprovider: NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Solanum lycopersicum
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, tags, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH13359"]]' 

            title                             
  AH13359 | org.Solanum_lycopersicum.eg.sqlite
  AH48047 | org.Solanum_lycopersicum.eg.sqlite

It shows me two results, "AH13359" and "AH48047". Does this mean "Solanum lycopersicum" has two sub-taxonomies? How can I tell the two apart, and which one should I choose?

 

Thanks a lot,

Shisheng

written 7 months ago by wssdandan2009

Hi James,

Sorry to come back to this rather old thread, but I got stuck when I tried to reproduce your run of the makeOrgPackageFromNCBI() function. See below. FYI: I tried to build an OrgDb for Chinese hamster myself rather than use the AnnotationHub because I would like to see the differences between a 'fresh' OrgDb and a slightly dated version (prompted by the thread "Org.db: why a supposed unique key (ID) has multiple entries?").

I got this error:

Error in FUN(X[[i]], ...) :
  1 unknown species: ‘Ailuropoda melanoleuca
’ Please use 'available.species' to see viable species names or tax Ids
>

First of all, I don't know why 'Ailuropoda' is returned, since I never supplied it as an argument.

Anyway, I checked for Chinese hamster using available.species() and found two entries with the same TaxID:

> spec <- available.species()
> spec[grepl('griseus',spec$species),]
        taxon                             species
<<snip>>
13907     10029                                                Cricetulus barabensis griseus
13908     10029                                                           Cricetulus griseus
<<snip>>
>

After viewing the source code of GenomeInfoDb I noticed the error message is triggered by calling the internal function .getTaxonomyId(). When I manually run this function I noticed this goes wrong because an "NA" is returned, which then results in printing of the error message.

> data(speciesMap, package="GenomeInfoDb")
> species="griseus"
>     species <- gsub(" {2,}", " ", species)
>     species <- gsub(",", " ", species, fixed=TRUE)
>     idx <- match(species, speciesMap$species)
> idx
[1] NA
>

As far as I can understand this is caused by the fact that the taxID 10029 thus matches with 2 descriptors/synonyms (but maybe I am completely wrong!)
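
One thing worth noting about the manual test above: match() compares whole strings exactly, so the bare epithet "griseus" cannot match an entry such as "Cricetulus griseus", regardless of how many names share a tax ID. A minimal sketch with a toy table follows. The data frame toy_map is an assumption standing in for speciesMap, not the real GenomeInfoDb data.

```r
# Toy stand-in for speciesMap (illustrative rows only).
toy_map <- data.frame(
  taxon   = c(10029L, 10029L, 7227L),
  species = c("Cricetulus barabensis griseus",
              "Cricetulus griseus",
              "Drosophila melanogaster"),
  stringsAsFactors = FALSE
)

# The bare epithet is not an exact match for any entry.
idx_epithet <- match("griseus", toy_map$species)
idx_epithet   # NA

# The full binomial name matches, even though taxon 10029 appears twice.
idx_full <- match("Cricetulus griseus", toy_map$species)
idx_full      # 2
```

Under this reading, the NA in the manual test comes from passing only the epithet, not from the duplicated tax ID.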

Any suggestions on how to get it working? :)

Thanks,

Guido

>
> library(AnnotationForge)
> library(AnnotationDbi)
> library(GenomeInfoDb)
>
> makeOrgPackageFromNCBI("0.0.1", "guido.hooiveld@wur.nl", "Guido Hooiveld",tax_id="10029", genus="Cricetulus",species="griseus", rebuildCache=FALSE)
preparing data from NCBI ...
starting download for 5 data files
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Loading required package: biomaRt
Loading required package: httr

Attaching package: ‘httr’

The following object is masked from ‘package:Biobase’:

    content

Loading required package: RCurl
Loading required package: bitops
Error in FUN(X[[i]], ...) :
  1 unknown species: ‘Ailuropoda melanoleuca
’ Please use 'available.species' to see viable species names or tax Ids
>

>    dir()
[1] "gene_info.gz"              "gene2accession.gz"        
[3] "gene2go.gz"                "gene2pubmed.gz"           
[5] "gene2refseq.gz"            "idmapping_selected.tab.gz"
[7] "NCBI.sqlite"              
>

modified 6 months ago • written 6 months ago by Guido Hooiveld
> sessionInfo()
R version 3.3.1 Patched (2016-06-28 r70853)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] RCurl_1.95-4.8         bitops_1.0-6           httr_1.2.1            
 [4] biomaRt_2.28.0         GenomeInfoDb_1.8.7     AnnotationForge_1.14.2
 [7] AnnotationDbi_1.34.4   IRanges_2.6.1          S4Vectors_0.10.3      
[10] Biobase_2.32.0         BiocGenerics_0.18.0   

loaded via a namespace (and not attached):
[1] XML_3.98-1.4  R6_2.1.3      DBI_0.5       RSQLite_1.0.0 tools_3.3.1  
>


(Posted separately because of the character limit on the other post.)

written 6 months ago by Guido Hooiveld

Update: in a fresh R session the error above persists, even when running makeOrgPackageFromNCBI() without explicitly specifying genus and species. The strange error about ‘Ailuropoda melanoleuca’ is also still returned.

 

> library(AnnotationForge)
> library(AnnotationDbi)
> library(GenomeInfoDb)
> makeOrgPackageFromNCBI("0.0.1", "guido.hooiveld@wur.nl", "Guido Hooiveld",tax_id="10029", rebuildCache=FALSE)
preparing data from NCBI ...
starting download for 5 data files
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Loading required package: biomaRt
Loading required package: httr

Attaching package: ‘httr’

The following object is masked from ‘package:Biobase’:

    content

Loading required package: RCurl
Loading required package: bitops
Error in FUN(X[[i]], ...) :
  1 unknown species: ‘Ailuropoda melanoleuca
’ Please use 'available.species' to see viable species names or tax Ids
>

 

 

written 6 months ago by Guido Hooiveld
> makeOrgPackageFromNCBI("0.0.1", "me","me@mine.org", ".", "10029", "Cricetulus","griseus")
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 5 data files
getting data for gene2pubmed.gz
rebuilding the cache
Loading required package: RCurl
Loading required package: bitops
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Loading required package: biomaRt
Loading required package: httr

Attaching package: ‘httr’

The following object is masked from ‘package:Biobase’:

    content

Please be patient while we work out which organisms can be annotated
  with ensembl IDs.
making the OrgDb package ...
Loading required package: RSQLite
Loading required package: DBI
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating chromosomes table:
chromosomes table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
table metadata filled
Loading required package: GO.db

'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
'select()' returned many:1 mapping between keys and columns
Populating go_all table:
go_all table filled
Creating package in ./org.Cgriseus.eg.db
Now deleting temporary database file
complete!
[1] "org.Cgriseus.eg.sqlite"
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] GO.db_3.3.0            RSQLite_1.0.0          DBI_0.4-1
 [4] httr_1.2.1             biomaRt_2.28.0         RCurl_1.95-4.8
 [7] bitops_1.0-6           AnnotationForge_1.14.2 AnnotationDbi_1.34.4
[10] IRanges_2.6.1          S4Vectors_0.10.2       Biobase_2.32.0
[13] BiocGenerics_0.18.0

loaded via a namespace (and not attached):
[1] XML_3.98-1.4       GenomeInfoDb_1.8.3 R6_2.1.2           tools_3.3.0

 

written 6 months ago by James W. MacDonald

I think on Windows this regular expression https://github.com/Bioconductor-mirror/AnnotationForge/blob/master/R/NCBI_ftp.R#L1498 doesn't strip the '\r' from the end-of-line, so the species has a trailing '\r' -- hence the odd formatting of the 'unknown species' error. Maybe there are other issues, and traceback() after the error would help.
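
The effect described here can be reproduced in base R with a made-up two-line listing. This is only a sketch: the listing string and the known vector are illustrative assumptions, not the actual FTP response or the actual lookup table.

```r
# Made-up listing with Windows-style CRLF line endings.
listing <- "ailuropoda_melanoleuca\r\nbos_taurus\r\n"

# Splitting on "\n" alone leaves a trailing "\r" on every element.
species_raw <- strsplit(listing, "\n")[[1]]
grepl("\r$", species_raw)          # TRUE TRUE

# Exact lookups then fail, because "name\r" is not equal to "name".
known <- c("ailuropoda_melanoleuca", "bos_taurus")
match(species_raw, known)          # NA NA

# Stripping the carriage return restores the match.
species_clean <- sub("\r$", "", species_raw)
match(species_clean, known)        # 1 2
```

A name ending in a carriage return would also explain why the error message appears to break in the middle of the quoted species name.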

written 6 months ago by Martin Morgan

I deleted all files and the cache, and started a new session. I copied the last code from James (except for name and email), but a few hours later I still got the same error. I have now also noticed that the OP reported the same error. This seems to be specific to the Windows platform.

Below also the output of traceback().

> makeOrgPackageFromNCBI("0.0.1", "Guido Hooiveld","guido.hooiveld@wur.nl", ".", "10029", "Cricetulus","griseus")

Error in FUN(X[[i]], ...) :
  1 unknown species: ‘Ailuropoda melanoleuca
’ Please use 'available.species' to see viable species names or tax Ids
> traceback()
16: stop(sum(is.na(idx)), " unknown species: ", paste(sQuote(head(species[is.na(idx)])),
        "Please use 'available.species' to see viable species names or tax Ids",
        collapse = " "))
15: FUN(X[[i]], ...)
14: lapply(species, .getTaxonomyId)
13: lapply(species, .getTaxonomyId)
12: unlist(lapply(species, .getTaxonomyId))
11: FUN(X[[i]], ...)
10: lapply(specNames, GenomeInfoDb:::.taxonomyId)
9: lapply(specNames, GenomeInfoDb:::.taxonomyId)
8: unlist(lapply(specNames, GenomeInfoDb:::.taxonomyId))
7: getFastaSpeciesDirs()
6: available.FastaEnsemblSpecies()
5: available.ensembl.datasets()
4: tax_id %in% names(available.ensembl.datasets())
3: prepareDataFromNCBI(tax_id, NCBIFilesDir, outputDir, rebuildCache,
       verbose)
2: NEW_makeOrgPackageFromNCBI(version, maintainer, author, outputDir,
       tax_id, genus, species, NCBIFilesDir, databaseOnly, rebuildCache = rebuildCache,
       verbose = verbose)
1: makeOrgPackageFromNCBI("0.0.1", "Guido Hooiveld", "guido.hooiveld@wur.nl",
       ".", "10029", "Cricetulus", "griseus")
>


 

modified 6 months ago • written 6 months ago by Guido Hooiveld

Another update; yes!, it (almost) worked...

Triggered by Martin's comment I downloaded the source code of AnnotationForge, and modified line 1498 slightly by adding "\r":

listing<- strsplit(listing, "\r\n")[[1]]

(was: listing<- strsplit(listing, "\n")[[1]])

[In addition, I noticed that in line 1492 the Ensembl release is hard-coded to version 80. Since the current release is 85 (listed at the bottom of the Ensembl FTP page), I changed that to 85 ( getFastaSpeciesDirs <- function(release=85){ ), but I don't think this caused the error I experienced. Nevertheless, it may be good to have this set automatically to the latest version by using ftp://ftp.ensembl.org/pub/current_mysql ?]

 

I then installed from source, and reran makeOrgPackageFromNCBI(). Building the OrgDb works fine now. :) However, installing it did not work yet... (any suggestions on that? "Error : Invalid DESCRIPTION file. Malformed maintainer field.")

So, the failure to build the OrgDb on a Windows machine seems to be solved by adding \r. Whether this change has any impact on Linux I don't know...
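
On the Linux question, a small sketch may help. It uses a made-up two-entry listing (not the real Ensembl FTP output) and a hypothetical helper, split_lines(), to compare the two split patterns against both line-ending styles. The regular expression "\r?\n" is only one possible way to accept both; it is an assumption here, not necessarily what AnnotationForge uses.

```r
# Made-up listings for illustration: one with Windows-style CRLF line
# endings, one with Unix-style LF line endings.
listing_crlf <- "ailuropoda_melanoleuca\r\nanas_platyrhynchos\r\n"
listing_lf   <- "ailuropoda_melanoleuca\nanas_platyrhynchos\n"

# Splitting CRLF text on "\n" leaves a trailing "\r" on every entry,
# which is how a name like 'Ailuropoda melanoleuca\r' can arise.
strsplit(listing_crlf, "\n")[[1]]

# Splitting LF-only text on "\r\n" finds no separator at all,
# so the whole listing comes back as a single string.
strsplit(listing_lf, "\r\n")[[1]]

# Hypothetical helper: treat the carriage return as optional so that
# both line-ending styles give the same result.
split_lines <- function(x) strsplit(x, "\r?\n")[[1]]

split_lines(listing_crlf)
split_lines(listing_lf)
```

If the listing can arrive with either line ending, a pattern that makes the "\r" optional avoids having to choose between the two.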

 

>
> makeOrgPackageFromNCBI("0.0.1", "Guido Hooiveld","guido.hooiveld@wur.nl", ".", "10029", "Cricetulus","griseus")
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 5 data files
<<snip>>
processing GO data
Loading required package: RCurl
Loading required package: bitops
Loading required package: biomaRt
Loading required package: httr

Attaching package: ‘httr’

The following object is masked from ‘package:Biobase’:

    content

Please be patient while we work out which organisms can be annotated
  with ensembl IDs.
making the OrgDb package ...
Loading required package: RSQLite
Loading required package: DBI
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating chromosomes table:
chromosomes table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
table metadata filled
Loading required package: GO.db

'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
'select()' returned many:1 mapping between keys and columns
Populating go_all table:
go_all table filled
Creating package in ./org.Cgriseus.eg.db
Now deleting temporary database file
complete!
[1] "org.Cgriseus.eg.sqlite"
Warning message:
In file.remove(dbFileName) :
  cannot remove file './org.Cgriseus.eg.sqlite', reason 'Permission denied'
>

> install.packages(pkgs="org.Cgriseus.eg.db", repos = NULL, type="source")
* installing *source* package 'org.Cgriseus.eg.db' ...
Error : Invalid DESCRIPTION file

Malformed maintainer field.

See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.

ERROR: installing package DESCRIPTION failed for package 'org.Cgriseus.eg.db'
* removing 'C:/Program Files/R/R-3.3.1patched/library/org.Cgriseus.eg.db'
Warning messages:
1: running command '"C:/PROGRA~1/R/R-33~1.1PA/bin/x64/R" CMD INSTALL -l "C:\Program Files\R\R-3.3.1patched\library" "org.Cgriseus.eg.db"' had status 1
2: In install.packages(pkgs = "org.Cgriseus.eg.db", repos = NULL, type = "source") :
  installation of package ‘org.Cgriseus.eg.db’ had non-zero exit status

 

 

Modified 6 months ago • written 6 months ago by Guido Hooiveld

Here is the error:

Error : Invalid DESCRIPTION file

Malformed maintainer field.

See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.

Which seems pretty self explanatory, and a quick look at the argument positions

> args(makeOrgPackageFromNCBI)
function (version, maintainer, author, outputDir = getwd(), tax_id,
    genus = NULL, species = NULL, NCBIFilesDir = getwd(), databaseOnly = FALSE,
    useDeprecatedStyle = FALSE, rebuildCache = TRUE, verbose = TRUE)

Should have allowed you to self-diagnose.
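
To spell that out, here is a minimal sketch of how R matches the unnamed arguments from the failing call. show_matching() is a hypothetical stand-in that copies only the leading arguments of makeOrgPackageFromNCBI() and does nothing but report the matching, and looks_like_maintainer() is a rough, assumed approximation of the "Name <email>" form, not the actual check R applies to DESCRIPTION files.

```r
# Hypothetical stand-in with the same leading arguments as
# makeOrgPackageFromNCBI(); it only reports how values are matched.
show_matching <- function(version, maintainer, author, outputDir = getwd(),
                          tax_id, genus = NULL, species = NULL) {
  as.list(match.call())[-1]
}

# The failing call from above, with every argument passed by position.
matched <- show_matching("0.0.1", "Guido Hooiveld", "guido.hooiveld@wur.nl",
                         ".", "10029", "Cricetulus", "griseus")

matched$maintainer  # the bare name ends up as maintainer
matched$author      # the e-mail address ends up as author

# Rough approximation (an assumption) of the expected maintainer form:
# a name followed by an address in angle brackets.
looks_like_maintainer <- function(x) {
  grepl("^[^<>]+ <[^<>@]+@[^<>]+>$", x)
}

looks_like_maintainer(matched$maintainer)
looks_like_maintainer("Guido Hooiveld <guido.hooiveld@wur.nl>")
```

Naming the arguments (version=, maintainer=, author=) removes the dependence on position.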

Written 6 months ago by James W. MacDonald

Thanks James, quite obvious indeed...

All working now:

> makeOrgPackageFromNCBI(
+ version="0.0.1",
+ author = "Guido Hooiveld <guido.hooiveld@wur.nl>",
+ maintainer = " Guido Hooiveld <guido.hooiveld@wur.nl>",
+ ".", tax_id = "10029", genus = "Cricetulus", species= "griseus")
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...

<<snip>>

>
> install.packages(pkgs="./org.Cgriseus.eg.db", repos=NULL, type="source")
* installing *source* package 'org.Cgriseus.eg.db' ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (org.Cgriseus.eg.db)
>

 

@ Martin: I noticed you had already fixed the 'end-of-line' issue; would it also be an idea to change the ENSEMBL ftp address, so the latest release will always be used? See my comment above.

 

Modified 6 months ago • written 6 months ago by Guido Hooiveld
Marc Carlson (United States) wrote, 7 months ago:

Hi Shisheng,

Can I ask why it is that you want to call this function instead of just using the OrgDb objects that have been pre-built and loaded into AnnotationHub?

 

 Marc

Written 7 months ago by Marc Carlson

Hi Marc,

As I replied to James above, 'Drosophila melanogaster' is just an example and not my target organism. The three problems I posted above came from trying different examples on my computer (none of them is my study organism); I just want to learn this awesome function for my future research.

Just for learning^_^

 

Shisheng

Written 7 months ago by wssdandan2009
theobroma22 wrote, 10 weeks ago:

Hi guys,

I'm having problems similar to those in this post, and I'm also using a Windows computer. I haven't hacked any code like Guido did, since this post was active about three months ago and the package has already been updated, in particular for the '\r' issue in the species argument of the makeOrgPackageFromNCBI function.

My problem is that the gene2accession file download doesn't complete. R shows "Not Responding" for a while and then times out, and the error tells me to try again later! The gene2accession file icon does appear in the directory, but the file size ends up at 0 KB.

The NCBI.sqlite (429,179 KB) and gene2pubmed (31,486 KB) files do download to the working directory.

I tried making the org package from NCBI about three dozen times, only to get the same result each time. Based on the error, I initially thought it was a connectivity issue, but after reading this post it may be something more serious than the error suggests.

     makeOrgPackageFromNCBI(version = "0.1",
                       author = "Franklin Johnson <franklin.johnson@amway.com>",
                       maintainer = "Franklin Johnson <franklin.johnson@amway.com>",
                       outputDir = getwd(),
                       tax_id = "3749",
                       genus = "Malus",
                       species = "",
                       NCBIFilesDir = getwd(),
                       databaseOnly = FALSE,
                       rebuildCache = TRUE,
                       verbose = TRUE)

Even if I use tax_id= "3750", genus = "Malus" and species = "domestica" I have the same issue.

Any help on this would be greatly appreciated!! 

Thanks!

Franklin

 

  

Written 10 weeks ago by theobroma22

It is not ideal to add onto a months-old thread. Instead you should consider starting a new thread, even if your question is related. In addition, the box below, labeled 'Add your answer' is not intended for you to add another question, nor to add a comment; as the title suggests, it's for adding answers.

To answer your question, there is an already built OrgDb for Malus domestica on the AnnotationHub that you should use rather than trying to build your own:

> library(AnnotationHub)
<snip>
> hub <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%
snapshotDate(): 2016-10-11
> query(hub, c("Malus domestica","OrgDb"))
AnnotationHub with 1 record
# snapshotDate(): 2016-10-11
# names(): AH52107
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Malus domestica
# $rdataclass: OrgDb
# $title: org.Malus_domestica.eg.sqlite
# $description: NCBI gene ID based annotations about Malus domestica
# $taxonomyid: 3750
# $genome: NCBI genomes
# $sourcetype: NCBI/UniProt
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.uniprot.org/p...
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: c("NCBI", "Gene", "Annotation")
# retrieve record with 'object[["AH52107"]]'
> malus <- hub[["AH52107"]]
downloading from  https://annotationhub.bioconductor.org/fetch/58845
retrieving 1 resource
  |======================================================================| 100%
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: ‘Biobase’

The following object is masked from ‘package:AnnotationHub’:

    cache

Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

> malus
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Malus domestica
| SPECIES: Malus domestica
| CENTRALID: GID
| Taxonomy ID: 3750
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information

 

Written 10 weeks ago by James W. MacDonald
theobroma22 wrote, 10 weeks ago:

Thanks James. I considered starting a new thread, but thought whoever responded would just paste the link to this thread as an answer, so I posted here to 'request more detail.' Anyhow, yes, I used AnnotationHub, but when I got to makeTxDbFromGRanges, my object

md_blast2go <- md[["AH13353"]] is not a GRanges object. I tried to convert md_blast2go into a GRanges object using the GenomicRanges package, but this type of object cannot be converted to a GRanges object.

Are you able to run makeTxDbFromGRanges with your 'malus' object? As it stands, I'm only as far as you are in AnnotationHub: I have looked at the more than 56K Entrez IDs, proteins, and other available info using the 'select' function with columns and keytypes.

Thanks,

Franklin 

Written 10 weeks ago by theobroma22