Question

Error during makeOrgPackageFromNCBI

oliver.tills • 0

@olivertills-7274

Last seen 9.8 years ago

European Union

Hi,

I am trying to run makeOrgPackageFromNCBI as follows:

> makeOrgPackageFromNCBI(version="0.1",
+                         author="otills <*******@*******>",
+                         maintainer="otills <*******@*******>",
+                         outputDir=".",
+                         tax_id="582868",
+                         genus="Mollusca",
+                         species="Radix balthica")

However, after approximately 30 minutes I get this error:

Getting data for gene2pubmed.gz
Loading required package: RCurl
Loading required package: bitops
extracting only data for our organism from : gene2pubmed
Getting data for gene2accession.gz
Error in sqliteSendQuery(con, statement, bind.data) : 
  error in statement: duplicate column name: NA

I've tried re-running (including on different machines), but I get this error consistently. The files (gene2accession.gz, gene2pubmed.gz and NCBI.sqlite) are always the same size at the time of the crash.

Can anyone suggest what the problem might be?

Oli

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RCurl_1.95-4.5        bitops_1.0-6          AnnotationForge_1.8.1 org.Hs.eg.db_3.0.0    RSQLite_1.0.0        
 [6] DBI_0.3.1             AnnotationDbi_1.28.1  GenomeInfoDb_1.2.4    IRanges_2.0.1         S4Vectors_0.4.0      
[11] Biobase_2.26.0        BiocGenerics_0.12.1   BiocInstaller_1.16.1 

loaded via a namespace (and not attached):
[1] tools_3.1.2
>

annotation annotationforge annotationdbi makeorgpackagefromncbi • 2.7k views

ADD COMMENT • link 11.1 years ago • updated 11.0 years ago oliver.tills • 0

score 0 · Answer 1 · 2015-01-21

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 17 hours ago

United States

The problem is that your species isn't represented in the gene2accession.gz file:

jmacdon$ zcat /data/tmp2/gene2accession.gz | awk '{if($1 == 582868) print $0}' | wc -l
0

So you won't be able to create the package using this pipeline.
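For anyone who would rather run the same check from R, here is a minimal sketch. The helper name count_tax_id_rows and the small demo file are made up for illustration; the only things taken from the files discussed here are that they are gzipped, tab-delimited, and carry the tax_id in the first column.

```r
# Hypothetical helper (not part of AnnotationForge): count the rows of a
# gzipped, tab-delimited file whose first field equals the given tax_id.
count_tax_id_rows <- function(path, tax_id, chunk_size = 100000L) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  n <- 0L
  repeat {
    # Read in chunks so a multi-gigabyte file never sits in memory at once.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    # Keep only the text before the first tab, i.e. the tax_id field.
    first_field <- sub("\t.*$", "", lines)
    n <- n + sum(first_field == as.character(tax_id))
  }
  n
}

# Tiny stand-in file so the sketch runs without downloading anything.
demo_path <- tempfile(fileext = ".gz")
demo_con <- gzfile(demo_path, open = "wt")
writeLines(c("#tax_id\tGeneID\tstatus",
             "7227\t30970\tREVIEWED",
             "7227\t30971\tREVIEWED",
             "9606\t1\tREVIEWED"), demo_con)
close(demo_con)

count_tax_id_rows(demo_path, "7227")    # 2
count_tax_id_rows(demo_path, "582868")  # 0
```

Passing the tax_id as a string, as in the makeOrgPackageFromNCBI call, avoids any surprises from number formatting. Against the real gene2accession.gz you would pass the path of the downloaded file instead of demo_path.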

ADD COMMENT • link 11.1 years ago James W. MacDonald 68k

score 0 · Answer 2 · 2015-01-22

oliver.tills • 0

@olivertills-7274

Last seen 9.8 years ago

European Union

James, thanks for your reply. I don't fully understand how this pipeline works, but I ran the same call for Drosophila melanogaster and got exactly the same error, which surprised me. Can this pipeline not be used for Drosophila either?

> makeOrgPackageFromNCBI(version="0.1",
+                         author="otills <oliver.tills@plymouth.ac.uk>",
+                         maintainer="otills <oliver.tills@plymouth.ac.uk>",
+                         outputDir=".",
+                         tax_id="7227",
+                         genus="drosophila",
+                         species="drosophila_melanogaster",
+                         NCBIFilesDir=".")
If this is the 1st time you have run this function, it may take a long time (over an hour) to download needed files and assemble a 12 GB cache databse in the NCBIFilesDir directory.  Subsequent calls to this function should be faster (seconds).  The cache will try to rebuild once per day.
Getting data for gene2pubmed.gz
extracting only data for our organism from : gene2pubmed
Getting data for gene2accession.gz
Error in sqliteSendQuery(con, statement, bind.data) : 
  error in statement: duplicate column name: NA

ADD COMMENT • link 11.1 years ago oliver.tills • 0

Hi Oliver,

It turns out that NCBI has changed the format of the gene2accession and gene2refseq files by adding three columns. The code that parses these files still expects the old layout with three fewer columns, so its column list no longer matches the data, and you get the error you see. After making some small changes to the underlying code in the devel branch, it works:

> makeOrgPackageFromNCBI(version = "0.0.1", author = "me", maintainer = "me <me@mine.org>", outputDir = ".", tax_id = "7227", genus = "Drosophila", species = "melanogaster", NCBIFilesDir = ".")
If this is the 1st time you have run this function, it may take a long time (over an hour) to download needed files and assemble a 12 GB cache databse in the NCBIFilesDir directory.  Subsequent calls to this function should be faster (seconds).  The cache will try to rebuild once per day.
Getting data for gene2pubmed.gz
extracting only data for our organism from : gene2pubmed
Getting data for gene2accession.gz
extracting only data for our organism from : gene2accession
Getting data for gene2refseq.gz
extracting only data for our organism from : gene2refseq
Getting data for gene2unigene
Loading required package: RCurl
Loading required package: bitops
extracting only data for our organism from : gene2unigene
getting all data for our organism from : gene2unigene
Getting data for gene_info.gz
extracting only data for our organism from : gene_info
Getting data for gene2go.gz
extracting only data for our organism from : gene2go
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating chromosomes table:
chromosomes table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
Populating unigene table:
unigene table filled
table metadata filled
Loading required package: GO.db

Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_all table:
go_all table filled
Creating package in ./org.Dmelanogaster.eg.db
Now deleting temporary database file
[1] "org.Dmelanogaster.eg.sqlite"

> install.packages("org.Dmelanogaster.eg.db/", repos=NULL)
* installing *source* package ‘org.Dmelanogaster.eg.db’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (org.Dmelanogaster.eg.db)

Also note that the genus is "Drosophila" and the species is "melanogaster" (not "drosophila_melanogaster"). The maintainer has to be of the form "me <me@mine.org>", angle brackets included, or the package won't install correctly.
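If you want to catch these argument problems before waiting on a long run, a small pre-flight check along these lines might help. This is only a sketch: check_org_args and guess_pkg_name are hypothetical helpers, the rules are my reading of the advice above and are deliberately stricter than anything AnnotationForge itself enforces, and the package naming rule is inferred from the single org.Dmelanogaster.eg.db example in the output above.

```r
# Hypothetical pre-flight check; not part of AnnotationForge.
# Returns a character vector of problems (empty when everything looks fine).
check_org_args <- function(genus, species, maintainer) {
  problems <- character(0)
  # Maintainer should look like "Name <user@host>".
  if (!grepl("^[^<>]+ <[^<>@[:space:]]+@[^<>@[:space:]]+>$", maintainer)) {
    problems <- c(problems, "maintainer should look like 'Name <user@host>'")
  }
  # ASSUMPTION: genus is a single capitalised word, e.g. "Drosophila".
  if (!grepl("^[A-Z][a-z]+$", genus)) {
    problems <- c(problems, "genus should be one capitalised word")
  }
  # ASSUMPTION: species is a single lower-case word without the genus.
  if (!grepl("^[a-z]+$", species)) {
    problems <- c(problems, "species should be one lower-case word without the genus")
  }
  problems
}

# ASSUMPTION: naming rule guessed from "org.Dmelanogaster.eg.db" above.
guess_pkg_name <- function(genus, species) {
  paste0("org.", substr(genus, 1, 1), species, ".eg.db")
}

check_org_args("Drosophila", "melanogaster", "me <me@mine.org>")  # no problems
check_org_args("drosophila", "drosophila_melanogaster", "me")     # three problems
guess_pkg_name("Drosophila", "melanogaster")                      # "org.Dmelanogaster.eg.db"
```

Real species epithets can contain hyphens or other characters, so loosen the patterns if they reject a name you know to be valid.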

I'll send a patch to Marc Carlson, who is the maintainer for AnnotationForge, and hopefully we will get an updated version pushed in the next day or so.

In the meantime, if you are impatient, you can download the source package and change the function .primaryFiles() in NCBI_ftp.R to be like this:

.primaryFiles <- function(){
    list(
         "gene2pubmed.gz" = c("tax_id","gene_id", "pubmed_id"),
         "gene2accession.gz" = c("tax_id","gene_id","status","rna_accession",
           "rna_gi","protein_accession","protein_gi","genomic_dna_accession",
           "genomic_dna_gi","genomic_start","genomic_end","orientation",
           "assembly","peptide_accession","peptide_gi","symbol"),
         ## This one might be needed later
         "gene2refseq.gz" = c("tax_id","gene_id","status","rna_accession",
           "rna_gi","protein_accession","protein_gi","genomic_dna_accession",
           "genomic_dna_gi","genomic_start","genomic_end","orientation",
           "assembly","peptide_accession","peptide_gi","symbol"),
         "gene2unigene" = c("gene_id","unigene_id"),
         "gene_info.gz" = c("tax_id","gene_id","symbol","locus_tag",
           "synonyms","dbXrefs","chromosome","map_location","description",
           "gene_type","nomenclature_symbol","nomenclature_name",
           "nomenclature_status","other_designations", "modification_date"),
         ##        "mim2gene.gz" = c("mim_id","gene_id","relation_type"),
         ##        "gene_refseq_uniprotkb_collab.gz" =
         ##         c("refseq_id","uniprot_id"),
         "gene2go.gz" = c("tax_id","gene_id","go_id","evidence",
           "go_qualifier", "go_description","pubmed_id","category")
         )
}

Save that file, and then you should be able to do

install.packages("AnnotationForge", repos = NULL, type = "source")

at an R prompt, where AnnotationForge is the modified source directory, located in your working directory.
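To illustrate how a column list that is three names short can end up as "duplicate column name: NA", here is a small base R sketch. It is not AnnotationForge's actual code, just one plausible way the mismatch plays out. The 16 names are copied from .primaryFiles() above; that the old list was those same names minus the last three is my assumption, and the demo line and the columns_match helper are made up.

```r
# The 16 names from .primaryFiles() above for gene2accession.gz.
new_cols <- c("tax_id", "gene_id", "status", "rna_accession", "rna_gi",
              "protein_accession", "protein_gi", "genomic_dna_accession",
              "genomic_dna_gi", "genomic_start", "genomic_end", "orientation",
              "assembly", "peptide_accession", "peptide_gi", "symbol")
# ASSUMPTION: the pre-change list was the same minus the last three names.
old_cols <- new_cols[1:13]

# Stand-in for one data line of the new file: 16 tab-separated fields.
line <- paste(rep("x", 16), collapse = "\t")
n_fields <- length(strsplit(line, "\t", fixed = TRUE)[[1]])

# Indexing past the end of a character vector yields NA.
padded <- old_cols[seq_len(n_fields)]
sum(is.na(padded))  # 3

# Pasting NA into SQL text turns it into the literal name "NA", three times.
sql <- paste0("CREATE TABLE demo (", paste(padded, "TEXT", collapse = ", "), ")")

# Hypothetical guard: TRUE only when the name list matches the file width.
columns_match <- function(line, expected) {
  length(strsplit(line, "\t", fixed = TRUE)[[1]]) == length(expected)
}
columns_match(line, old_cols)  # FALSE
columns_match(line, new_cols)  # TRUE
```

A check like columns_match, run on the first data line of a freshly downloaded file, would flag the next format change up front instead of failing inside SQLite.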

ADD REPLY • link 11.1 years ago James W. MacDonald 68k

score 0 · Answer 3 · 2015-01-23

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 9.5 years ago

United States

OK, I have verified that this patch is safe and have applied it to devel. It should be available via biocLite() within a day or two. I am also checking a patch for the release branch, which should be available in a similar time frame.

Thanks Jim!

Marc

ADD COMMENT • link 11.1 years ago Marc Carlson ★ 7.2k

score 0 · Answer 4 · 2015-02-03

oliver.tills • 0

@olivertills-7274

Last seen 9.8 years ago

European Union

Many thanks!

ADD COMMENT • link 11.0 years ago oliver.tills • 0