Hi,
I am trying to run makeOrgPackageFromNCBI, as follows -
> makeOrgPackageFromNCBI(version="0.1",
+     author="otills <*******@*******>",
+     maintainer="otills <*******@*******>",
+     outputDir=".",
+     tax_id="582868",
+     genus="Mollusca",
+     species="Radix balthica")
However, after approx 30 mins I get the error -
Getting data for gene2pubmed.gz
Loading required package: RCurl
Loading required package: bitops
extracting only data for our organism from : gene2pubmed
Getting data for gene2accession.gz
Error in sqliteSendQuery(con, statement, bind.data) :
  error in statement: duplicate column name: NA

I've tried re-running (including on different machines), but I get this error consistently. The files (gene2accession.gz, gene2pubmed.gz and NCBI.sqlite) are always the same size at the time of the crash.
Can anyone suggest what the problem might be?
Oli
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] RCurl_1.95-4.5        bitops_1.0-6          AnnotationForge_1.8.1 org.Hs.eg.db_3.0.0    RSQLite_1.0.0
 [6] DBI_0.3.1             AnnotationDbi_1.28.1  GenomeInfoDb_1.2.4    IRanges_2.0.1         S4Vectors_0.4.0
[11] Biobase_2.26.0        BiocGenerics_0.12.1   BiocInstaller_1.16.1

loaded via a namespace (and not attached):
[1] tools_3.1.2

Hi Oliver,
It turns out that NCBI has changed the format of the gene2accession and gene2refseq files by adding three additional columns. Since the code that parses these files expected three fewer columns, the result is as you see. After making some small changes in the underlying code in the Devel branch, I get things to work:
> makeOrgPackageFromNCBI(version = "0.0.1",
+     author = "me",
+     maintainer = "me <me@mine.org>",
+     outputDir = ".",
+     tax_id = "7227",
+     genus = "Drosophila",
+     species = "melanogaster",
+     NCBIFilesDir = ".")
If this is the 1st time you have run this function, it may take a long time (over an hour) to download needed files and assemble a 12 GB cache databse in the NCBIFilesDir directory.
Subsequent calls to this function should be faster (seconds).
The cache will try to rebuild once per day.
Getting data for gene2pubmed.gz
extracting only data for our organism from : gene2pubmed
Getting data for gene2accession.gz
extracting only data for our organism from : gene2accession
Getting data for gene2refseq.gz
extracting only data for our organism from : gene2refseq
Getting data for gene2unigene
Loading required package: RCurl
Loading required package: bitops
extracting only data for our organism from : gene2unigene
getting all data for our organism from : gene2unigene
Getting data for gene_info.gz
extracting only data for our organism from : gene_info
Getting data for gene2go.gz
extracting only data for our organism from : gene2go
Populating genes table: genes table filled
Populating pubmed table: pubmed table filled
Populating chromosomes table: chromosomes table filled
Populating gene_info table: gene_info table filled
Populating entrez_genes table: entrez_genes table filled
Populating alias table: alias table filled
Populating refseq table: refseq table filled
Populating accessions table: accessions table filled
Populating go table: go table filled
Populating unigene table: unigene table filled
table metadata filled
Loading required package: GO.db
Dropping GO IDs that are too new for the current GO.db
Populating go table: go table filled
Populating go_all table: go_all table filled
Creating package in ./org.Dmelanogaster.eg.db
Now deleting temporary database file
[1] "org.Dmelanogaster.eg.sqlite"
> install.packages("org.Dmelanogaster.eg.db/", repos=NULL)
* installing *source* package ‘org.Dmelanogaster.eg.db’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (org.Dmelanogaster.eg.db)

Also note that genus and species are separate arguments: here the genus is "Drosophila" and the species is "melanogaster". In your call you passed the phylum ("Mollusca") as the genus and the full binomial as the species; for Radix balthica that should presumably be genus = "Radix" and species = "balthica". The maintainer also has to be something like "me <me@mine.org>", with the angle brackets and all that, or the package won't install correctly.
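As a rough sketch of the argument formats described above, the helpers below split a binomial name into genus and species and check that a maintainer string has the "Name <address>" shape. All three function names are made up for illustration and are not part of AnnotationForge, and the package-name pattern is only inferred from the org.Dmelanogaster.eg.db output above, so treat it as an assumption.

```r
## Hypothetical helpers for illustration only; not part of AnnotationForge.

## Split "Genus species" into its two parts.
split_binomial <- function(name) {
    name <- gsub("^[[:space:]]+|[[:space:]]+$", "", name)  # trim whitespace
    parts <- strsplit(name, "[[:space:]]+")[[1]]
    if (length(parts) != 2L)
        stop("expected 'Genus species', got: ", name)
    list(genus = parts[1], species = parts[2])
}

## TRUE if x looks like "Some Name <user@host>".
looks_like_maintainer <- function(x) {
    grepl("^[^<>]+ <[^<>@[:space:]]+@[^<>@[:space:]]+>$", x)
}

## Assumed naming pattern, inferred from org.Dmelanogaster.eg.db.
org_package_name <- function(genus, species) {
    paste0("org.", substr(genus, 1, 1), species, ".eg.db")
}

sp <- split_binomial("Radix balthica")
pkg <- org_package_name(sp$genus, sp$species)
```

The values in sp$genus and sp$species could then be passed as the genus and species arguments of makeOrgPackageFromNCBI.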
I'll send a patch to Marc Carlson, who is the maintainer for AnnotationForge, and hopefully we will get an updated version pushed in the next day or so.
In the meantime, if you are impatient, you can download the source package and change the function .primaryFiles() in NCBI_ftp.R to be like this:
.primaryFiles <- function(){
    list(
        "gene2pubmed.gz" = c("tax_id","gene_id", "pubmed_id"),
        "gene2accession.gz" = c("tax_id","gene_id","status","rna_accession",
            "rna_gi","protein_accession","protein_gi","genomic_dna_accession",
            "genomic_dna_gi","genomic_start","genomic_end","orientation",
            "assembly","peptide_accession","peptide_gi","symbol"),
        ## This one might be needed later
        "gene2refseq.gz" = c("tax_id","gene_id","status","rna_accession",
            "rna_gi","protein_accession","protein_gi","genomic_dna_accession",
            "genomic_dna_gi","genomic_start","genomic_end","orientation",
            "assembly","peptide_accession","peptide_gi","symbol"),
        "gene2unigene" = c("gene_id","unigene_id"),
        "gene_info.gz" = c("tax_id","gene_id","symbol","locus_tag",
            "synonyms","dbXrefs","chromosome","map_location","description",
            "gene_type","nomenclature_symbol","nomenclature_name",
            "nomenclature_status","other_designations", "modification_date"),
        ## "mim2gene.gz" = c("mim_id","gene_id","relation_type"),
        ## "gene_refseq_uniprotkb_collab.gz" =
        ##     c("refseq_id","uniprot_id"),
        "gene2go.gz" = c("tax_id","gene_id","go_id","evidence",
            "go_qualifier", "go_description","pubmed_id","category")
    )
}

Save that file, and then you should be able to do
install.packages("AnnotationForge", repos = NULL, type = "source")

at an R prompt, where AnnotationForge is in your working directory.
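To check whether a downloaded file has the number of columns a given name vector expects, a minimal sketch along these lines may help. The function check_columns is a made-up name, not an AnnotationForge function; the demo file is fabricated stand-in data rather than a real NCBI download; and the 13-name vector is just the first 13 of the 16 names above, used only to show a mismatch, not the actual old column list.

```r
## Hypothetical helper: compare the number of tab-separated fields in the
## first data line of a (possibly gzipped) file with the expected names.
check_columns <- function(path, expected) {
    con <- gzfile(path, "rt")   # gzfile also reads uncompressed files
    on.exit(close(con))
    lines <- readLines(con, n = 50)
    lines <- lines[!grepl("^#", lines) & nzchar(lines)]  # skip comment/empty lines
    if (length(lines) == 0L)
        stop("no data lines found in ", path)
    tabs <- gregexpr("\t", lines[1], fixed = TRUE)[[1]]
    n_found <- if (tabs[1] == -1L) 1L else length(tabs) + 1L
    list(found = n_found,
         expected = length(expected),
         ok = n_found == length(expected))
}

new_cols <- c("tax_id","gene_id","status","rna_accession",
    "rna_gi","protein_accession","protein_gi","genomic_dna_accession",
    "genomic_dna_gi","genomic_start","genomic_end","orientation",
    "assembly","peptide_accession","peptide_gi","symbol")
old_cols <- new_cols[1:13]   # illustration only, not the real old list

## Made-up demo file with one comment line and one 16-field data line.
demo <- tempfile(fileext = ".gz")
con <- gzfile(demo, "wt")
writeLines(c("#made-up header", paste(rep("-", 16), collapse = "\t")), con)
close(con)

res_old <- check_columns(demo, old_cols)   # 16 fields vs 13 names
res_new <- check_columns(demo, new_cols)   # 16 fields vs 16 names
```

Pointing check_columns at the real gene2accession.gz in your NCBIFilesDir, with the name vector from .primaryFiles(), would show whether the two still agree after any future format change.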