Question

Error with makeOrgPackageFromNCBI - no information found for species with tax id


irene.artuso • 0

@2c39dd6f


Italy

Hi!

I am trying to use makeOrgPackageFromNCBI() to build my own organism annotation package for Acinetobacter baumannii ACICU (taxid: 405416), but I got the following error "Error in prepareDataFromNCBI(tax_id, NCBIFilesDir, outputDir, rebuildCache, : no information found for species with tax id 405416".

> library(AnnotationForge)
> makeOrgPackageFromNCBI(version = "0.1",
+                        author = "Irene <myemail@xxx.it>",
+                        maintainer = "Irene <myemail@xxx.it>",
+                        outputDir = ".",
+                        tax_id = "405416",
+                        genus = "Acinetobacter",
+                        species = "baumannii")
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
Error in prepareDataFromNCBI(tax_id, NCBIFilesDir, outputDir, rebuildCache,  : 
  no information found for species with tax id 405416

I'd appreciate any feedback!

Thanks in advance,

Irene

makeOrgPackageFromNCBI • 644 views

ADD COMMENT • link updated 14 months ago by James W. MacDonald 65k • written 14 months ago by irene.artuso • 0

score 0 · Answer 1 · 2023-03-06


James W. MacDonald 65k

@james-w-macdonald-5106


United States

It means what it says - there is no information for that species that you can get from those files from NCBI.

ADD COMMENT • link 14 months ago James W. MacDonald 65k


So, how can I perform a GSEA analysis if I have gene names expressed as "locus_tag" from the NCBI Genbank file, and I cannot download the database of that specific strain but only that of Acinetobacter baumannii?

Thanks!

ADD REPLY • link 14 months ago irene.artuso • 0


That's a tough one. What you need are mappings from those locus tags to whatever ontology you want to use (GO or KEGG, presumably). Unfortunately, what NCBI appears to have are Gene IDs for the species (although they say it's strain K09-14? I know nothing about all the various species for this bacterium). Anyway, there appears to be some information about taxid 470 in the data downloads.

$ awk '$1 == 470' gene_info | wc -l
3733
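
The same check can be run from R before calling makeOrgPackageFromNCBI(), to see whether a tax id has any rows at all. This is only a rough sketch: it assumes a tab-delimited gene_info-style file with the tax id in the first column (as the awk command above implies), `count_tax_id` is a hypothetical helper name, and the small file is made-up stand-in data. The real gene_info file is very large, so awk is probably the more practical tool for it.

```r
# Hypothetical helper: count rows of a gene_info-style file for one tax id.
# Assumes a tab-delimited file with the tax id in the first column.
count_tax_id <- function(path, tax_id) {
  first_col <- read.delim(path, header = FALSE, comment.char = "#",
                          quote = "", colClasses = "character")[[1]]
  sum(first_col == as.character(tax_id))
}

# Made-up stand-in for the real (much larger) gene_info file.
tmp <- tempfile()
writeLines(c("#tax_id\tGeneID\tSymbol",
             "470\t1001\tgeneA",
             "470\t1002\tgeneB",
             "9606\t1\tA1BG"), tmp)

count_tax_id(tmp, 470)     # rows for the species-level tax id
count_tax_id(tmp, 405416)  # no rows for the strain-level tax id in this toy file
```

A count of zero for the strain-level id would be consistent with the "no information found" error in the original post.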

The GO mappings will come from a file downloaded from UniProt, and searching on their site for that species brings up results for multiple different strains. But if you use 470 as the taxid, there seem to be quite a few mappings. When you run makeOrgPackageFromNCBI it will download all the files, parse them, and put the data in a SQLite database called 'NCBI.sqlite'. If you re-run that function and specify rebuildCache = FALSE, you will simply re-use that SQLite database (which is what you should do!). Anyway, there are lots of GO mappings:

> library(RSQLite)
> con <- dbConnect(SQLite(), "NCBI.sqlite")
> dbGetQuery(con, "select * from altGO limit 5;")
  EntrezGene                     GO NCBItaxon
1    2947773             GO:0046782    654924
2    2947774 GO:0033644; GO:0016020    654924
3    4156251                           345201
4    4156252                           345201
5    2947775                           654924

## how many for taxid 470?
> dbGetQuery(con, "select count(*) from altGO where NCBItaxon='470';")
  count(*)
1    92413

## but some might be missing. How many have at least one GO term?
> z <- dbGetQuery(con, "select * from altGO where NCBItaxon='470';")
> sum(z$GO != "")
[1] 62873
## that seems reasonable.
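
Note that the GO column can hold several terms in one string, separated by "; ". For enrichment analysis you would probably want one row per gene/GO-term pair. A minimal base-R sketch of that reshaping, using three rows copied from the altGO output above (the name `gene2go` is just an illustrative choice):

```r
# Rows copied from the altGO output shown above.
alt_go <- data.frame(
  EntrezGene = c("2947773", "2947774", "4156251"),
  GO = c("GO:0046782", "GO:0033644; GO:0016020", ""),
  stringsAsFactors = FALSE
)

# Drop genes with no GO term, then split the ";"-separated strings.
annotated <- alt_go[alt_go$GO != "", ]
terms <- strsplit(annotated$GO, ";\\s*")

# One row per gene/GO-term pair.
gene2go <- data.frame(
  EntrezGene = rep(annotated$EntrezGene, lengths(terms)),
  GO = unlist(terms),
  stringsAsFactors = FALSE
)
gene2go
```

On the real data, `alt_go` would be the result of the dbGetQuery() call for taxid 470 rather than a hand-typed data frame.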

The only remaining trick is to map whatever ID you have to something that will be in the resulting OrgDb. I don't know what a locus_tag is, but hopefully it's a GenBank or RefSeq tag that you can match to NCBI Gene IDs.
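
One possible route: NCBI's gene_info file has a LocusTag column next to GeneID, so if the locus tags in your GenBank file happen to match the ones NCBI records for taxid 470, a simple lookup might be enough. The sketch below only illustrates the idea. The data frame and the tags in it are made up, and whether ACICU locus tags actually appear under taxid 470 is an assumption you would need to check.

```r
# Made-up gene_info excerpt; only the columns needed here.
gene_info <- data.frame(
  tax_id   = c("470", "470", "470"),
  GeneID   = c("1001", "1002", "1003"),
  LocusTag = c("TAG_0001", "TAG_0002", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical locus tags from your own data.
my_tags <- c("TAG_0002", "TAG_9999", "TAG_0001")

# Named lookup vector: locus tag -> Gene ID ("-" marks a missing tag).
has_tag <- gene_info$LocusTag != "-"
tag2gene <- setNames(gene_info$GeneID[has_tag], gene_info$LocusTag[has_tag])

mapped <- data.frame(
  locus_tag = my_tags,
  GeneID = unname(tag2gene[my_tags]),
  stringsAsFactors = FALSE
)
mapped                      # unmatched tags come back as NA
mean(is.na(mapped$GeneID))  # fraction of tags that failed to map
```

If most tags come back as NA, the tags probably belong to a different annotation of the genome, and you would need some other bridge (protein accessions, for example) to get to Gene IDs.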

ADD REPLY • link 14 months ago James W. MacDonald 65k