Bug in makeOrgPackageFromNCBI from AnnotationForge?

1

Entering edit mode

Marco Blanchette ▴ 220

@marco-blanchette-5439

Last seen 11.2 years ago

United States/Kansas City/Stowers Insti…

I am working on a project involving Schizosaccharomyces pombe as a source for genomic analysis and love to use ReportingTools html producing wrappers. However, I am struggling as there is no AnnotationDbi package available for this organism. I decided to finally take the plunge and try to see if I could be one myself using AnnotationForge and was quite exciting to find that I could perhaps melt one simply by using the makeOrgPackageFromNCBI(). Most likely, something went wrong and I suspect a bug somewhere in the pipeline. I have not dug deeper then trying to build the package and use it hoping that someone closer to the code could shed some lights. Here the steps I took:' > library(AnnotationForge) > makeOrgPackageFromNCBI(version = "0.1", author = "Marco Blanchette <mab at="" stowers.org="">", maintainer = "Marco Blanchette <mab at="" stowers.org="">", outputDir = ".", tax_id = "4896", genus = "Schizosaccharomyces", species = "pombe") This step succeeded with only a warning: Warning message: In .makeSimpleTable(ug, table = "unigene", con) : no values found for table unigene in this data chunk. I didn't think this was critical enough to raise any red flag, so I then proceeded with the installation that went smoothly > library(devtools) > install('org.Spombe.eg.db') > library('org.Spombe.eg.db') Then I try to use it with ReportingTools publish() but fail as it returns an error related to Entrez ID which I had a conversion table from biomaRt. I dug a bit deeper and found that none of the genes I was querying were in the database to finally realize that there was only 38 entries int the org.Spombe.eg.db database I had just created and installed... Check this out: > keytypes(org.Spombe.eg.db) [1] "ENTREZID" "ACCNUM" "ALIAS" "CHR" "PMID" "REFSEQ" [7] "SYMBOL" "UNIGENE" "GENENAME" "GO" "EVIDENCE" "ONTOLOGY" Looking good! However: > length(keys(org.Spombe.eg.db,'ENTREZID')) [1] 38 Can someone close enough to the code shed some lights has to whether there is a bug in AnnotationForge or whether it is the NCBI database that is not conforming to what is expected? For instance, biomart has 5117 entrez ID > library(biomaRt) > mart <- useMart("fungi_mart_18","spombe_eg_gene") > ensembl2entrez <- getBM(c('ensembl_gene_id','entrezgene'),mart=mart) > sum(!is.na(ensembl2entrez$entrezgene)) [1] 5117 The ids I tested on the NCBI website return the correct genes. However, only 10 of the AnnotationForge EntrezID (out of a skirmish 38 ids) are found in biomaRt > sum(keys(org.Spombe.eg.db,'ENTREZID') %in% ensembl2entrez$entrezgene) [1] 10 Again, I would appreciate any comments or suggestions as to whether this is a bug or something I did wrong or a miss alignment between the NCBI S. pombe annotation and what is expected by AnnotationForge. Thanks -- Marco Blanchette, Ph.D. Assistant Investigator Stowers Institute for Medical Research 1000 East 50th St. Kansas City, MO 64110 Tel: 816-926-4071 Cell: 816-726-8419 Fax: 816-926-2018

Alignment Annotation Organism Schizosaccharomyces pombe biomaRt AnnotationForge Alignment • 2.3k views

ADD COMMENT • link updated 12.4 years ago by Marc Carlson ★ 7.2k • written 12.4 years ago by Marco Blanchette ▴ 220

0

Entering edit mode

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 9.5 years ago

United States

Hi Marco, So the function you are using is downloading this file from NCBI: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz And if you grab that file and grep for the lines that start with your tax ID you will find about 38 lines in it to indicate unique entrez gene IDs. That means that we currently only have 38 gene IDs to parse from NCBI (at least from the files that they are giving us). Its frustrating for me too that fission yeast is not better, but this is what we have. :( I am already planning something more general so that when this happens you will not be stuck using just one annotation resource for org packages, but unfortunately it is not going to be finished tomorrow. But hopefully I will have it in time to be in the next release. Marc On 08/23/2013 07:24 PM, Blanchette, Marco wrote: > I am working on a project involving Schizosaccharomyces pombe as a source for genomic analysis and love to use ReportingTools html producing wrappers. However, I am struggling as there is no AnnotationDbi package available for this organism. I decided to finally take the plunge and try to see if I could be one myself using AnnotationForge and was quite exciting to find that I could perhaps melt one simply by using the makeOrgPackageFromNCBI(). Most likely, something went wrong and I suspect a bug somewhere in the pipeline. I have not dug deeper then trying to build the package and use it hoping that someone closer to the code could shed some lights. Here the steps I took:' > >> library(AnnotationForge) >> makeOrgPackageFromNCBI(version = "0.1", > author = "Marco Blanchette <mab at="" stowers.org="">", > maintainer = "Marco Blanchette <mab at="" stowers.org="">", > outputDir = ".", > tax_id = "4896", > genus = "Schizosaccharomyces", > species = "pombe") > > This step succeeded with only a warning: > > Warning message: > In .makeSimpleTable(ug, table = "unigene", con) : > no values found for table unigene in this data chunk. > > I didn't think this was critical enough to raise any red flag, so I then proceeded with the installation that went smoothly > >> library(devtools) >> install('org.Spombe.eg.db') >> library('org.Spombe.eg.db') > Then I try to use it with ReportingTools publish() but fail as it returns an error related to Entrez ID which I had a conversion table from biomaRt. I dug a bit deeper and found that none of the genes I was querying were in the database to finally realize that there was only 38 entries int the org.Spombe.eg.db database I had just created and installed... Check this out: > >> keytypes(org.Spombe.eg.db) > [1] "ENTREZID" "ACCNUM" "ALIAS" "CHR" "PMID" "REFSEQ" > [7] "SYMBOL" "UNIGENE" "GENENAME" "GO" "EVIDENCE" "ONTOLOGY" > > Looking good! However: > >> length(keys(org.Spombe.eg.db,'ENTREZID')) > [1] 38 > > Can someone close enough to the code shed some lights has to whether there is a bug in AnnotationForge or whether it is the NCBI database that is not conforming to what is expected? For instance, biomart has 5117 entrez ID > >> library(biomaRt) >> mart <- useMart("fungi_mart_18","spombe_eg_gene") >> ensembl2entrez <- getBM(c('ensembl_gene_id','entrezgene'),mart=mart) >> sum(!is.na(ensembl2entrez$entrezgene)) > [1] 5117 > > The ids I tested on the NCBI website return the correct genes. However, only 10 of the AnnotationForge EntrezID (out of a skirmish 38 ids) are found in biomaRt > >> sum(keys(org.Spombe.eg.db,'ENTREZID') %in% ensembl2entrez$entrezgene) > [1] 10 > > Again, I would appreciate any comments or suggestions as to whether this is a bug or something I did wrong or a miss alignment between the NCBI S. pombe annotation and what is expected by AnnotationForge. > > Thanks

ADD COMMENT • link 12.4 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Hi Marc, Darn, you're right? there is only 39 entries for S. pombe in the gene_info file? $ wget -qO- ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz |gunzip -c | awk '$1~/^4896$/{print $0}' |wc -l 39 And this is despite the fact that there is 13,211 gene entries associated with this organism (taken from the taxonomy browser web page)!! Is that a bug that should be reported to NCBI? The fact that their publicly available data files don't reflect their internal representations is disconcerting. I feel there something that should be done to rectify the problem. On the other hands, thanks a lot for exploring alternative strategies to populate org database, let me know if you need me to beta test it before the next release. Marco -- Marco Blanchette, Ph.D. Stowers Institute for Medical Research 1000 East 50th Street Kansas City MO 64110 www.stowers.org On 8/26/13 7:25 PM, "Marc Carlson" <mcarlson at="" fhcrc.org=""> wrote: >Hi Marco, > >So the function you are using is downloading this file from NCBI: > >ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz > >And if you grab that file and grep for the lines that start with your >tax ID you will find about 38 lines in it to indicate unique entrez gene >IDs. That means that we currently only have 38 gene IDs to parse from >NCBI (at least from the files that they are giving us). Its frustrating >for me too that fission yeast is not better, but this is what we have. :( > >I am already planning something more general so that when this happens >you will not be stuck using just one annotation resource for org >packages, but unfortunately it is not going to be finished tomorrow. >But hopefully I will have it in time to be in the next release. > > > Marc > > >On 08/23/2013 07:24 PM, Blanchette, Marco wrote: >> I am working on a project involving Schizosaccharomyces pombe as a >>source for genomic analysis and love to use ReportingTools html >>producing wrappers. However, I am struggling as there is no >>AnnotationDbi package available for this organism. I decided to finally >>take the plunge and try to see if I could be one myself using >>AnnotationForge and was quite exciting to find that I could perhaps melt >>one simply by using the makeOrgPackageFromNCBI(). Most likely, something >>went wrong and I suspect a bug somewhere in the pipeline. I have not dug >>deeper then trying to build the package and use it hoping that someone >>closer to the code could shed some lights. Here the steps I took:' >> >>> library(AnnotationForge) >>> makeOrgPackageFromNCBI(version = "0.1", >> author = "Marco Blanchette <mab at="" stowers.org="">", >> maintainer = "Marco Blanchette >><mab at="" stowers.org="">", >> outputDir = ".", >> tax_id = "4896", >> genus = "Schizosaccharomyces", >> species = "pombe") >> >> This step succeeded with only a warning: >> >> Warning message: >> In .makeSimpleTable(ug, table = "unigene", con) : >> no values found for table unigene in this data chunk. >> >> I didn't think this was critical enough to raise any red flag, so I >>then proceeded with the installation that went smoothly >> >>> library(devtools) >>> install('org.Spombe.eg.db') >>> library('org.Spombe.eg.db') >> Then I try to use it with ReportingTools publish() but fail as it >>returns an error related to Entrez ID which I had a conversion table >>from biomaRt. I dug a bit deeper and found that none of the genes I was >>querying were in the database to finally realize that there was only 38 >>entries int the org.Spombe.eg.db database I had just created and >>installed... Check this out: >> >>> keytypes(org.Spombe.eg.db) >> [1] "ENTREZID" "ACCNUM" "ALIAS" "CHR" "PMID" "REFSEQ" >> [7] "SYMBOL" "UNIGENE" "GENENAME" "GO" "EVIDENCE" "ONTOLOGY" >> >> Looking good! However: >> >>> length(keys(org.Spombe.eg.db,'ENTREZID')) >> [1] 38 >> >> Can someone close enough to the code shed some lights has to whether >>there is a bug in AnnotationForge or whether it is the NCBI database >>that is not conforming to what is expected? For instance, biomart has >>5117 entrez ID >> >>> library(biomaRt) >>> mart <- useMart("fungi_mart_18","spombe_eg_gene") >>> ensembl2entrez <- getBM(c('ensembl_gene_id','entrezgene'),mart=mart) >>> sum(!is.na(ensembl2entrez$entrezgene)) >> [1] 5117 >> >> The ids I tested on the NCBI website return the correct genes. However, >>only 10 of the AnnotationForge EntrezID (out of a skirmish 38 ids) are >>found in biomaRt >> >>> sum(keys(org.Spombe.eg.db,'ENTREZID') %in% ensembl2entrez$entrezgene) >> [1] 10 >> >> Again, I would appreciate any comments or suggestions as to whether >>this is a bug or something I did wrong or a miss alignment between the >>NCBI S. pombe annotation and what is expected by AnnotationForge. >> >> Thanks > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.4 years ago Marco Blanchette ▴ 220

0

Entering edit mode

Marc, I find this to still be the state of affairs and wonder if you have any suggested alternative approaches.

For example, there are only 36 rows of gene_info in the org.Scerevisiae.eg.db, forged as follows:

makeOrgPackageFromNCBI(
  version = "0.0.1"
 ,outputDir = "test"
 ,tax_id = "4932"
 ,genus = "Saccharomyces"
 ,species = "cerevisiae"
 ,author = "malcolm.cook@stowers.org"
 ,maintainer = "Malcolm Cook <malcolm.cook@stowers.org>"
 ,rebuildCache=FALSE
 ,verbose=TRUE
)

FWIW: my specific aim now it to have an orgDB for sacCer that will interoperate with clusterProfiler, as per: https://support.bioconductor.org/p/118647/ - if you have any insight there it would be most welcome.

ADD REPLY • link 6.9 years ago Malcolm Cook ★ 1.6k

Login before adding your answer.