I'm trying to do pathway analysis for RNA-seq data (KEGG, GSEA, GO). My organism is Bacteroides thetaiotaomicron and I'm having trouble finding an annotation database. AnnotationHub has only one record for it (an Inparanoid8Db object), which, from what I can find, is old and not usable with packages for pathway analysis:
> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2020-09-03
> query(hub, "Bacteroides thetaiotaomicron")
> AnnotationHub with 1 record
> # snapshotDate(): 2020-09-03
> # names(): AH10465
> # $dataprovider: Inparanoid8
> # $species: Bacteroides thetaiotaomicron
> # $rdataclass: Inparanoid8Db
> # $rdatadateadded: 2014-03-31
> # $title: hom.Bacteroides_thetaiotaomicron.inp8.sqlite
> # $description: Inparanoid 8 annotations about Bacteroides thetaiotaomicron
> # $taxonomyid: 226186
> # $genome: inparanoid8 genomes
> # $sourcetype: Inparanoid
> # $sourceurl: http://inparanoid.sbc.su.se/download/current/Orthologs/B.thetaiotaomicron
> # $sourcesize: NA
> # $tags: c("Inparanoid", "Gene", "Homology", "Annotation")
> # retrieve record with 'object[["AH10465"]]'
I also tried creating a library using AnnotationForge, which gave me files containing seemingly all organisms and produced an error message:
> makeOrgPackageFromNCBI(version = "0.1",
+ author = "Some One <so@someplace.org>",
+ maintainer = "Some One <so@someplace.org>",
+ outputDir = getwd(),
+ tax_id = "226186",
+ genus = "Bacteroides",
+ species = "thetaiotaomicron")
Error in prepareDataFromNCBI(tax_id, NCBIFilesDir, outputDir, rebuildCache, :
no information found for species with tax id 226186
Another problem I have is that the NCBI entry for Bacteroides thetaiotaomicron has been "re-annotated". My experimental data identifies genes by their new locus tags, but all the pathway analysis packages require KEGG, ncbi-geneid, ncbi-proteinid or uniprot input IDs. I don't know how to convert the IDs because conversion software does not use the new locus tags.
How can I get around these issues to download or create an annotation database? And what can I do about the mismatch between IDs? Thanks!
Hi, what do you mean by 're-annotated'? On the other point, can you share some of these locus tags? Does either of the two previous answers here help: https://support.bioconductor.org/p/131067/#131118
According to this page: https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/faq/#FAQ1, NCBI is re-annotating bacterial genomes. This is the only information I could find.
I actually found a Bioconductor package (GAGE) that can generate a database for B. thetaiotaomicron. However, my gene ID types are still mismatched.
Here is an example of a gene in my data set: BTRS01505. If you search this gene on NCBI, this is the entry: https://www.ncbi.nlm.nih.gov/gene/1075506. NCBI notes "this record has been discontinued." The Gene Symbol and the Locus Tag are both BTRS01505. The Old Locus Tag is BT0307 and the Gene ID is 1075506. This is a problem for me because my data file (csv) has the new locus tag BTRS01505 but the GAGE database has KEGG IDs (which correspond to the old locus tag) and Entrez IDs (which correspond to NCBI Gene ID). Therefore, BTRS01505 is identified as either BT0307 or 1075506 in the database.
The NCBI genome entry (https://www.ncbi.nlm.nih.gov/genome/1093?genomeassemblyid=300528) has three annotation files. I think the GFF and GenBank files have all gene ID types while the csv file only contains the new locus tags. If the csv file had either the old locus tags or the gene IDs, I could easily add those to my data file. I don't know if it's possible for me to use the GFF or GenBank file to convert the IDs or add the IDs to my data file.
I have tried online gene ID converters and have not had success because I don't think they recognize the new locus tags.
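For what it's worth, the GFF route mentioned above can be sketched in base R. This is only a sketch under assumptions: it assumes the gene rows in the RefSeq GFF3 file carry `locus_tag`, `old_locus_tag` and (possibly) `Dbxref=GeneID:...` attributes in column 9, which should be checked in the actual file. The helper names, the sequence ID `chr1`, the second gene `BTRS99999` and the `log2FC` values are made up for illustration; only BTRS01505 / BT0307 / 1075506 come from the example above.

```r
# Return the first capture group of `pattern` for each element, or NA if no match.
extract_first <- function(x, pattern) {
  out <- rep(NA_character_, length(x))
  hit <- grepl(pattern, x, perl = TRUE)
  out[hit] <- sub(pattern, "\\1", x[hit], perl = TRUE)
  out
}

# Pull one key=value attribute out of GFF3 column 9 (hypothetical helper).
# The key must start the string or follow a ';', so "locus_tag" does not
# accidentally match inside "old_locus_tag".
get_attr <- function(attrs, key) {
  extract_first(attrs, paste0("^(?:.*;)?", key, "=([^;]*).*$"))
}

# Toy GFF3 content standing in for the real file (rows are illustrative).
# With the downloaded file, one would instead call e.g.
#   gff <- read.delim("path/to/annotation.gff", header = FALSE,
#                     comment.char = "#", quote = "", stringsAsFactors = FALSE)
gff_lines <- c(
  "##gff-version 3",
  paste("chr1", "RefSeq", "gene", "100", "900", ".", "+", ".",
        "ID=gene-BTRS01505;Dbxref=GeneID:1075506;Name=BTRS01505;locus_tag=BTRS01505;old_locus_tag=BT0307",
        sep = "\t"),
  paste("chr1", "RefSeq", "CDS", "100", "900", ".", "+", "0",
        "ID=cds-1;Parent=gene-BTRS01505;locus_tag=BTRS01505",
        sep = "\t"),
  paste("chr1", "RefSeq", "gene", "1000", "1500", ".", "-", ".",
        "ID=gene-BTRS99999;Name=BTRS99999;locus_tag=BTRS99999",
        sep = "\t")
)
gff <- read.delim(text = gff_lines, header = FALSE, comment.char = "#",
                  quote = "", stringsAsFactors = FALSE)

# Keep gene rows only and build the ID mapping table.
genes <- gff[gff$V3 == "gene", ]
id_map <- data.frame(
  locus_tag     = get_attr(genes$V9, "locus_tag"),
  old_locus_tag = get_attr(genes$V9, "old_locus_tag"),
  entrez_id     = extract_first(get_attr(genes$V9, "Dbxref"),
                                "^(?:.*,)?GeneID:([0-9]+).*$"),
  stringsAsFactors = FALSE
)

# Stand-in for the experimental CSV, keyed by the new locus tags.
counts <- data.frame(locus_tag = c("BTRS01505", "BTRS99999"),
                     log2FC = c(1.2, -0.4),
                     stringsAsFactors = FALSE)

# Left join: every row of the data is kept; unmapped genes get NA.
annotated <- merge(counts, id_map, by = "locus_tag", all.x = TRUE)
print(annotated)
```

Genes without an `old_locus_tag` (or without a GeneID) end up as `NA` and would need to be dropped or handled separately before pathway analysis. If a gene lists several old locus tags in one attribute, this sketch keeps them as a single string, so that case would need extra splitting.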
This seems like a nightmare. Is it not possible to re-process the data using the correct annotation?