Question

OrganismDbi throws error for Oryza sativa indica org.db preparation

rohitsatyam102 ▴ 20

@rohitsatyam102-24390

India

Hi!

I was trying to use the OrganismDbi package to build an annotation package for Oryza sativa indica.

I used the following code chunk, but it throws an error at the end:


library(OrganismDbi)
oindica <- makeOrganismDbFromBiomart(biomart="plants_mart",
                                     dataset="oindica_eg_gene",
                                     transcript_ids=NULL,
                                     circ_seqs=NULL,
                                     filter="",
                                     id_prefix="ensembl_",
                                     host="plants.ensembl.org",
                                     port=80,
                                     miRBaseBuild=NA,
                                     keytype = "ENSEMBL",
                                     orgdb = NA)

Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... Error in GenomeInfoDb:::lookup_tax_id_by_organism(organism) : 
  Oryza indica: unknown organism. Please use 'loadTaxonomyDb()' to see viable genus/species and taxonomy IDs.

How can I tell GenomeInfoDb the correct taxonomy ID for Oryza? I need this package in order to run the clusterProfiler package for GO term analysis.

Also, GenomeInfoDb:::lookup_tax_id_by_organism(organism) does not work and has no manual page to refer to.

GenomeInfoDb OrganismDbi biomaRt • 1.9k views

ADD COMMENT • link updated 4.2 years ago by James W. MacDonald 67k • written 4.2 years ago by rohitsatyam102 ▴ 20

score 2 · Accepted Answer · 2020-10-30

James W. MacDonald 67k

@james-w-macdonald-5106

United States

It's a bit trickier than simply adding something to GenomeInfoDb. The basic idea behind makeOrganismDbFromBiomart is that you first make the TxDb from the Biomart server, which by default will have Ensembl IDs as the central key. Then, if you don't specify an OrgDb, the AnnotationHub package will be used to see if there are any OrgDb packages available for the given taxonomic ID.

You are getting jammed up at the step where the correct taxonomic ID is supposed to be inferred from the TxDb you made. But that's not really the problem. The issue instead is that there isn't an OrgDb package on the AnnotationHub for you to use anyway.

> library(AnnotationHub)
 Loading required package: BiocFileCache
 Loading required package: dbplyr

 Attaching package: ‘AnnotationHub’

 The following object is masked from ‘package:Biobase’:

     cache

 > hub <- AnnotationHub()
   |======================================================================| 100%

 snapshotDate(): 2020-10-26
 > query(hub, c("oryza","orgdb"))
 AnnotationHub with 0 records
 # snapshotDate(): 2020-10-26

So there's no OrgDb for rice at all, not to mention your particular strain.

And without an OrgDb you are stuck anyway. There is a function in AnnotationDbi called makeOrgDbFromNCBI that you can use to make the OrgDb, assuming that NCBI has a reasonable amount of data for that species. It takes forever to run, so I am not going to do an example. But even if you were to do that, the central ID for that OrgDb would be NCBI Gene IDs, so you would want a TxDb that has the same ID structure (what you get from Biomart are Ensembl Gene IDs). Unfortunately NCBI only has a GTF for Oryza sativa Japonica, not Indica. So hypothetically you could use makeTxDbFromGFF to make the TxDb for Japonica and assume it's close enough?

Ensembl does have the genome for O. sativa indica, but there isn't a way that I know of to generate an OrgDb from Biomart.

So probably the best you will be able to do is generate the TxDb and OrgDb from NCBI data, ignoring the strain differences, and then use makeOrganismDbFromTxDb using the TxDb and OrgDb that you have created.
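The last step above could be sketched roughly like this. It is a minimal sketch under stated assumptions, not a tested recipe: `build_rice_organismdb`, its argument names, and the default `keytype` are made up for illustration; it assumes you already have the Japonica GTF on disk and an OrgDb object in memory, and that the gene IDs in the GTF really do match the OrgDb column you name in `keytype` (you would have to check that).

```r
# Sketch only: combine a TxDb built from a GTF/GFF file with an existing OrgDb.
# Assumptions (not from the thread): the function name, the argument names,
# and the default keytype are illustrative. The gene IDs in the GTF must
# match the OrgDb column given in `keytype`, otherwise the join is useless.
build_rice_organismdb <- function(gtf_path, orgdb, keytype = "ENTREZID") {
  stopifnot(file.exists(gtf_path))

  # makeTxDbFromGFF() lived in GenomicFeatures when this thread was written;
  # recent Bioconductor releases moved it to the txdbmaker package.
  txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_path,
                                           organism = "Oryza sativa")

  # Passing orgdb explicitly skips the AnnotationHub lookup described above.
  OrganismDbi::makeOrganismDbFromTxDb(txdb, keytype = keytype, orgdb = orgdb)
}
```

Calling it would look like `odb <- build_rice_organismdb("japonica.gtf", my_orgdb)`, where both the file name and `my_orgdb` are placeholders for whatever you built in the earlier steps.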

ADD COMMENT • link 4.2 years ago James W. MacDonald 67k

Hi James W. MacDonald

I buy all of that, so I tried to find a way around it. I downloaded the GO annotations from the Ensembl Plants BioMart in .csv format and then processed them with the AnnotationForge package to generate org.Oindica.eg.db, which is currently hosted here. I needed it for GO term enrichment analysis. I would be much obliged if you could take some time to see whether it works well. If it does, we would like to add it to the OrgDb.
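For readers who want to do the same, building an OrgDb from a GO table with AnnotationForge might look something like the sketch below. The toy gene IDs, the `IEA` evidence code, the maintainer string, and the wrapper name `build_indica_orgdb` are all placeholders and not what was actually used for the hosted package; the real input would be the BioMart .csv export reshaped into these columns.

```r
# Toy stand-ins for the tables exported from BioMart (hypothetical gene IDs).
gene_info <- data.frame(
  GID = c("geneA", "geneB"),
  DESCRIPTION = c("toy gene A", "toy gene B"),
  stringsAsFactors = FALSE
)
go_table <- data.frame(
  GID = c("geneA", "geneA", "geneB"),
  GO = c("GO:0000723", "GO:0006261", "GO:0006261"),
  EVIDENCE = "IEA",  # placeholder evidence code
  stringsAsFactors = FALSE
)

# makeOrgPackage() expects every table to start with a GID column and to
# contain no duplicated rows; the table named in goTable must have exactly
# the columns GID, GO, and EVIDENCE.
go_table <- unique(go_table)

# Hypothetical wrapper so that nothing is written to disk until it is called.
build_indica_orgdb <- function(gene_info, go_table, out_dir = tempdir()) {
  AnnotationForge::makeOrgPackage(
    gene_info = gene_info,
    go = go_table,
    version = "0.1",
    maintainer = "Your Name <you@example.org>",
    author = "Your Name <you@example.org>",
    outputDir = out_dir,
    tax_id = "39946",    # NCBI taxonomy ID for Oryza sativa Indica Group
    genus = "Oryza",
    species = "indica",  # gives the package name org.Oindica.eg.db
    goTable = "go"
  )
}
```

`build_indica_orgdb(gene_info, go_table)` should write the package source under `out_dir`, from where it can be installed with `install.packages(path, repos = NULL, type = "source")`.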

ADD REPLY • link 4.2 years ago rohitsatyam102 ▴ 20

If all you need to do is a GO enrichment analysis, why do you want an OrganismDb? That sort of object is intended to facilitate mappings between genome-based annotations and functional-based annotations. An OrgDb like you have already built is sufficient for GO stuff. Unless you want to use topGO, which won't work with your package without some work on your part. But GOstats is fine with it.

> library(GOstats)
> univ <- keys(org.Oindica.eg.db)
> gns <- univ[sample(1:length(univ), 150)]
> p <- new("GOHyperGParams", geneIds = gns, universeGeneIds = univ, ontology = "BP", annotation = "org.Oindica.eg.db", conditional = TRUE)
> hyp <- hyperGTest(p)
'select()' returned 1:many mapping between keys and columns
> summary(hyp)
       GOBPID      Pvalue OddsRatio    ExpCount Count Size
1  GO:0000723 0.001842305  37.66250 0.064920499     2   12
2  GO:0042023 0.002169637  34.23636 0.070330540     2   13
3  GO:0006261 0.002350351  12.40699 0.265092037     3   49
4  GO:0010948 0.003730246  25.10000 0.091970707     2   17
5  GO:0042559 0.005156892  20.91250 0.108200831     2   20
6  GO:0010569 0.005410042       Inf 0.005410042     1    1
7  GO:0033314 0.005410042       Inf 0.005410042     1    1
8  GO:0051782 0.005410042       Inf 0.005410042     1    1
9  GO:0070716 0.005410042       Inf 0.005410042     1    1
10 GO:0080119 0.005410042       Inf 0.005410042     1    1
11 GO:0010256 0.005679807  19.81053 0.113610873     2   21
12 GO:0035335 0.006225966  18.81875 0.119020914     2   22
                                                                                            Term
1                                                                           telomere maintenance
2                                                                          DNA endoreduplication
3                                                                  DNA-dependent DNA replication
4                                                      negative regulation of cell cycle process
5                                             pteridine-containing compound biosynthetic process
6                          regulation of double-strand break repair via homologous recombination
7                                                             mitotic DNA replication checkpoint
8                                                           negative regulation of cell division
9  mismatch repair involved in maintenance of fidelity involved in DNA-dependent DNA replication
10                                                                          ER body organization
11                                                              endomembrane system organization
12

If you want to use topGO, then you probably have to generate your own version of annFUN.org that will generate the correct SQL query from your SQLite database. Which might be fun if you are into that sort of thing, but otherwise GOstats is what I normally use, and it works, so that's what I would probably do.
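A possibly simpler route than rewriting annFUN.org, and a different technique from the one described above, is topGO's own `annFUN.gene2GO`, which takes a plain named list of GO IDs per gene. The sketch below assumes the AnnotationForge-built package exposes `GID` and `GO` columns (worth checking with `columns(org.Oindica.eg.db)`); the helper names and toy IDs are made up for illustration, and none of this has been run against the actual package.

```r
# Turn a two-column GID/GO table into the named list that annFUN.gene2GO wants.
table_to_gene2GO <- function(go_map) {
  go_map <- unique(go_map[!is.na(go_map$GO), c("GID", "GO")])
  split(go_map$GO, go_map$GID)
}

# Pull the mapping out of an OrgDb (assumes GID is the central key).
orgdb_to_gene2GO <- function(orgdb) {
  go_map <- AnnotationDbi::select(orgdb,
                                  keys = AnnotationDbi::keys(orgdb),
                                  columns = "GO", keytype = "GID")
  table_to_gene2GO(go_map)
}

# Build the topGOdata object; `interesting` is a character vector of gene IDs.
make_topgo_data <- function(gene2GO, interesting, ontology = "BP") {
  library(topGO)  # topGO needs to be attached, not just loaded
  all_genes <- factor(as.integer(names(gene2GO) %in% interesting),
                      levels = c(0, 1))
  names(all_genes) <- names(gene2GO)
  new("topGOdata", ontology = ontology, allGenes = all_genes,
      annot = annFUN.gene2GO, gene2GO = gene2GO)
}

# Toy table standing in for the select() result (hypothetical gene IDs).
toy_map <- data.frame(
  GID = c("geneA", "geneA", "geneB", "geneC", "geneC"),
  GO = c("GO:0000723", "GO:0006261", "GO:0006261", NA, "GO:0042023"),
  stringsAsFactors = FALSE
)
gene2GO <- table_to_gene2GO(toy_map)
```

With the real package this would be something like `make_topgo_data(orgdb_to_gene2GO(org.Oindica.eg.db), gns)`, followed by the usual `runTest()` call.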

ADD REPLY • link 4.2 years ago James W. MacDonald 67k