Question

The GO annotation of fungal species outside of 20 model organisms is outdated

Bettina • 0

Dear team of Bioconductor,

The GO annotation of fungal species outside of the 20 model organisms is outdated, and many genes lack GO annotations compared with what Ensembl provides through biomartr.

> bitr(x, 'SYMBOL', c("PMID", "GO", "ONTOLOGY"), botrytis)
'select()' returned 1:many mapping between keys and columns
           SYMBOL     PMID         GO ONTOLOGY
1   BCIN_01g03900 21876677       <NA>     <NA>
2   BCIN_01g03900 23104368       <NA>     <NA>
3   BCIN_01g03900 26913498       <NA>     <NA>
4   BCIN_01g03910 21876677       <NA>     <NA>
5   BCIN_01g03910 23104368       <NA>     <NA>
6   BCIN_01g03910 26913498       <NA>     <NA>
7   BCIN_01g06850 21876677 GO:0016491       MF
8   BCIN_01g06850 23104368 GO:0016491       MF
9   BCIN_01g06850 26913498 GO:0016491       MF
11  BCIN_01g09430 21876677 GO:0005576       CC
12  BCIN_01g09430 21876677 GO:0050525       MF

Could you please update the annotation of this organism in OrgDb at your earliest convenience? Thank you very much for this work and wish you and your lovely team a lovely summer, autumn, winter and spring!

Best,

Bettina


GO biomartr annotation ensemble fungi • 1.8k views

ADD COMMENT • link updated 3.0 years ago by James W. MacDonald 65k • written 3.0 years ago by Bettina • 0

You are being unnecessarily mysterious with your question/request. What OrgDb might this be? We could hypothetically try to guess what you have mapped to botrytis, but wouldn't it be easier if you were to tell us in the first place? Also please note that Bioconductor works on a semi-annual release cycle (the next release is coming up in mid May or so), and all of the annotation packages are updated only at the release. If you want an updated package sooner, there are functions in the AnnotationForge package that you can use to build one.

ADD REPLY • link 3.0 years ago James W. MacDonald 65k

Yes, I mapped to botrytis using AnnotationHub, and sorry for the unclear explanation. However, it seems that the GO and ONTOLOGY data for this less popular organism have not been updated for a long time. As I am not experienced in coding, I have no idea how to build an updated annotation package with AnnotationForge. It is good to know that the updated version will be released in May. But can you please tell me whether the annotation of this organism, which is outside of the 20 popular species, will also be updated?

Thank you very much for the information.

ADD REPLY • link 3.0 years ago Bettina • 0

score 0 · Answer 1 · 2021-04-30

James W. MacDonald 65k

If you look at the source of the data for that Db, it looks like this:

> library(AnnotationHub)
> hub <- AnnotationHub()
Bioconductor version 3.12 (BiocManager 1.30.12), ?BiocManager::install for help
> z <- query(hub, "botrytis")
> z
AnnotationHub with 4 records
# snapshotDate(): 2020-10-27
# $dataprovider: FungiDB, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Botrytis cinerea B05.10, Botrytis cinerea_B05.10, Botrytis cinerea
# $rdataclass: GRanges, TxDb, OrgDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH65345"]]' 

            title                                             
  AH65345 | Botrytis cinerea B05.10 transcript information    
  AH74039 | Transcript information for Botrytis cinerea B05 10
  AH74678 | Transcript information for Botrytis cinerea B05 10
  AH86754 | org.Botrytis_cinerea_B05.10.eg.sqlite             

> mcols(z)$sourceurl[4]
[1]  "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz"
>

This indicates that we get the information about this organism from NCBI and UniProt. The GO terms come from UniProt, specifically from their idmapping_selected.tab.gz file. And if you go check on things there (using for example BCIN_01g03900 and BCIN_01g06850), you can see that in the latter case UniProt has a section called 'Function' that includes GO mappings, but in the former case it does not.

Unfortunately what we provide for annotations is simply a repackaging of existing data, and we are reliant on others to generate the data that we use for the annotations. It may well be that UniProt is behind the times with their GO annotations, in which case it may be better for you to rely on the biomaRt package instead, if Ensembl has more up-to-date data.

ADD COMMENT • link 3.0 years ago James W. MacDonald 65k

Is it possible to also collect updated information from Ensembl Fungi?

ADD REPLY • link 3.0 years ago Bettina • 0

I'm not sure what you are asking. If you are asking if we can provide data from Ensembl fungi in an OrgDb package, then in general the answer is no. The 'eg' part of the SQLite file name is a short name for Entrez Gene, which is what NCBI Gene IDs used to be called. And it implies that the central key for this package is the NCBI Gene ID, not the Ensembl Gene ID. Changing the underlying schema of the OrgDb packages isn't in the cards, particularly just because of this one package. This is particularly true since the biomaRt package can be used to query for Ensembl Gene IDs, and any other IDs that are appended to those IDs.

ADD REPLY • link 3.0 years ago James W. MacDonald 65k

I found out that the organism is not supported by AnnotationForge. Do you have any idea how I can use the Ensembl data from biomartr to run a GO term analysis, for example with clusterProfiler, the way I would with an OrgDb? Thank you very much!

> available.db0pkgs()
 [1] "anopheles.db0" "arabidopsis.db0" "bovine.db0" "canine.db0" "chicken.db0" "chimp.db0" "ecoliK12.db0" "ecoliSakai.db0"
 [9] "fly.db0" "human.db0" "malaria.db0" "mouse.db0" "pig.db0" "rat.db0" "rhesus.db0" "worm.db0"
[17] "xenopus.db0" "yeast.db0" "zebrafish.db0"

ADD REPLY • link 3.0 years ago Bettina • 0

If all you are trying to do is a regular Fisher's exact test using GO terms, then I would probably use the kegga function in limma. It's intended to do a Fisher's exact test using KEGG terms, but you can use it for GO as well. You would need to provide a gene.pathway argument, which is described in the help page:

Usage:

     ## S3 method for class 'MArrayLM'
     goana(de, coef = ncol(de), geneid = rownames(de), FDR = 0.05, trend = FALSE, ...)
     ## S3 method for class 'MArrayLM'
     kegga(de, coef = ncol(de), geneid = rownames(de), FDR = 0.05, trend = FALSE, ...)
     ## Default S3 method:
     goana(de, universe = NULL, species = "Hs", prior.prob = NULL, covariate=NULL,
           plot=FALSE, ...)
     ## Default S3 method:
     kegga(de, universe = NULL, restrict.universe = FALSE, species = "Hs", species.KEGG = NULL,
           convert = FALSE, gene.pathway = NULL, pathway.names = NULL,
           prior.prob = NULL, covariate=NULL, plot=FALSE, ...)
     getGeneKEGGLinks(species.KEGG = "hsa", convert = FALSE)
     getKEGGPathwayNames(species.KEGG = NULL, remove.qualifier = FALSE)

Arguments:

      de: a character vector of Entrez Gene IDs, or a list of such
          vectors, or an 'MArrayLM' fit object.

    coef: column number or column name specifying for which coefficient
          or contrast differential expression should be assessed.

  geneid: Entrez Gene identifiers. Either a vector of length 'nrow(de)'
          or the name of the column of 'de$genes' containing the Entrez
          Gene IDs.

     FDR: false discovery rate cutoff for differentially expressed
          genes. Numeric value between 0 and 1.

 species: character string specifying the species.  Possible values
          include '"Hs"' (human), '"Mm"' (mouse), '"Rn"' (rat), '"Dm"'
          (fly) or '"Pt"' (chimpanzee), but other values are possible
          if the corresponding organism package is available.  See
          'alias2Symbol' for other possible values.  Ignored if
          'species.KEGG' or is not 'NULL' or if 'gene.pathway' and
          'pathway.names' are not 'NULL'.

species.KEGG: three-letter KEGG species identifier. See <URL:
          http://www.kegg.jp/kegg/catalog/org_list.html> or <URL:
          http://rest.kegg.jp/list/organism> for possible values.
          Alternatively, if 'de' contains KEGG ortholog Ids ('"k00001"'
          etc) instead of gene Ids, then set 'species.KEGG="ko"'.  This
          argument is ignored if 'gene.pathway' and 'pathway.names' are
          both not 'NULL'.

 convert: if 'TRUE' then KEGG gene identifiers will be converted to
          NCBI Entrez Gene identifiers.  Note that KEGG IDs are the
          same as Entrez Gene IDs for most species anyway.

gene.pathway: data.frame linking genes to pathways.  First column gives
          gene IDs, second column gives pathway IDs. By default this is
          obtained automatically by 'getGeneKEGGLinks(species.KEGG)'.

You can get a data.frame that has your IDs in the first column and the GO IDs in the second using biomaRt, and then go from there. The assumption for kegga is that you are using NCBI Gene IDs, but it doesn't have to be that way. Any set of IDs that match what is in the first column of your gene.pathway data.frame will work.
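As a rough illustration of that idea, here is a minimal sketch using toy data in place of a real biomaRt result. The gene IDs, the universe and the DE list are all made up, and the attribute names mentioned in the comment are assumptions you would need to check against your own mart. Supplying both gene.pathway and pathway.names is what keeps kegga from looking anything up at KEGG.

```r
library(limma)

# Toy gene-to-GO table standing in for a biomaRt result (hypothetical values).
# In practice this might come from something like
#   getBM(attributes = c("ensembl_gene_id", "go_id"), mart = mart)
# where the attribute names and the mart are assumptions to verify.
gene2go <- data.frame(
  gene_id = paste0("g", 1:8),
  go_id   = c(rep("GO:0016491", 4), rep("GO:0005576", 3), ""),
  stringsAsFactors = FALSE
)

# Genes without a GO term have an empty go_id here; drop them,
# along with any repeated gene/GO pairs.
gene2go <- unique(gene2go[gene2go$go_id != "", ])

# Second table: GO ID in the first column, a readable name in the second.
go_names <- data.frame(
  go_id = c("GO:0016491", "GO:0005576"),
  term  = c("oxidoreductase activity", "extracellular region"),
  stringsAsFactors = FALSE
)

# Hypothetical background (all genes tested) and DE gene list.
universe <- paste0("g", 1:10)
de_genes <- c("g1", "g2", "g3")

res <- kegga(de_genes, universe = universe,
             gene.pathway = gene2go, pathway.names = go_names)

# One row per GO term: set size (N), DE genes in the set (DE), p-value (P.DE).
res[order(res$P.DE), ]
```

The only requirement is that the IDs in de_genes and universe are the same kind of ID as the first column of gene2go.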

ADD REPLY • link 3.0 years ago James W. MacDonald 65k

Dear James,

Thank you very much for your time and patience.

I would finally like to build an annotation package with AnnotationForge. According to the rules, the first column of each data.frame must be the central gene ID (GID). However, several GIDs map to a single GO_id. When I tried to build the package from my data.frames, I got an error about duplicated rows. Am I missing something?

makeOrgPackage(gene_info=gene_info,
               go=gene2go,
               maintainer = "",
               author = "",
               version="0.0.1",
               outputDir = ".",
               tax_id=tax_id,
               genus=genus,
               species=species,
               goTable="go")
Error in FUN(X[[i]], ...) : data.frames in '...' cannot contain duplicated rows

ADD REPLY • link 3.0 years ago Bettina • 0

You won't be able to build a package with empty maintainer and author fields. In addition, the maintainer field has to include an email address in angle brackets. So it should look like Bettina <bettina.email@gmail.com> or whatever makes sense.

Anyway, the error says the data.frames cannot contain duplicated rows. It doesn't say anything about having duplicated values in one column. Those are different things. As an example, we can use the example code from ?makeOrgPackage:

>      finchFile <- system.file("extdata","finch_info.txt",package="AnnotationForge")
>      finch <- read.table(finchFile,sep="\t")
>      
>      ## not that this is how it should always be, but that it *could* be this way.
>      fSym <- finch[,c(2,3,9)]
>      fSym <- fSym[fSym[,2]!="-",]
>      fSym <- fSym[fSym[,3]!="-",]
>      colnames(fSym) <- c("GID","SYMBOL","GENENAME")
>      
>      fChr <- finch[,c(2,7)]
>      fChr <- fChr[fChr[,2]!="-",]
>      colnames(fChr) <- c("GID","CHROMOSOME")
>      
>      finchGOFile <- system.file("extdata","GO_finch.txt",package="AnnotationForge")
>      fGO <- read.table(finchGOFile,sep="\t")
>      fGO <- fGO[fGO[,2]!="",]
>      fGO <- fGO[fGO[,3]!="",]
>      colnames(fGO) <- c("GID","GO","EVIDENCE")
>      
>      makeOrgPackage(gene_info=fSym, chromosome=fChr, go=fGO,
+                     version="0.1",
+                     maintainer="Some One <so@someplace.org>",
+                     author="Some One <so@someplace.org>",
+                     outputDir = ".",
+                     tax_id="59729",
+                     genus="Taeniopygia",
+                     species="guttata",
+                     goTable="go")
Populating genes table:
genes table filled
Populating gene_info table:
gene_info table filled
Populating chromosome table:
chromosome table filled
Populating go table:
go table filled
table metadata filled

'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_bp table:
go_bp table filled
Populating go_cc table:
go_cc table filled
Populating go_mf table:
go_mf table filled
'select()' returned many:1 mapping between keys and columns
Populating go_bp_all table:
go_bp_all table filled
Populating go_cc_all table:
go_cc_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled
Creating package in ./org.Tguttata.eg.db 
Now deleting temporary database file
[1] "./org.Tguttata.eg.db"

## So that was successful. Now let's try again.
## add some duplicated rows...
> fSym <- rbind(fSym, fSym[1:5,])

>  makeOrgPackage(gene_info=fSym, chromosome=fChr, go=fGO,
                    version="0.1",
                    maintainer="Some One <so@someplace.org>",
                    author="Some One <so@someplace.org>",
                    outputDir = ".",
                    tax_id="59729",
                    genus="Taeniopygia",
                    species="guttata",
                    goTable="go")

Error in FUN(X[[i]], ...) : 
  data.frames in '...' cannot contain duplicated rows
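
To track down rows like that in your own table before calling makeOrgPackage, something along these lines should work. This is only a sketch in base R: the gene2go table below is made up (including the IEA evidence codes) and simply follows the GID/GO/EVIDENCE layout of the finch example.

```r
# Hypothetical GO table in the GID / GO / EVIDENCE layout used above.
gene2go <- data.frame(
  GID      = c("BCIN_01g06850", "BCIN_01g06850",
               "BCIN_01g09430", "BCIN_01g09430"),
  GO       = c("GO:0016491", "GO:0016491", "GO:0005576", "GO:0050525"),
  EVIDENCE = c("IEA", "IEA", "IEA", "IEA"),
  stringsAsFactors = FALSE
)

# duplicated() on a data.frame flags rows that are identical in every column.
dup_rows <- duplicated(gene2go)
sum(dup_rows)                      # number of fully duplicated rows
gene2go[dup_rows, ]                # inspect them

# Keep one copy of each row; repeated GIDs with different GO IDs are retained.
gene2go_clean <- unique(gene2go)
anyDuplicated(gene2go_clean)       # 0 means no duplicated rows remain
any(duplicated(gene2go_clean$GID)) # repeated values in one column are still fine
```

The same check applies to every data.frame you pass in, not just the GO table.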

ADD REPLY • link 3.0 years ago James W. MacDonald 65k