Question

how to use "non-standard" species for KEGG / GO analysis in limma?

3

Entering edit mode

Guido Hooiveld ★ 3.9k

@guido-hooiveld-2020

Last seen 1 day ago

Wageningen University, Wageningen, the …

I am using limma as my standard work horse to analyze microarray data. I noticed that since the last release limma also is able to perform GO and KEGG overrepresentation analyses (functions kegga and goanna). However, input for these functions is currently limited to a few species; from the help page: species: species identifier. Possible values are "Hs", "Mm", "Rn" or "Dm".

Since I regularly analyze data from species other than these four, I wondered whether it would be possible to extend these list of species, ideally allowing any species for which an org.Xx.eg.db and/or KEGG annotations are available. FYI: when using the new AnnotationHub infrastructure apparently more than 1000 OrgDbs are available.

Alternatively, it would be very handy if an additional argument could be provided to goana to specify the 2-letter species identifier, and to kegga to "manually map" the 2-letter species ID to the 3-letter KEGG species identifier. But maybe there are better /easier solutions...

Background: I am currently working with an Affymetrix Chinese Hamster (Cricetulus griseus) dataset; its corresponding OrgDb is available using AnnotationHub, and pathway info for this species is also available at KEGG (thus: "Cg" = "cge").

Thanks,

Guido

> library(AnnotationHub)

> hub = AnnotationHub()

> query(hub, c("OrgDb"))
AnnotationHub with 1019 records
# snapshotDate(): 2015-12-29
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Escherichia coli, Acanthamoeba castellanii_str._Neff, Acanthisit...
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH48006"]]'

            title                        
  AH48006 | org.Camelina_sativa.eg.sqlite
  AH48007 | org.Glycine_max.eg.sqlite    
  AH48008 | org.Malus_domestica.eg.sqlite
  AH48009 | org.Zea_mays.eg.sqlite       
  AH48010 | org.Brassica_rapa.eg.sqlite  
  ...       ...                          
  AH49587 | org.Ce.eg.db.sqlite          
  AH49588 | org.Xl.eg.db.sqlite          
  AH49589 | org.Sc.sgd.db.sqlite         
  AH49590 | org.Dr.eg.db.sqlite          
  AH49591 | org.Pf.plasmo.db.sqlite      
>

> # for Chinese hamster

># AH48061 | org.Cricetulus_griseus.eg.sqlite

> org.Cg.eg.db <- hub[["AH48061"]]
> org.Cg.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Cricetulus griseus
| SPECIES: Cricetulus griseus
| CENTRALID: GID
| Taxonomy ID: 10029
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information
>

limma kegga goanna • 4.7k views

ADD COMMENT • link updated 8.3 years ago by Gordon Smyth 50k • written 8.3 years ago by Guido Hooiveld ★ 3.9k

Gordon Smyth · Accepted Answer · 2016-01-06

5

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

"I wondered whether it would be possible to extend these list of species, ideally allowing any species for which an org.Xx.eg.db and/or KEGG annotations are available."

As far as GO is concerned, that is actually what we have done. The reason why Hs, Mm, Rn and Dm (and now Pt) are the only choices for species is that these are the only org.XX.eg.db organism packages available from Bioconductor.

It would be relatively easy for us to allow kegga() to accept any organism identifier recognized by KEGG.

What you ask for GO is somewhat harder. Note that org.Hs.eg.db is a package containing annotation objects. The object org.Cg.eg.db that you have created is just an object, not a package. It has a somewhat different structure to the packages and would need to be manipulated using different commands. You are essentially asking us to accept any OrgDb object that a user creates. That makes us a bit nervous because we would have little control over what the object contains. It also opens us up to having to support a lot of things that we don't do ourselves. I will have a look and see what we can do.

One day later:

I have now committed limma 3.26.4 to the current Bioconductor release. You can now use kegga() with any KEGG organism, for example:

k <- kegga(fit, species.KEGG="cge")
topKEGG(k)

You can also now use kegga() to do an analysis with your own customized annotation. For example you can use it to do the GO analysis for your model organism. For example:

library(AnnotationHub)
hub = AnnotationHub()
org.Cg.eg.db <- hub[["AH48061"]]
gene.go <- select(org.Cg.eg.db, keys(org.Cg.eg.db), "GOALL")
library(GO.db)
go.names <- select(GO.db, keys(GO.db), "TERM")
k <- kegga(fit, gene.pathway=gene.go, pathway.names=go.names)
topKEGG(k)

One year later:

goana() and kegga() now can be used with any species XX for each an annotation package org.XX.eg.db exists.

ADD COMMENT • link 8.3 years ago • updated 7.3 years ago Gordon Smyth 50k

0

Entering edit mode

FWIW the AnnotationHub OrgDb objects obey the usual interface, e.g.,

> columns(org.Cg.eg.db)
 [1] "ACCNUM"      "ALIAS"       "CHR"         "ENTREZID"    "EVIDENCE"   
 [6] "EVIDENCEALL" "GENENAME"    "GID"         "GO"          "GOALL"      
[11] "ONTOLOGY"    "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"     
> head(keys(org.Cg.eg.db, "GO"))   ## GO identifiers
[1] "GO:0000002" "GO:0000012" "GO:0000014" "GO:0000015" "GO:0000028"
[6] "GO:0000030"
> head(keys(org.Cg.eg.db))   ## Primary keys
[1] "3979178" "3979179" "3979180" "3979181" "3979182" "3979183"
> head(select(org.Cg.eg.db, head(keys(org.Cg.eg.db)), "GO"))
'select()' returned 1:many mapping between keys and columns
      GID         GO
1 3979178 GO:0005739
2 3979178 GO:0008137
3 3979178 GO:0042773
4 3979179 GO:0016021
5 3979179 GO:0031966
6 3979179 GO:0070469

and a little more obscurely

> saveDb(org.Cg.eg.db, t <- tempfile())
> dplyr::src_sqlite(t)
src:  sqlite 3.8.6 [/tmp/RtmpFhFguZ/file2c7d5e1856ef]
tbls: accessions, alias, chromosomes, entrez_genes, gene_info, genes, go,
  go_all, map_counts, map_metadata, metadata, pubmed, refseq

-- these are OrgDb packages without the package. It's true that the annotations can be quite sketchy for many of these less common organisms.

ADD REPLY • link 8.3 years ago Martin Morgan 25k

0

Entering edit mode

Thanks Gordon; I understand your reasoning. I asked because I thought the AnnotationHub OrgDb objects had the same structure as the packages directly made available at Bioconductor.

ADD REPLY • link 8.3 years ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

I've now modified kegga() so you can use the GEO info from AnnotationHub, see above.

ADD REPLY • link 8.3 years ago Gordon Smyth 50k

0

Entering edit mode

I would like to add that the code above for the GO analyses only works if the data frame gene.go only contains genes that are all annotated with a GO term; if that is not the case (i.e. some gene don't have a GO annotation) the function kegga() stops with a rather cryptic error (see below).
This can be solved by removing all genes from gene.go that don't have a GO annotation. Maybe useful to add this check to the function kegga as well? [since even some human/mouse/rat genes don't have a GO annotation].

Please note that in my hands KEGG-based analyses always work fine!

> gene.go <- select(org.Cg.eg.db, keys(org.Cg.eg.db), c("GO","EVIDENCE","ONTOLOGY"))

#  this is a small modification to the code from Gordon: 'evidence' and 'ontology' are not required
# for kegga, but are nevertheless automatically returned when using human or mouse OrgDb.
# This is not the case for OrgCg, hence I manually add these to the output as well

> k <- kegga(fit3, coef=1, gene.pathway=gene.go, pathway.names=go.names)
Error in data.frame(Pathway = PathID.PathName[m, 2], S, P, stringsAsFactors = FALSE) :
  row names contain missing values
In addition: Warning message:
In rowsum.default(X, group = EG.KEGG[, 2], reorder = FALSE) :
  missing values for 'group'
>


> gene.go <- gene.go[!is.na(gene.go[, 2]),] # remove genes w/o annotation

> k <- kegga(fit3, coef=1, gene.pathway=gene.go, pathway.names=go.names)
#Works now!

> topKEGG(k, sort="up")
                                                                Pathway    N
GO:0008270                                             zinc ion binding  319
GO:0016021                               integral component of membrane 1198
GO:0046872                                            metal ion binding  328
GO:0005509                                          calcium ion binding  134
GO:0005622                                                intracellular  261
GO:0005634                                                      nucleus  501
GO:0005524                                                  ATP binding  365
GO:0003700 transcription factor activity, sequence-specific DNA binding  145
GO:0006351                                 transcription, DNA-templated  139
GO:0005737                                                    cytoplasm  219
GO:0003676                                         nucleic acid binding  287
GO:0016874                                              ligase activity   71
GO:0005525                                                  GTP binding  157
GO:0005975                               carbohydrate metabolic process   45
GO:0043565                                sequence-specific DNA binding  113
GO:0035556                            intracellular signal transduction   48
GO:0004871                                   signal transducer activity   41
GO:0030335                        positive regulation of cell migration    4
GO:0016020                                                     membrane  172
GO:0006486                                        protein glycosylation   36
            Up Down         P.Up       P.Down
GO:0008270  53   68 2.219773e-21 3.061412e-22
GO:0016021 109   85 1.661735e-20 1.203372e-02
GO:0046872  50   81 1.406914e-18 5.409508e-31
GO:0005509  29    6 2.404410e-15 7.631627e-01
GO:0005622  40   48 3.576527e-15 1.830593e-13
GO:0005634  56  139 1.674048e-14 5.784554e-60
GO:0005524  45  124 2.256955e-13 1.367218e-64
GO:0003700  24   23 2.477780e-10 5.194168e-06
GO:0006351  22   36 3.310297e-09 3.922423e-15
GO:0005737  27   80 1.534215e-08 2.145484e-44
GO:0003676  30   80 1.087000e-07 1.202599e-34
GO:0016874  14   20 1.458036e-07 1.017204e-09
GO:0005525  21   29 1.520062e-07 1.018438e-08
GO:0005975  11    6 3.203673e-07 3.745304e-02
GO:0043565  17   11 4.272635e-07 5.011862e-02
GO:0035556  11    9 6.478628e-07 1.166716e-03
GO:0004871  10    5 1.132683e-06 7.582424e-02
GO:0030335   4    0 1.519218e-06 1.000000e+00
GO:0016020  20   22 2.827689e-06 2.311119e-04
GO:0006486   9    5 3.137814e-06 4.771547e-02
>

ADD REPLY • link 8.3 years ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

Hi Guido. Thanks for the bug report. I have fixed this in limma 3.26.7.

ADD REPLY • link 8.3 years ago Gordon Smyth 50k

0

Entering edit mode

Hello,

I am trying to use C. elegans (Ce) org.Ce.eg.db for GO enrichment analysis. Even though it's an official package, and documentation states that it should work with goana - it doesn't.

Citing https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/goana?:

"goana uses annotation from the appropriate Bioconductor organism package. The species can be any character string XX for which an organism package org.XX.eg.db exists and is installed. See alias2Symbol for other possible values for species."

Running the function with 'Ce' results in:

go12<-goana(lrt12,species='Ce',geneid='EntrezGene',convert=TRUE,FDR=0.05)
Error in match.arg(species, c("Hs", "Mm", "Rn", "Dm", "Pt")) : 
  'arg' should be one of “Hs”, “Mm”, “Rn”, “Dm”, “Pt”

When I was debugging the function I have found this:

species <- match.arg(species, c("Hs", "Mm", "Rn", "Dm", "Pt"))

The available species list is hardwired, regardless of the installed .db packages. Is there a specific reason for that or is it a bug?

Best,

Povilas Norvaisas

ADD REPLY • link updated 7.3 years ago by Gordon Smyth 50k • written 7.3 years ago by Povilas Norvaisas • 0

0

Entering edit mode

You are obviously using a version of limma prior to version 3.28.12. Clearly you can't expect to use the latest features of the package if you don't install the current version.

Try typing sessionInfo().

In general, you need to read the documentation that comes with the version of limma that you are using (by typing ?goana), rather than reading documentation for a random version from the web. The documentation page you link to is neither the latest nor for the version you are using.

PS. Just as a hint for the future, it would have been better to have posted this as a new question rather than as comment on an old thread. There is nothing in the old thread to indicate that goana() should work with Ce, quite the opposite, the thread explained why goana() didn't work for Ce at the time.

ADD REPLY • link 7.3 years ago Gordon Smyth 50k