I am using limma as my standard work horse to analyze microarray data. I noticed that since the last release limma also is able to perform GO and KEGG overrepresentation analyses (functions kegga and goanna). However, input for these functions is currently limited to a few species; from the help page: species: species identifier. Possible values are "Hs", "Mm", "Rn" or "Dm".
Since I regularly analyze data from species other than these four, I wondered whether it would be possible to extend these list of species, ideally allowing any species for which an org.Xx.eg.db and/or KEGG annotations are available. FYI: when using the new AnnotationHub infrastructure apparently more than 1000 OrgDbs are available.
Alternatively, it would be very handy if an additional argument could be provided to goana to specify the 2-letter species identifier, and to kegga to "manually map" the 2-letter species ID to the 3-letter KEGG species identifier. But maybe there are better /easier solutions...
Background: I am currently working with an Affymetrix Chinese Hamster (Cricetulus griseus) dataset; its corresponding OrgDb is available using AnnotationHub, and pathway info for this species is also available at KEGG (thus: "Cg" = "cge").
Thanks,
Guido
> library(AnnotationHub)
> hub = AnnotationHub()
> query(hub, c("OrgDb"))
AnnotationHub with 1019 records
# snapshotDate(): 2015-12-29
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Escherichia coli, Acanthamoeba castellanii_str._Neff, Acanthisit...
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
# sourcetype
# retrieve records with, e.g., 'object[["AH48006"]]'
title
AH48006 | org.Camelina_sativa.eg.sqlite
AH48007 | org.Glycine_max.eg.sqlite
AH48008 | org.Malus_domestica.eg.sqlite
AH48009 | org.Zea_mays.eg.sqlite
AH48010 | org.Brassica_rapa.eg.sqlite
... ...
AH49587 | org.Ce.eg.db.sqlite
AH49588 | org.Xl.eg.db.sqlite
AH49589 | org.Sc.sgd.db.sqlite
AH49590 | org.Dr.eg.db.sqlite
AH49591 | org.Pf.plasmo.db.sqlite
>
> # for Chinese hamster
># AH48061 | org.Cricetulus_griseus.eg.sqlite
> org.Cg.eg.db <- hub[["AH48061"]]
> org.Cg.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Cricetulus griseus
| SPECIES: Cricetulus griseus
| CENTRALID: GID
| Taxonomy ID: 10029
| Db type: OrgDb
| Supporting package: AnnotationDbi
Please see: help('select') for usage information
>

FWIW the AnnotationHub OrgDb objects obey the usual interface, e.g.,
> columns(org.Cg.eg.db) [1] "ACCNUM" "ALIAS" "CHR" "ENTREZID" "EVIDENCE" [6] "EVIDENCEALL" "GENENAME" "GID" "GO" "GOALL" [11] "ONTOLOGY" "ONTOLOGYALL" "PMID" "REFSEQ" "SYMBOL" > head(keys(org.Cg.eg.db, "GO")) ## GO identifiers [1] "GO:0000002" "GO:0000012" "GO:0000014" "GO:0000015" "GO:0000028" [6] "GO:0000030" > head(keys(org.Cg.eg.db)) ## Primary keys [1] "3979178" "3979179" "3979180" "3979181" "3979182" "3979183" > head(select(org.Cg.eg.db, head(keys(org.Cg.eg.db)), "GO")) 'select()' returned 1:many mapping between keys and columns GID GO 1 3979178 GO:0005739 2 3979178 GO:0008137 3 3979178 GO:0042773 4 3979179 GO:0016021 5 3979179 GO:0031966 6 3979179 GO:0070469and a little more obscurely
-- these are OrgDb packages without the package. It's true that the annotations can be quite sketchy for many of these less common organisms.
Thanks Gordon; I understand your reasoning. I asked because I thought the AnnotationHub OrgDb objects had the same structure as the packages directly made available at Bioconductor.
I've now modified kegga() so you can use the GEO info from AnnotationHub, see above.
I would like to add that the code above for the GO analyses only works if the data frame
gene.goonly contains genes that are all annotated with a GO term; if that is not the case (i.e. some gene don't have a GO annotation) the functionkegga()stops with a rather cryptic error (see below).This can be solved by removing all genes from
gene.gothat don't have a GO annotation. Maybe useful to add this check to the functionkeggaas well? [since even some human/mouse/rat genes don't have a GO annotation].Please note that in my hands KEGG-based analyses always work fine!
> gene.go <- select(org.Cg.eg.db, keys(org.Cg.eg.db), c("GO","EVIDENCE","ONTOLOGY")) # this is a small modification to the code from Gordon: 'evidence' and 'ontology' are not required # for kegga, but are nevertheless automatically returned when using human or mouse OrgDb. # This is not the case for OrgCg, hence I manually add these to the output as well> k <- kegga(fit3, coef=1, gene.pathway=gene.go, pathway.names=go.names) Error in data.frame(Pathway = PathID.PathName[m, 2], S, P, stringsAsFactors = FALSE) : row names contain missing values In addition: Warning message: In rowsum.default(X, group = EG.KEGG[, 2], reorder = FALSE) : missing values for 'group' > > gene.go <- gene.go[!is.na(gene.go[, 2]),] # remove genes w/o annotation > k <- kegga(fit3, coef=1, gene.pathway=gene.go, pathway.names=go.names) #Works now! > topKEGG(k, sort="up") Pathway N GO:0008270 zinc ion binding 319 GO:0016021 integral component of membrane 1198 GO:0046872 metal ion binding 328 GO:0005509 calcium ion binding 134 GO:0005622 intracellular 261 GO:0005634 nucleus 501 GO:0005524 ATP binding 365 GO:0003700 transcription factor activity, sequence-specific DNA binding 145 GO:0006351 transcription, DNA-templated 139 GO:0005737 cytoplasm 219 GO:0003676 nucleic acid binding 287 GO:0016874 ligase activity 71 GO:0005525 GTP binding 157 GO:0005975 carbohydrate metabolic process 45 GO:0043565 sequence-specific DNA binding 113 GO:0035556 intracellular signal transduction 48 GO:0004871 signal transducer activity 41 GO:0030335 positive regulation of cell migration 4 GO:0016020 membrane 172 GO:0006486 protein glycosylation 36 Up Down P.Up P.Down GO:0008270 53 68 2.219773e-21 3.061412e-22 GO:0016021 109 85 1.661735e-20 1.203372e-02 GO:0046872 50 81 1.406914e-18 5.409508e-31 GO:0005509 29 6 2.404410e-15 7.631627e-01 GO:0005622 40 48 3.576527e-15 1.830593e-13 GO:0005634 56 139 1.674048e-14 5.784554e-60 GO:0005524 45 124 2.256955e-13 1.367218e-64 GO:0003700 24 23 2.477780e-10 5.194168e-06 GO:0006351 22 36 3.310297e-09 3.922423e-15 GO:0005737 27 80 1.534215e-08 2.145484e-44 GO:0003676 30 80 1.087000e-07 1.202599e-34 GO:0016874 14 20 1.458036e-07 1.017204e-09 GO:0005525 21 29 1.520062e-07 1.018438e-08 GO:0005975 11 6 3.203673e-07 3.745304e-02 GO:0043565 17 11 4.272635e-07 5.011862e-02 GO:0035556 11 9 6.478628e-07 1.166716e-03 GO:0004871 10 5 1.132683e-06 7.582424e-02 GO:0030335 4 0 1.519218e-06 1.000000e+00 GO:0016020 20 22 2.827689e-06 2.311119e-04 GO:0006486 9 5 3.137814e-06 4.771547e-02 >Hi Guido. Thanks for the bug report. I have fixed this in limma 3.26.7.
Hello,
I am trying to use C. elegans (Ce) org.Ce.eg.db for GO enrichment analysis. Even though it's an official package, and documentation states that it should work with goana - it doesn't.
Citing https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/goana?:
"goanauses annotation from the appropriate Bioconductor organism package. Thespeciescan be any character string XX for which an organism package org.XX.eg.db exists and is installed. Seealias2Symbolfor other possible values forspecies."Running the function with 'Ce' results in:
go12<-goana(lrt12,species='Ce',geneid='EntrezGene',convert=TRUE,FDR=0.05) Error in match.arg(species, c("Hs", "Mm", "Rn", "Dm", "Pt")) : 'arg' should be one of “Hs”, “Mm”, “Rn”, “Dm”, “Pt”When I was debugging the function I have found this:
species <- match.arg(species, c("Hs", "Mm", "Rn", "Dm", "Pt"))The available species list is hardwired, regardless of the installed .db packages. Is there a specific reason for that or is it a bug?
Best,
Povilas Norvaisas
You are obviously using a version of limma prior to version 3.28.12. Clearly you can't expect to use the latest features of the package if you don't install the current version.
Try typing sessionInfo().
In general, you need to read the documentation that comes with the version of limma that you are using (by typing ?goana), rather than reading documentation for a random version from the web. The documentation page you link to is neither the latest nor for the version you are using.
PS. Just as a hint for the future, it would have been better to have posted this as a new question rather than as comment on an old thread. There is nothing in the old thread to indicate that goana() should work with Ce, quite the opposite, the thread explained why goana() didn't work for Ce at the time.