Question: Gene ontology and pathway analysis in R - replicate ToppGene features; best strategy for getting annotations?
2.6 years ago by
Australia/Centenary Institute University of Sydney
Darya Vanichkina100 wrote:

I have been reading several vignettes for the wonderful topGO, GOexpress, GAGE and other tools, and realised that I have a far more fundamental question:

What are the best practices for carrying out gene ontology, pathway enrichment, gene set, etc. analyses in R?

Now, I know there are many ways of doing each of these (and that they are different!), but the basic outcome I am trying to achieve is to replicate the results of ToppGene ToppFun analysis - but using R, and with species other than human.

The things I'm looking to get, in order of importance (see the sample ToppGene output if any of these are not clear), are:

- GO: MF, BP, CC

- pathway

- gene family

- coexpression

- coexpression atlas

- Domain (if possible)

- TFBS

- miRNA regulation

- Interaction

- human/mouse phenotype

- disease prediction

My input is a standard list of differentially expressed genes, with a background of all genes expressed above a certain cutoff in the dataset; I'm using GENCODE ENSMUSG gene identifiers.

The database that ToppGene uses draws on the following resources: https://toppgene.cchmc.org/navigation/database.jsp

What is the best way to pull out as many similar annotations as I can using biomaRt or other tools? At the moment, even a single query for "ensembl_gene_id", "go_id", "name_1006", "definition_1006", "namespace_1003", "goslim_goa_description", "goslim_goa_accession" and "go_linkage_type" fails with errors, although I can get each of these pairs (gene_id: parameter) out individually and then combine them (so it works, but it is very, very inefficient in terms of how much code needs to be written and run sequentially).
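For readers following along, the pair-by-pair approach described above can be written fairly compactly. The following is only a sketch: `fetch_annotation_tables`, `attribute_groups` and `fake_fetch` are hypothetical names introduced here, the attribute names are the ones quoted in the question, and the mouse dataset in the comment is an assumption. The stand-in fetcher lets the example run without biomaRt or a network connection.

```r
# Hypothetical helper: one small query per attribute group, each keyed on the gene ID.
# `fetch` is any function(attributes, values) that returns a data.frame.
fetch_annotation_tables <- function(fetch, gene_ids, attribute_groups,
                                    key = "ensembl_gene_id") {
  lapply(attribute_groups, function(attrs) {
    unique(fetch(attributes = c(key, attrs), values = gene_ids))
  })
}

# Attribute names from the question, grouped by what they describe.
attribute_groups <- list(
  go      = c("go_id", "name_1006", "definition_1006",
              "namespace_1003", "go_linkage_type"),
  go_slim = c("goslim_goa_accession", "goslim_goa_description")
)

# Offline stand-in for a BioMart query (fabricated placeholder values only).
fake_fetch <- function(attributes, values) {
  tab <- data.frame(ensembl_gene_id = rep(values, each = 2),
                    stringsAsFactors = FALSE)
  for (a in setdiff(attributes, "ensembl_gene_id")) {
    tab[[a]] <- paste(a, seq_len(nrow(tab)), sep = "_")
  }
  tab
}

# With biomaRt installed and a network connection, `fetch` could instead be
# (dataset name is an assumption for mouse):
#   mart  <- biomaRt::useMart("ensembl", dataset = "mmusculus_gene_ensembl")
#   fetch <- function(attributes, values)
#     biomaRt::getBM(attributes = attributes, filters = "ensembl_gene_id",
#                    values = values, mart = mart)

tables <- fetch_annotation_tables(fake_fetch, c("geneA", "geneB"), attribute_groups)
str(tables, max.level = 1)
```

The result is a named list with one table per annotation type, which avoids building a single very wide merged table.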

modified 2.6 years ago by kevin.rue220 • written 2.6 years ago by Darya Vanichkina100

Not a direct answer to your question, so I post as comment:

Have you tried goseq yet? (You haven't listed it.) Similarly to GOexpress (or rather the other way around ^^), it supports many organisms and provides direct links to GO, while offering an interface for custom annotations. See the section "Non-native Gene Identifier or category test" in http://bioconductor.org/packages/release/bioc/vignettes/goseq/inst/doc/goseq.pdf

I haven't re-checked many other packages recently, but the "custom categories" feature in those two is close to my heart.
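To illustrate what a "custom categories" test amounts to, here is a minimal base-R sketch of an over-representation test against a user-supplied gene-to-category mapping. It is not goseq's method (goseq additionally corrects for length bias); it is a plain hypergeometric test, and `enrich_custom`, `gene2cat` and the toy gene IDs are hypothetical names made up for the example.

```r
# Over-representation test for custom categories (plain hypergeometric test).
# de_genes:   character vector of differentially expressed gene IDs
# background: character vector of all expressed genes
# gene2cat:   named list, category name -> character vector of member gene IDs
enrich_custom <- function(de_genes, background, gene2cat) {
  background <- unique(background)
  de_genes   <- intersect(unique(de_genes), background)
  res <- lapply(names(gene2cat), function(category_name) {
    members <- intersect(gene2cat[[category_name]], background)
    k <- length(intersect(members, de_genes))  # DE genes in the category
    m <- length(members)                       # category genes in the background
    n <- length(background) - m                # background genes outside the category
    # P(X >= k) when drawing length(de_genes) genes from the background
    p <- phyper(k - 1, m, n, length(de_genes), lower.tail = FALSE)
    data.frame(category = category_name, in_category = m,
               de_in_category = k, p_value = p, stringsAsFactors = FALSE)
  })
  res <- do.call(rbind, res)
  res$p_adjusted <- p.adjust(res$p_value, method = "BH")
  res[order(res$p_value), ]
}

# Toy data: 20 background genes, 5 DE genes, two made-up categories.
background <- paste0("g", 1:20)
de_genes   <- paste0("g", 1:5)
gene2cat   <- list(setA = paste0("g", 1:4), setB = paste0("g", 10:15))

result <- enrich_custom(de_genes, background, gene2cat)
print(result)
```

Because the categories are just a named list, the same function works for any annotation type (pathways, gene families, TFBS targets) once a mapping has been obtained.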

Best wishes,
Kevin

2.6 years ago by
kevin.rue220
University of Oxford
kevin.rue220 wrote:

Dear Darya,

Please read with caution and criticism, as my main focus is generally on GO and Pathways Analysis.

Best practice (imo)

Personally, I would say that "get[ting] each of these pairs (gene_id: parameter) out individually" is probably best practice, unless you truly use them all in a single analysis (combining all annotation types together somehow). However, if you are currently combining them solely for the purpose of having a single object with all types of annotations, I would point out that the biomaRt developers have good reason to restrict queries to a single feature_page, from a database/table perspective. As you found out, merging all those pieces of information is relatively inefficient, and (I would expect) more memory-hungry than separate tables (I would genuinely be curious to hear from you the difference between the memory size of the merged table and the sum of all individual tables).

In any case, if you wish to run a separate analysis for each type of information, I definitely recommend writing a different script for each, all sharing the format {download info - carry out analysis - write results}. The added value of this is that, if you establish a pipeline for each type of data, you are getting relatively close to an R package in which you could wrap each of those pipelines in a master function "analyse(data, pipeline, ...)" that you could contribute to Bioconductor/CRAN to guide other people interested in any of those annotation types.
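As a rough illustration of the {download info - carry out analysis - write results} format and the "analyse(data, pipeline, ...)" wrapper, here is a minimal base-R sketch. Everything other than the name `analyse` is hypothetical (`make_pipeline`, `toy_go`, the toy annotation table), and the download step is a stand-in rather than a real database query.

```r
# Each pipeline is a list of three functions sharing one format:
# download info -> carry out analysis -> write results.
make_pipeline <- function(download, analyse, write) {
  list(download = download, analyse = analyse, write = write)
}

# Master function: runs the three steps of whichever pipeline it is given.
analyse <- function(data, pipeline, ...) {
  annotation <- pipeline$download(data, ...)
  results    <- pipeline$analyse(data, annotation, ...)
  pipeline$write(results, ...)
  invisible(results)
}

# Toy pipeline with a made-up gene-to-term table.
toy_go <- make_pipeline(
  download = function(data, ...) {
    # Stand-in for a real annotation download (e.g. a BioMart query).
    data.frame(gene = c("g1", "g2", "g3"), term = c("T1", "T1", "T2"),
               stringsAsFactors = FALSE)
  },
  analyse = function(data, annotation, ...) {
    # Count DE genes per term; a real pipeline would run an enrichment test here.
    hits <- annotation[annotation$gene %in% data$de_genes, ]
    as.data.frame(table(term = hits$term), responseName = "n_de_genes",
                  stringsAsFactors = FALSE)
  },
  write = function(results, file = tempfile(fileext = ".csv"), ...) {
    write.csv(results, file, row.names = FALSE)
  }
)

out <- analyse(list(de_genes = c("g1", "g2")), toy_go)
print(out)
```

Adding another annotation type then means writing one more `make_pipeline(...)` call, without touching the master function.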

What is the best way to pull out as many similar annotations

I cannot remember whether they cover all of your annotation types, but I think you should check out the Downloads page of:

I hope that helps,

Best wishes,
Kevin

Hi Kevin,

I realised as I wrote my question, I think, that my main problem was how do I build a really awesome database of everything that I want to test for (and that ToppGene tests for me) in R...

The challenge with DAVID is that, I believe, the annotations are quite old and not routinely updated. So you're relying on annotations from 9 (nine!) years ago - given the massively rapid pace of science, you're probably missing 80%+ of what we know about the different genes.

Not sure about Panther. And what about all of the other sources of data, such as TFBS? I'm amazed that there are web resources like ToppGene, but no one has put together anything remotely similar for R (which is what it seems like you're telling me).