Question

Gene ontology and pathway analysis in R - replicate ToppGene features; best strategy for getting annotations?

Darya Vanichkina ▴ 120

@darya-vanichkina-6050

Last seen 7.2 years ago

Australia/Centenary Institute Universit…

I have been reading several vignettes for the wonderful topGO, GOexpress, GAGE and other tools, and realised that I have a far more fundamental question:

What are the best practices for carrying out a gene ontology, pathway enrichment, gene set, or similar analysis in R?

Now, I know there are many ways of doing each of these (and that they are different!), but the basic outcome I am trying to achieve is to replicate the results of ToppGene ToppFun analysis - but using R, and with species other than human.

The things I'm looking to get, in order of importance (see the sample ToppGene output if any of these are not clear), are:

- GO: MF, BP, CC

- pathway

- gene family

- coexpression

- coexpression atlas

- Domain (if possible)

- TFBS

- miRNA regulation

- Interaction

- human/mouse phenotype

- disease prediction

- Pubmed IDs linked

My input is a standard list of differentially expressed genes, with a background of all genes expressed above a certain cutoff in the dataset; I'm using gencode ENSMG gene identifiers.

The database that ToppGene uses comprises the following resources: https://toppgene.cchmc.org/navigation/database.jsp

What is the best way to pull out as many similar annotations as I can using biomaRt or other tools? At the moment, even trying to pull out "ensembl_gene_id", "go_id", "name_1006", "definition_1006", "namespace_1003", "goslim_goa_description", "goslim_goa_accession", "go_linkage_type" in a single query fails with errors, although I can get each of these pairs (gene_id: parameter) out individually and then combine them (so it works, but it is very, very inefficient in terms of how much code needs to be written and run sequentially).

Thanks in advance!

topgo biomart gene ontology goexpress pathway analysis • 3.6k views

ADD COMMENT • link updated 7.9 years ago by kevin.rue ▴ 350 • written 7.9 years ago by Darya Vanichkina ▴ 120

Not a direct answer to your question, so I post as comment:

Have you tried goseq yet? (you haven't listed it). Similarly to GOexpress (or rather the other way around ^^), it supports many organisms, and provides direct links to GO, while offering an interface for custom annotations. See section "Non-native Gene Identifier or category test" in http://bioconductor.org/packages/release/bioc/vignettes/goseq/inst/doc/goseq.pdf

I haven't re-checked many other packages recently, but the "custom categories" feature in those two is close to my heart.

Best wishes,
Kevin

ADD REPLY • link 7.9 years ago kevin.rue ▴ 350

score 0 · Answer 1 · 2016-05-18

Dear Darya,

Please read with caution and criticism, as my main focus is generally on GO and Pathways Analysis.

Best practice (imo)

Personally, I would say that "get[ting] each of these pairs (gene_id: parameter) out individually" is probably best practice, unless you truly use them all in a single analysis (combining all annotation types together somehow). If you are currently combining them solely to have a single object holding all annotation types, note that the biomaRt developers have good reason, from a database/table perspective, to restrict queries to a single feature_page. As you found out, merging all those pieces of information is relatively inefficient and (I would expect) more memory-hungry than separate tables. I would genuinely be curious to hear the difference between the memory size of the merged table and the sum of all the individual tables.

In any case, if you wish to run a separate analysis for each type of information, I definitely recommend writing a separate script for each, all sharing the format {download info - carry out analysis - write results}. The added value is that, once you have a pipeline for each type of data, you are relatively close to an R package: you could wrap each of those pipelines in a master function "analyse(data, pipeline, ...)" and contribute it to Bioconductor/CRAN to guide other people interested in any of those annotation types.
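
To make the {download info - carry out analysis - write results} layout more concrete, here is a minimal base R sketch of what such a master function might look like. Everything in it is an assumption for illustration only: the toy annotation tables stand in for separate biomaRt downloads, the gene and term names are made up, the helper names (get_annotation, enrich) are hypothetical, and the one-sided hypergeometric test is just one common choice of over-representation test, not what ToppGene or any of the packages above necessarily does.

```r
# Toy annotation tables, one per annotation type. These are hypothetical data
# standing in for separate downloads (e.g. one biomaRt query per type).
annotations <- list(
  go = data.frame(
    gene = c("g1", "g2", "g3", "g4", "g5", "g6"),
    term = c("GO:A", "GO:A", "GO:A", "GO:B", "GO:B", "GO:A"),
    stringsAsFactors = FALSE
  ),
  pathway = data.frame(
    gene = c("g1", "g2", "g7", "g8"),
    term = c("P1", "P1", "P2", "P2"),
    stringsAsFactors = FALSE
  )
)

# Step 1, "download info": here simply a lookup of the toy table by name.
get_annotation <- function(pipeline) {
  annotations[[pipeline]]
}

# Step 2, "carry out analysis": one-sided hypergeometric test per term,
# restricted to the background of expressed genes.
enrich <- function(de_genes, background, annotation) {
  annotation <- annotation[annotation$gene %in% background, ]
  de_genes <- intersect(de_genes, background)
  terms <- split(annotation$gene, annotation$term)
  res <- do.call(rbind, lapply(names(terms), function(term) {
    members <- unique(terms[[term]])
    k <- length(intersect(members, de_genes))  # DE genes annotated to the term
    # P(X >= k) when drawing length(de_genes) genes from the background
    p <- phyper(k - 1, length(members),
                length(background) - length(members),
                length(de_genes), lower.tail = FALSE)
    data.frame(term = term, in_term = length(members), de_in_term = k,
               p_value = p, stringsAsFactors = FALSE)
  }))
  res$p_adjusted <- p.adjust(res$p_value, method = "BH")
  res[order(res$p_value), ]
}

# Step 3, "write results", wrapped together with steps 1 and 2.
analyse <- function(de_genes, background, pipeline, outfile = NULL) {
  res <- enrich(de_genes, background, get_annotation(pipeline))
  if (!is.null(outfile)) write.csv(res, outfile, row.names = FALSE)
  res
}

# Usage: the same DE list and background, one result table per annotation type.
background <- paste0("g", 1:10)
de_genes <- c("g1", "g2", "g3")
results <- lapply(c(go = "go", pathway = "pathway"),
                  function(p) analyse(de_genes, background, p))
```

Each annotation type stays in its own table and gets its own result, so nothing has to be merged into one large object. In a real pipeline only get_annotation would change per annotation type; the sketch does not handle an annotation table with no terms left after filtering to the background.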

What is the best way to pull out as many similar annotations

I cannot remember whether they cover all of your annotation types, but I think you should check out the Downloads page of:

I hope that helps,

Best wishes,
Kevin