I have been reading several vignettes for the wonderful topGO, GOexpress, GAGE and other tools, and realised that I have a far more fundamental question:
What are the best practices for carrying out a gene ontology, pathway enrichment, gene set or similar analysis in R?
Now, I know there are many ways of doing each of these (and that they are different!), but the basic outcome I am trying to achieve is to replicate the results of ToppGene ToppFun analysis - but using R, and with species other than human.
The things I'm looking to get, in order of importance (see the sample ToppGene output if any of these are not clear), are:
- GO: MF, BP, CC
- gene family
- coexpression atlas
- Domain (if possible)
- miRNA regulation
- human/mouse phenotype
- disease prediction
- linked PubMed IDs
My input is a standard list of differentially expressed genes, with a background of all genes expressed above a certain cutoff in the dataset; I'm using GENCODE ENSMUSG (mouse Ensembl) gene identifiers.
ToppGene draws on the resources listed here: https://toppgene.cchmc.org/navigation/database.jsp
What is the best way to pull out as many similar annotations as I can using biomaRt or other tools? At the moment, even a single query for "ensembl_gene_id", "go_id", "name_1006", "definition_1006", "namespace_1003", "goslim_goa_description", "goslim_goa_accession", "go_linkage_type" fails with errors. I can retrieve each attribute individually (as a gene_id: attribute pair) and then combine the results, so it works, but it is very, very inefficient in terms of how much code needs to be written and run sequentially.
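For reference, here is a minimal sketch of the one-attribute-at-a-time workaround described above. It is not a recommended solution, only an illustration of the pattern. The helper names `strip_version` and `fetch_go_pairs` are hypothetical, and the mouse dataset name and the example gene IDs in the commented call are assumptions. The attribute names are the ones listed above, and `getBM()` and `useEnsembl()` are the standard biomaRt functions.

```r
library(biomaRt)

# GO-related attributes from the question, requested one at a time.
go_attrs <- c("go_id", "name_1006", "definition_1006", "namespace_1003",
              "goslim_goa_accession", "goslim_goa_description",
              "go_linkage_type")

# Hypothetical helper: GENCODE IDs often carry a version suffix
# (e.g. "ENSMUSG00000064341.1"); the "ensembl_gene_id" filter expects
# the unversioned form, so drop everything from the first dot onwards.
strip_version <- function(ids) sub("\\..*$", "", ids)

# Hypothetical helper: fetch one (gene, attribute) table per attribute
# and return the tables as a named list, one element per attribute.
fetch_go_pairs <- function(genes, mart, attrs = go_attrs) {
  pairs <- lapply(attrs, function(attr) {
    getBM(attributes = c("ensembl_gene_id", attr),
          filters    = "ensembl_gene_id",
          values     = strip_version(genes),
          mart       = mart)
  })
  names(pairs) <- attrs
  pairs
}

# Example call (needs network access to Ensembl; the dataset and the
# gene IDs are assumptions for illustration only):
# mart  <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
# pairs <- fetch_go_pairs(c("ENSMUSG00000064341.1", "ENSMUSG00000064345.1"), mart)
```

The tables are kept as a list rather than merged, because each one is keyed only by gene ID: merging, say, the `go_id` and `name_1006` tables on `ensembl_gene_id` alone would pair every GO ID of a gene with every term name of that gene.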
Thanks in advance!