Question: biomaRt GO-ID retrieval showing different genes than AmiGO
snamjoshi87 wrote (11 months ago):

I am trying to retrieve all genes that match a particular GO-ID using biomaRt:

library(biomaRt)

ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

goGenes <- getBM(attributes = c("mgi_symbol", "go_id"),
                 filters = "go_id",
                 values = "GO:0098793",
                 mart = ensembl)

nrow(goGenes)

This returns a value of 53. However, if you look at the AmiGO page for this GO term and filter for M. musculus, you see that there are actually 779 genes (384 when you remove duplicated MGI symbols).

For a second GO term, the AmiGO page shows 591 genes after duplicates are removed, but running the query above with that term returns 0 genes.

What am I doing wrong here? Why don't the numbers match up?

Mike Smith (EMBL Heidelberg / de.NBI) wrote (11 months ago):

This isn't a problem with the biomaRt package per se: you get the same values when you query Ensembl BioMart directly.

My instinct is that this query returns only genes that are directly annotated with that GO term. It won't find anything assigned to a child term. Is the same true for AmiGO? A brief look makes me think the list of genes on that site includes those from child nodes in the ontology.

Does the overlap improve if you query Ensembl with the parent term? e.g.

goGenes <- getBM(attributes = c("mgi_symbol", "go_id"),
                 filters = "go_parent_term",
                 values = "GO:0098793",
                 mart = ensembl)
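
One way to check whether the overlap improves is to compare the unique gene symbols from the two queries. The following is only a sketch: compareGoResults is a hypothetical helper (not part of biomaRt), and it assumes both data frames have an mgi_symbol column in which missing symbols appear as empty strings or NA.

```r
# Hypothetical helper: compare gene symbols from a direct "go_id" query
# with those from a "go_parent_term" query.
compareGoResults <- function(direct, withChildren) {
  # Drop duplicated, empty, and missing symbols before counting.
  cleanSymbols <- function(df) {
    s <- unique(as.character(df$mgi_symbol))
    s[!is.na(s) & s != ""]
  }
  directSymbols <- cleanSymbols(direct)
  childSymbols  <- cleanSymbols(withChildren)
  list(
    nDirect         = length(directSymbols),
    nWithChildren   = length(childSymbols),
    # Symbols that only appear once child terms are included.
    onlyViaChildren = setdiff(childSymbols, directSymbols)
  )
}
```

To use it, pass the result of the "go_id" query as direct and the result of the "go_parent_term" query as withChildren, then compare nWithChildren with the deduplicated count shown on AmiGO.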

 


Sorry for the late response! This worked for me. The only thing to note is that if you pass multiple GO IDs to the values parameter, it does not work the way I intended in the question. Thanks!

written 8 months ago by snamjoshi87

What output are you hoping for when you supply multiple GO terms?

written 8 months ago by Mike Smith

I just realized I never specified in my question what output I wanted. If you run the code you supplied above, you get the genes for all child terms. If you supply multiple GO terms, you get the combined genes for all child terms of all the parent GO terms you supplied, which makes sense. But then you have no way of knowing which child term belongs to which parent term; it's all lumped together. Ideally, I could search for several parent terms and get another column indicating which parent term a given child term is associated with. I got around this by writing a function that accepts GO terms, queries each one separately, and uses rbind() to combine the outputs, with a separate column identifying the original parent GO term. There could be a better way, though.
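
The rbind() workaround described above might look roughly like the sketch below. This is not the poster's actual function: getGenesByParent, the parent_go_id column name, and the fetch argument are all assumptions introduced for illustration. The fetch argument defaults to biomaRt::getBM and exists only so the row-binding logic can be tried without a live Ensembl connection.

```r
# Hypothetical helper: query one parent GO term at a time and tag every
# returned row with the parent term that produced it.
getGenesByParent <- function(parentTerms, mart, fetch = biomaRt::getBM) {
  perTerm <- lapply(parentTerms, function(term) {
    res <- fetch(attributes = c("mgi_symbol", "go_id"),
                 filters = "go_parent_term",
                 values = term,
                 mart = mart)
    # Record which parent term this block of rows came from.
    res$parent_go_id <- rep(term, nrow(res))
    res
  })
  # Stack the per-term results into a single data frame.
  do.call(rbind, perTerm)
}
```

With the ensembl mart from the question, a call such as getGenesByParent(c("GO:0098793", "GO:0098794"), mart = ensembl) would return one row per gene and child term, with parent_go_id showing which supplied term each row belongs to. The second GO ID here is only an illustrative choice.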

written 8 months ago by snamjoshi87