Question

biomaRt GO-ID retrieval showing different genes than AmiGO

1


snamjoshi87 ▴ 40

@snamjoshi87-11184

Last seen 8.1 years ago

I am trying to retrieve all genes that match a particular GO-ID using biomaRt:

library(biomaRt)

ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

goGenes <- getBM(attributes = c("mgi_symbol", "go_id"), filters = "go_id", values = "GO:0098793", mart = ensembl)

nrow(goGenes)

This returns a value of 53. However, if you look at the AmiGO page for this GO term and filter for M. musculus, you see that there are actually 779 genes (384 when you remove duplicated MGI symbols).

For a second GO term (linked in the original post), the AmiGO page shows 591 genes after duplicates are removed, but running the query above with that term returns 0 genes.

What am I doing wrong here? Why don't the numbers match up?

biomart go • 3.3k views

updated 8.3 years ago by Mike Smith ★ 6.6k • written 8.3 years ago by snamjoshi87 ▴ 40

score 4 · Accepted Answer · 2016-12-11


Mike Smith ★ 6.6k

@mike-smith

Last seen 9 hours ago

EMBL Heidelberg

This isn't a problem with the biomaRt package per se: you get back the same values if you query the Ensembl BioMart directly.

My instinct is that this query returns only genes that are directly annotated with that GO term; it won't find anything annotated to a child term. Is the same true for AmiGO? A brief look makes me think the gene list on that site includes genes annotated to child terms in the ontology.

Does the overlap improve if you query Ensembl with the parent term? e.g.

goGenes <- getBM(attributes = c("mgi_symbol", "go_id"),
                 filters = "go_parent_term",
                 values = "GO:0098793",
                 mart = ensembl)
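To see how much of the gap is explained by child terms, the two queries can be run side by side. The following is only a sketch: `summariseGoHits()` and `compareGoQueries()` are hypothetical helper names, not part of biomaRt, and it assumes the `go_parent_term` filter is available in the Ensembl release you connect to (`listFilters(ensembl)` should show whether it is).

```r
# Count distinct, non-empty gene symbols in each result and list the
# symbols that appear only when child terms are included.
# (Hypothetical helper, not part of biomaRt.)
summariseGoHits <- function(direct, withChildren, symbolCol = "mgi_symbol") {
  clean <- function(df) {
    s <- unique(df[[symbolCol]])
    s[!is.na(s) & nzchar(s)]
  }
  d <- clean(direct)
  w <- clean(withChildren)
  list(nDirect = length(d),
       nWithChildren = length(w),
       onlyViaChildren = setdiff(w, d))
}

# Run both queries against a mart object. Needs network access and the
# biomaRt package; assumes the "go_parent_term" filter exists.
compareGoQueries <- function(goId, mart) {
  attrs <- c("mgi_symbol", "go_id")
  direct <- biomaRt::getBM(attributes = attrs, filters = "go_id",
                           values = goId, mart = mart)
  withChildren <- biomaRt::getBM(attributes = attrs, filters = "go_parent_term",
                                 values = goId, mart = mart)
  summariseGoHits(direct, withChildren)
}

# Usage (not run here because it contacts Ensembl):
# compareGoQueries("GO:0098793", ensembl)
```

If `nWithChildren` comes out much closer to the AmiGO count than `nDirect`, child-term annotations are the likely cause of the discrepancy.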

8.3 years ago • Mike Smith ★ 6.6k

0

Sorry for the late response! This worked for me. The only thing to note is that if you pass multiple GO IDs to the values parameter, it does not work the way I intended in the question. Thanks!

8.1 years ago • snamjoshi87 ▴ 40

0


What output are you hoping for when you supply multiple GO terms?

8.1 years ago • Mike Smith ★ 6.6k

0

I just realized I never really specified in my question what output I wanted. If you run the code you supplied above, you get the genes for all child terms. If you supply multiple GO terms, you get the combined genes for all child terms of all the parent GO terms you supplied, which makes sense. But then you have no way of knowing which child term belongs to which parent term; it's all lumped together. Ideally, I could search for a bunch of parent terms and there would be another column indicating which parent term a given child term is associated with. I got around this by creating a function that accepts GO terms, queries each one separately, and uses rbind() to combine the outputs, with an extra column identifying the original parent GO term. There could be a better way though.
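A workaround of that kind might look roughly like the sketch below. This is not the exact function I used: `getGenesByParent()` and the `parent_go_id` column are illustrative names, and the `fetch` argument exists only so the function can be tried offline with a stub in place of `biomaRt::getBM`.

```r
# Query each parent GO term separately and tag every row with the parent
# term that produced it, then stack the per-term results with rbind().
# (Hypothetical helper; `fetch` defaults to biomaRt::getBM.)
getGenesByParent <- function(parentTerms, mart, fetch = biomaRt::getBM) {
  perTerm <- lapply(parentTerms, function(term) {
    res <- fetch(attributes = c("mgi_symbol", "go_id"),
                 filters = "go_parent_term",
                 values = term,
                 mart = mart)
    # rep() keeps this safe when a term returns zero rows
    res$parent_go_id <- rep(term, nrow(res))
    res
  })
  do.call(rbind, perTerm)
}

# Usage (not run here because it contacts Ensembl):
# ensembl <- biomaRt::useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# getGenesByParent(c("GO:0098793"), mart = ensembl)
```

A gene annotated under more than one of the supplied parents shows up once per parent, so the result has one row per gene, child term, and parent term combination. It also sends one request per parent term, which can be slow for long lists.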

8.1 years ago • snamjoshi87 ▴ 40