Question: biomaRt returns all NAs for hgnc_symbol
foehn wrote:

Hello,

I'm trying to map mouse symbols to human, using R package biomaRt. Here is my code.

> bm <- useMart(biomart = 'ensembl', dataset = "mmusculus_gene_ensembl")
> SymbolMap <- getBM(attributes = c("mgi_symbol", "hgnc_symbol", "ensembl_gene_id"), filters = "mgi_symbol", mart = bm, values = symbols)
> dim(SymbolMap)
[1] 22154     3
> head(SymbolMap)                                                     
     mgi_symbol hgnc_symbol    ensembl_gene_id
1 0610005C13Rik          NA ENSMUSG00000109644
2 0610009B22Rik          NA ENSMUSG00000007777
3 0610009L18Rik          NA ENSMUSG00000043644
4 0610010F05Rik          NA ENSMUSG00000042208
5 0610010K14Rik          NA ENSMUSG00000020831
6 0610012G03Rik          NA ENSMUSG00000107002

> sum(is.na(SymbolMap[, "hgnc_symbol"]))
[1] 22154
> all(is.na(SymbolMap[, "hgnc_symbol"]))
[1] TRUE
> any(is.na(SymbolMap[, "ensembl_gene_id"]))
[1] FALSE
> length(unique(SymbolMap[, "ensembl_gene_id"]))
[1] 22149

> packageVersion("biomaRt")
[1] ‘2.38.0’

To my surprise, none of the mouse symbols are mapped to human symbols. However, the input mouse symbols map to 22149 Ensembl gene IDs, so the input itself should not be the problem. I'm confused by the results and would like to know whether anybody has seen a similar issue. Thanks.

modified 4 weeks ago by Mike Smith • written 4 weeks ago by foehn

...and it should not be any value other than NA because HGNC is for human gene nomenclature.

written 4 weeks ago by Kevin Blighe

If the code displays sumis.na, it should read sum(is.na(; the same goes for allis.na and anyis.na. I don't know why they are shown differently from the preview...

modified 4 weeks ago • written 4 weeks ago by foehn
Answer: biomaRt returns all NAs for hgnc_symbol
swbarnes2 wrote:

Not like this. You want to find the human orthologs for those mouse symbols. You might need to go symbol -> Ensembl ID -> human orthologs (which won't be one-to-one).

written 4 weeks ago by swbarnes2
Answer: biomaRt returns all NAs for hgnc_symbol
Mike Smith (EMBL Heidelberg / de.NBI) wrote:

It seems you've stumbled across a combination of attributes that results in an invalid query. If you try running your same query in the Ensembl BioMart web interface you get back the following:

Validation Error: Too many attributes selected for External References

I don't know of any way for biomaRt to check for this, but I suspect whatever the issue is server-side is why you're seeing the complete set of NA values.
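
Even if the query itself cannot be validated up front, the result can be checked on the client side once it comes back. A minimal sketch of such a check follows; `all_na_columns` is a hypothetical helper written for this post, not a biomaRt function, and the data frame is rebuilt by hand from the first rows shown in the question so the snippet runs without contacting Ensembl.

```r
# Hypothetical helper (not part of biomaRt): names of the columns in which
# every entry is NA or an empty string.
all_na_columns <- function(result) {
  is_empty <- vapply(
    result,
    function(col) all(is.na(col) | col == ""),
    logical(1)
  )
  names(result)[is_empty]
}

# First rows of the result from the question, rebuilt by hand (assumption:
# a real getBM() result has the same column layout).
SymbolMap <- data.frame(
  mgi_symbol = c("0610005C13Rik", "0610009B22Rik", "0610009L18Rik"),
  hgnc_symbol = NA,
  ensembl_gene_id = c("ENSMUSG00000109644", "ENSMUSG00000007777",
                      "ENSMUSG00000043644"),
  stringsAsFactors = FALSE
)

# Report any column that came back without a single value.
suspect <- all_na_columns(SymbolMap)
if (length(suspect) > 0) {
  message("Columns with no values at all: ", paste(suspect, collapse = ", "))
}
```

A column flagged this way does not prove the query was invalid, but it is a cheap prompt to re-run the same attribute combination in the web interface.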

As for why it's happening, one possible reason is that the attribute name is misleading here. There's very little documentation, but I think this field is only populated for poorly annotated genes that don't have an MGI symbol but have been assigned some speculative HGNC ortholog, e.g. SPATA24. If that's the case, then your query, which explicitly selects genes with MGI symbols, would only ever return results with no value assigned to this field, hence the NAs.

I assume you actually want to find the set of orthologous human genes for your starting set of MGI symbols. If that's the case, here's one approach to finding the HGNC symbols for orthologous genes. First we'll load the library, initialise the mart, and list some example MGI gene symbols:

library(biomaRt)
symbols <- c("0610005C13Rik", "Cdc6", "Gfap")
bm <- useMart(biomart = 'ensembl', dataset = "mmusculus_gene_ensembl")

Next we get the table of mappings between MGI and Ensembl IDs:

mgi2ensembl <- getBM(attributes = c("mgi_symbol", "ensembl_gene_id"), 
    filters = "mgi_symbol", 
    mart = bm, 
    values = symbols)

We then ask for all human orthologs for those Ensembl IDs. As this is an Ensembl dataset you have to use Ensembl IDs as the primary key here.

ensembl2hgnc <- getBM(attributes = c("hsapiens_homolog_associated_gene_name", "ensembl_gene_id"), 
    filters = "ensembl_gene_id", 
    mart = bm, 
    values = mgi2ensembl$ensembl_gene_id)

Finally we merge our two results into a single table to get the final mapping. A blank value indicates no ortholog was reported in Ensembl.

> merge(mgi2ensembl, ensembl2hgnc)
     ensembl_gene_id    mgi_symbol hsapiens_homolog_associated_gene_name
1 ENSMUSG00000017499          Cdc6                                  CDC6
2 ENSMUSG00000020932          Gfap                                  GFAP
3 ENSMUSG00000109644 0610005C13Rik 

There are several ways you can do this with biomaRt, and I wouldn't be surprised if they came up with slightly different results. Mapping between gene symbols and annotation within an organism is fraught with oddities, as is defining orthologs, but the results should be broadly similar.
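
Whichever route you take, it can help to make the final merge step explicit about its join key and about how missing orthologs are represented. The sketch below is one way to do that, under the assumption that the two tables look like the ones in the example above; they are rebuilt by hand here from the values shown, so no query is sent, and the variable names `mapping`, `unmapped` and `one_to_many` are only illustrative.

```r
# Hand-built stand-ins for the two getBM() results in the example above.
mgi2ensembl <- data.frame(
  mgi_symbol = c("0610005C13Rik", "Cdc6", "Gfap"),
  ensembl_gene_id = c("ENSMUSG00000109644", "ENSMUSG00000017499",
                      "ENSMUSG00000020932"),
  stringsAsFactors = FALSE
)
ensembl2hgnc <- data.frame(
  hsapiens_homolog_associated_gene_name = c("", "CDC6", "GFAP"),
  ensembl_gene_id = c("ENSMUSG00000109644", "ENSMUSG00000017499",
                      "ENSMUSG00000020932"),
  stringsAsFactors = FALSE
)

# Name the join key explicitly and keep every input symbol, even one for
# which the second table has no row (all.x = TRUE).
mapping <- merge(mgi2ensembl, ensembl2hgnc, by = "ensembl_gene_id", all.x = TRUE)

# Turn blank strings into NA so "no ortholog reported" is easy to test for.
human <- mapping$hsapiens_homolog_associated_gene_name
human[!is.na(human) & human == ""] <- NA
mapping$hsapiens_homolog_associated_gene_name <- human

# Mouse symbols without a reported human ortholog.
unmapped <- mapping$mgi_symbol[is.na(human)]

# Mouse symbols with more than one human ortholog (none in this toy data).
n_orthologs <- table(mapping$mgi_symbol[!is.na(human)])
one_to_many <- names(n_orthologs)[n_orthologs > 1]
```

Listing `unmapped` and `one_to_many` separately makes it easier to decide how to treat genes that have no human counterpart or several of them.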

modified 4 weeks ago • written 4 weeks ago by Mike Smith

Thanks for the detailed answer. I understand what you and @swbarnes2 mean. But it's a bit weird that the same code worked years ago...

written 4 weeks ago by foehn