Question: biomaRt returns all NAs for hgnc_symbol
foehn60 wrote (6 months ago):

Hello,

I'm trying to map mouse symbols to human, using R package biomaRt. Here is my code.

> bm <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
> SymbolMap <- getBM(attributes = c("mgi_symbol", "hgnc_symbol", "ensembl_gene_id"), filters = "mgi_symbol", mart = bm, values = symbols)
> dim(SymbolMap)
[1] 22154     3
> head(SymbolMap)
     mgi_symbol hgnc_symbol    ensembl_gene_id
1 0610005C13Rik          NA ENSMUSG00000109644
2 0610009B22Rik          NA ENSMUSG00000007777
3 0610009L18Rik          NA ENSMUSG00000043644
4 0610010F05Rik          NA ENSMUSG00000042208
5 0610010K14Rik          NA ENSMUSG00000020831
6 0610012G03Rik          NA ENSMUSG00000107002

> sum(is.na(SymbolMap[, "hgnc_symbol"]))
[1] 22154
> all(is.na(SymbolMap[, "hgnc_symbol"]))
[1] TRUE
> any(is.na(SymbolMap[, "ensembl_gene_id"]))
[1] FALSE
> length(unique(SymbolMap[, "ensembl_gene_id"]))
[1] 22149

> packageVersion("biomaRt")
[1] ‘2.38.0’

To my surprise, none of the mouse symbols were mapped to a human symbol. The input symbols do map to 22149 Ensembl gene IDs, though, so the input itself should be fine. I'm confused by the result and would like to know whether anybody has seen a similar issue. Thanks.

...and it should not be any value other than NA because HGNC is for human gene nomenclature.

Answer: biomaRt returns all NAs for hgnc_symbol
swbarnes2330 wrote (6 months ago):

Not like this. You want to find human orthologs for those mouse symbols. You might need to go symbol -> Ensembl ID -> human orthologs (which won't be 1-1).
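To illustrate the "won't be 1-1" point, here is a minimal offline sketch of a two-step mapping. All identifiers in it (MouseA, ID_1, HUMAN_B1 and so on) are made-up placeholders rather than real genes, and the column names are my own choice; it only shows how chaining the two steps yields one row per mouse/human pair and how you might spot or collapse the one-to-many cases.

```r
# Toy two-step mapping; every identifier here is a made-up placeholder.
symbol2id <- data.frame(
  mouse_symbol = c("MouseA", "MouseB"),
  gene_id      = c("ID_1", "ID_2"),
  stringsAsFactors = FALSE
)
id2human <- data.frame(
  gene_id      = c("ID_1", "ID_2", "ID_2"),
  human_symbol = c("HUMAN_A", "HUMAN_B1", "HUMAN_B2"),
  stringsAsFactors = FALSE
)

# Chaining the two steps gives one row per (mouse, human) pair,
# so MouseB appears twice.
mouse_human <- merge(symbol2id, id2human, by = "gene_id")

# Count human orthologs per mouse symbol to find the non-1:1 cases.
n_orthologs <- table(mouse_human$mouse_symbol)
one_to_many <- names(n_orthologs)[n_orthologs > 1]

# Optionally collapse to one row per mouse symbol,
# joining the human symbols with ";".
collapsed <- aggregate(human_symbol ~ mouse_symbol, data = mouse_human,
                       FUN = paste, collapse = ";")
```

Whether you keep one row per pair or collapse them depends on what you do downstream; collapsing is convenient for display but awkward for further joins.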

Answer: biomaRt returns all NAs for hgnc_symbol
Mike Smith (EMBL Heidelberg / de.NBI) wrote (6 months ago):

It seems you've stumbled across a combination of attributes that results in an invalid query. If you try running your same query in the Ensembl BioMart web interface you get back the following:

Validation Error: Too many attributes selected for External References

I don't know of any way for biomaRt to check for this, but I suspect that whatever is going wrong server-side is the reason every value comes back as NA.

As for why it's happening, one possible reason is that the attribute name is really misleading. There's very little documentation, but I think this field is only populated for poorly annotated genes that don't have an MGI symbol but have been assigned some speculative HGNC ortholog, e.g. SPATA24. If that's the case then your query, which explicitly selects genes with MGI symbols, would only ever return results with no value assigned to this field, hence the NAs.

I assume you actually want to find the set of orthologous human genes for your starting set of MGI symbols. If that's the case, here's one approach to finding the HGNC symbols for orthologous genes. First we'll load the library, initialise the mart, and list some example MGI gene symbols:

library(biomaRt)
symbols <- c("0610005C13Rik", "Cdc6", "Gfap")
bm <- useMart(biomart = 'ensembl', dataset = "mmusculus_gene_ensembl")


Next we get the table of mappings between MGI and Ensembl IDs:

mgi2ensembl <- getBM(attributes = c("mgi_symbol", "ensembl_gene_id"),
                     filters = "mgi_symbol",
                     mart = bm,
                     values = symbols)


We then ask for all human orthologs for those Ensembl IDs. As this is an Ensembl dataset you have to use Ensembl IDs as the primary key here.

ensembl2hgnc <- getBM(attributes = c("hsapiens_homolog_associated_gene_name", "ensembl_gene_id"),
                      filters = "ensembl_gene_id",
                      mart = bm,
                      values = mgi2ensembl$ensembl_gene_id)


Finally we merge our two results into a single table to get the final mapping. A blank value indicates no ortholog was reported in Ensembl.

> merge(mgi2ensembl, ensembl2hgnc)
     ensembl_gene_id    mgi_symbol hsapiens_homolog_associated_gene_name
1 ENSMUSG00000017499          Cdc6                                  CDC6
2 ENSMUSG00000020932          Gfap                                  GFAP
3 ENSMUSG00000109644 0610005C13Rik
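
If you'd rather have missing orthologs show up as NA than as blanks, something along these lines might help. It is only a sketch: the two data frames are hard-coded copies of the results shown above so that it runs without a network connection, and it assumes the "no ortholog" case comes back as an empty string, as the printed output suggests. The names mapping and unmapped are my own. Passing all.x = TRUE also keeps any input symbol for which the second query returned no row at all, which the default merge would silently drop.

```r
# Hard-coded copies of the two query results shown above (no network needed).
mgi2ensembl <- data.frame(
  mgi_symbol      = c("0610005C13Rik", "Cdc6", "Gfap"),
  ensembl_gene_id = c("ENSMUSG00000109644", "ENSMUSG00000017499",
                      "ENSMUSG00000020932"),
  stringsAsFactors = FALSE
)
ensembl2hgnc <- data.frame(
  hsapiens_homolog_associated_gene_name = c("CDC6", "GFAP", ""),
  ensembl_gene_id = c("ENSMUSG00000017499", "ENSMUSG00000020932",
                      "ENSMUSG00000109644"),
  stringsAsFactors = FALSE
)

# Join explicitly on the Ensembl ID; all.x = TRUE keeps every input symbol
# even when the ortholog table has no matching row.
mapping <- merge(mgi2ensembl, ensembl2hgnc,
                 by = "ensembl_gene_id", all.x = TRUE)

# Assumption: a blank string means "no ortholog reported", so recode it as NA.
blank <- mapping$hsapiens_homolog_associated_gene_name %in% ""
mapping$hsapiens_homolog_associated_gene_name[blank] <- NA

# Mouse symbols left without a human ortholog.
unmapped <- mapping$mgi_symbol[
  is.na(mapping$hsapiens_homolog_associated_gene_name)
]
```

With real query results you would drop the two data.frame() calls and use the objects returned by getBM() directly.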


There are several ways you can do this in biomaRt, and I wouldn't be surprised if they gave slightly different results: mapping between gene symbols and annotations within an organism is fraught with oddities, as is defining orthologs. The results should be broadly similar, though.