Question: Ensembl and Entrez disconnected?
0
6 months ago by
bettina.budeus0 wrote:

I'm going a bit crazy here, and maybe this is an easy problem to solve:

I have RNA-seq data with Ensembl IDs as row names and want to map them to gene symbols and Entrez IDs, because other packages need them in one format or the other.

But there seems to be a disconnect between Ensembl and Entrez in both biomaRt and org.Hs.eg.db for some genes. For example, if I want to map the IGHM gene, which has the Ensembl ID ENSG00000211899 and the Entrez ID 3507, I can map both to the symbol, but not to each other:

For biomaRt:

annot <- getBM(mart=mart, attributes=c("ensembl_gene_id", "external_gene_name", "entrezgene"), filters="ensembl_gene_id", values="ENSG00000211899", uniqueRows=TRUE)

This returns the gene name, but no Entrez ID.

For org.Hs.eg.db:

select(org.Hs.eg.db, keys='3507', columns=c('SYMBOL','ENSEMBL'))

This again returns the symbol, but this time no Ensembl ID.

Of course, I can go out and try to map those symbols from biomart to the symbols from org.db, but this is not as it should be, right?
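
The symbol-based workaround described above could look roughly like the following. This is only a minimal sketch in base R: the two data frames are small stand-ins for what getBM() and select() would return, and the column name entrez_final is made up for illustration. It also assumes symbols are unique in both tables, which is not guaranteed for real annotation data.

```r
# Stand-in for the biomaRt result: Ensembl ID, symbol, and (possibly missing) Entrez ID.
ens <- data.frame(
  ensembl_gene_id    = c("ENSG00000211899", "ENSG00000141510"),
  external_gene_name = c("IGHM", "TP53"),
  entrezgene         = c(NA, 7157L),
  stringsAsFactors   = FALSE
)

# Stand-in for the org.Hs.eg.db result: Entrez ID and symbol.
ncbi <- data.frame(
  ENTREZID         = c("3507", "7157"),
  SYMBOL           = c("IGHM", "TP53"),
  stringsAsFactors = FALSE
)

# Bridge the two tables through the gene symbol, keeping every Ensembl row.
bridged <- merge(ens, ncbi,
                 by.x = "external_gene_name", by.y = "SYMBOL",
                 all.x = TRUE)

# Prefer the direct Ensembl-to-Entrez mapping; fall back to the symbol match.
# 'entrez_final' is a hypothetical column name used only in this sketch.
bridged$entrez_final <- ifelse(is.na(bridged$entrezgene),
                               bridged$ENTREZID,
                               as.character(bridged$entrezgene))

print(bridged[, c("ensembl_gene_id", "external_gene_name", "entrez_final")])
```

A symbol match is weaker evidence than a curated cross-reference, so it is probably worth keeping track of which rows were filled in this way.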

Both biomaRt and org.Hs.eg.db are up to date (2.38 and 3.7).

modified 6 months ago by James W. MacDonald50k • written 6 months ago by bettina.budeus0
1
6 months ago by
United States
James W. MacDonald50k wrote:

Anytime you try to map between the annotation services, you will run into literally thousands of things that don't make sense to you. If you add up all the Ensembl IDs that map to NCBI Gene IDs, you will be disappointed because it's somewhere around 50%. And similarly if you go the other way.
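
To see how large the gap is for a particular data set, one could count the fraction of IDs that come back with a mapping. The sketch below uses a small hand-made vector in place of a real lookup; in an actual session the vector might come from something like mapIds(org.Hs.eg.db, keys = ids, keytype = "ENSEMBL", column = "ENTREZID"), where ids is a hypothetical vector of Ensembl gene IDs.

```r
# Stand-in for a named vector of Entrez IDs keyed by Ensembl ID;
# NA marks an Ensembl ID with no Entrez mapping.
mapped <- c(ENSG00000211899 = NA,
            ENSG00000141510 = "7157",
            ENSG00000012048 = "672")

# Fraction of Ensembl IDs that have any Entrez mapping at all.
frac_mapped <- mean(!is.na(mapped))

# The IDs that fell through, for closer inspection.
unmapped_ids <- names(mapped)[is.na(mapped)]

print(frac_mapped)
print(unmapped_ids)
```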

Anyway, if you go to the Ensembl website for this gene, you will find that they do say that 3507 matches. But if you use the BioMart server, or the EnsDb that Johannes Rainer makes for Bioconductor, you will see that they don't report a matching Gene ID:

> select(ensdb, "ENSG00000211899", "ENTREZID", "GENEID")
           GENEID ENTREZID
1 ENSG00000211899       NA

And if you go to the NCBI website, you will find that no Ensembl ID is listed under 'See related'. And if you query the NCBI data that we get from them, they say the gene is on chr14, but give no location beyond that:

> select(org.Hs.eg.db, "3507", c("CHR","CHRLOC"))
'select()' returned 1:1 mapping between keys and columns
  ENTREZID CHR CHRLOC CHRLOCCHR
1     3507  14     NA      <NA>

> select(Homo.sapiens, "3507", c("CDSCHROM","CDSSTART","CDSEND"), "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
  ENTREZID CDSCHROM CDSSTART CDSEND
1     3507     <NA>       NA     NA

> select(Homo.sapiens, "3507", c("TXCHROM","TXSTART","TXEND"), "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
  ENTREZID TXCHROM TXSTART TXEND
1     3507    <NA>      NA    NA

All of which is a roundabout way of saying that mapping genes to the genome and annotating them is not a trivial task. Deciding which genes one group has annotated are 'the same' as the genes you have annotated requires you to define what 'the same' means. That usually requires pretty inflexible rules, which result in fewer cross-service mappings than an uninformed observer might think there should be.