Question: missing gene_ids in org.Mm.egENSEMBL
4.0 years ago by
Gregulator30
Australia
Gregulator30 wrote:

I have been trying to annotate some RNA-seq data using org.Mm.eg.db. The count matrix I was sent by collaborators has Ensembl gene IDs. However, I have been having a problem with missing gene IDs in the egENSEMBL table when I try to annotate. For example, one gene I am interested in, H19, has the gene_id 14955, but this ID does not seem to be present in egENSEMBL. On the other hand, 14955 is present in egSYMBOL. Is there something basic I am missing? Is there a different table I should be using?

Thank you,

Greg

annotation org.mm.eg.db • 607 views
modified 4.0 years ago by James W. MacDonald51k • written 4.0 years ago by Gregulator30
4.0 years ago by
United States
James W. MacDonald51k wrote:

It's not clear what you are after. If you search ensembl.org for 14955, you get a page that indicates there is no Ensembl ID for this gene, so I am not sure how you even had this gene in your set of Ensembl gene IDs. Or are you saying that you expect H19 to be in your list of Ensembl genes and it's not there? If so, there's your answer.

Anyway, there are any number of genes that can be found in one or more of NCBI's databases that cannot be found in EBI's databases, and vice versa. This is particularly true for non-coding RNA.

Sorry, I realize I was very confusing. I'll try to clarify using H19 as an example. When I search for H19 on the Ensembl website, I find that H19's Ensembl gene ID is ENSMUSG00000000031 and its Entrez gene ID is 14955. When I search through the egENSEMBL table, the Ensembl gene ID ENSMUSG00000000031 is not present. However, when I search through the egSYMBOL table, I find that H19 is present and has the Entrez gene ID 14955. See below for the exact commands I used to search through the tables:

> library(org.Mm.eg.db)

> egENSEMBL <- toTable(org.Mm.egENSEMBL)

Then I wrote this table to a text file and searched for ENSMUSG00000000031.

gene_id  ensembl_id
14679    ENSMUSG00000000001
54192    ENSMUSG00000000003
12544    ENSMUSG00000000028
107815   ENSMUSG00000000037
11818    ENSMUSG00000000049
67608    ENSMUSG00000000056

As you can see, there is no entry for H19 in this table. However, when I search through the egSYMBOL table, I find that there is an entry for H19:

> egSYMBOL <- toTable(org.Mm.egSYMBOL)

Then I wrote this table to a text file and searched for 14955.

gene_id  symbol
14944    Gzmg
14945    Gzmk
14950    H13
14955    H19
14957    Hist1h1d
14958    H1f0
14960    H2-Aa

So my question is, why is the gene entry for H19 missing from the egENSEMBL table? Have I done something wrong?
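The same check can be done inside R rather than by writing each table to a text file. The following is only a minimal sketch: the data frames eg_ensembl and eg_symbol are hand-built stand-ins that copy a few rows quoted above, not the real toTable() output, and their names are illustrative.

```r
# Toy stand-ins for toTable(org.Mm.egENSEMBL) and toTable(org.Mm.egSYMBOL),
# built from rows quoted in this thread; the real tables are far larger.
eg_ensembl <- data.frame(
  gene_id    = c("14679", "54192", "12544", "107815"),
  ensembl_id = c("ENSMUSG00000000001", "ENSMUSG00000000003",
                 "ENSMUSG00000000028", "ENSMUSG00000000037"),
  stringsAsFactors = FALSE
)
eg_symbol <- data.frame(
  gene_id = c("14950", "14955", "14957"),
  symbol  = c("H13", "H19", "Hist1h1d"),
  stringsAsFactors = FALSE
)

# Look up the Ensembl ID directly; zero rows means no mapping is recorded.
h19_by_ensembl <- eg_ensembl[eg_ensembl$ensembl_id == "ENSMUSG00000000031", ]

# Look up the Entrez Gene ID in the symbol table.
h19_by_entrez <- eg_symbol[eg_symbol$gene_id == "14955", ]

print(nrow(h19_by_ensembl))
print(h19_by_entrez$symbol)
```

With org.Mm.eg.db loaded, the toy data frames could be replaced by the results of toTable(org.Mm.egENSEMBL) and toTable(org.Mm.egSYMBOL), and the two filtering lines would stay the same.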

No, it doesn't say that 14955 is the matching gene. It says something else:

Overlapping RefSeq Gene ID 14955 matches but different biotype of misc_RNA

So you are saying 'these things are the same', and both Ensembl and NCBI are saying, 'well, not really'. This gets back to what the org.Xx.eg.db packages are: simply a reformulation of data from NCBI, without interpretation on our part, built from mappings that start with NCBI's Gene database. If EBI and NCBI say that the gene is in the same place but is not exactly the same thing, then we won't map 14955 to ENSMUSG00000000031, because NCBI doesn't.

And no, you haven't done anything wrong. Like I said before, when you have two different groups doing essentially the same thing, there are bound to be things that are not completely consistent between the two. And if you look at things from Ensembl's standpoint, they agree to disagree as well:

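> library(biomaRt)
> # Assumed setup (the original post did not show how 'mart' was created)
> mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
> getBM(c("ensembl_gene_id","mgi_symbol", "entrezgene"), "ensembl_gene_id", "ENSMUSG00000000031", mart)
     ensembl_gene_id mgi_symbol entrezgene
1 ENSMUSG00000000031        H19         NA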
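One practical consequence is that some Ensembl IDs in a count matrix will have no Entrez Gene ID in the mapping. They can be kept with an NA instead of being dropped silently. A minimal sketch with toy data follows; count_ids and eg_ensembl are hypothetical stand-ins for the row names of the count matrix and for toTable(org.Mm.egENSEMBL).

```r
# Hypothetical Ensembl IDs from a count matrix; the second one is H19.
count_ids <- c("ENSMUSG00000000001", "ENSMUSG00000000031", "ENSMUSG00000000028")

# Toy mapping table copied from rows quoted earlier in this thread.
eg_ensembl <- data.frame(
  gene_id    = c("14679", "12544"),
  ensembl_id = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
  stringsAsFactors = FALSE
)

# match() returns NA where an ID has no entry, so unmapped IDs are kept.
# Note: match() uses only the first hit, so one-to-many mappings are collapsed.
idx <- match(count_ids, eg_ensembl$ensembl_id)
annotated <- data.frame(
  ensembl_id = count_ids,
  gene_id    = eg_ensembl$gene_id[idx],
  stringsAsFactors = FALSE
)

# IDs that the mapping table does not cover.
unmapped <- annotated$ensembl_id[is.na(annotated$gene_id)]

print(annotated)
print(unmapped)
```

Listing the unmapped IDs explicitly makes it easier to decide, gene by gene, whether to annotate them from another source such as Ensembl itself.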