Question

convert featureCounts's gene ID to entrez gene id

YinCY ▴ 20

@yincy-17934

I'm using featureCounts (from the Rsubread R/Bioconductor package) with a Gencode annotation file to do feature summarization, but the gene IDs come out as shown below. How can I convert those IDs to Entrez Gene IDs?

[1] "ENSMUSG00000102693.1" "ENSMUSG00000064842.1" "ENSMUSG00000051951.5"

[4] "ENSMUSG00000102851.1" "ENSMUSG00000103377.1" "ENSMUSG00000104017.1"

I'm using the biomaRt R package to convert the IDs like this:

mart <- useMart(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl')

genes$entrez <- select(x = mart,
keys = as.character(genes$ensembl),
keytype = 'ensembl_gene_id_version',
column = 'entrezgene')

but it doesn't work!

biomart rsubread gencode • 8.0k views

ADD COMMENT • link updated 7.3 years ago by James W. MacDonald 68k • written 7.3 years ago by YinCY ▴ 20

This is a question about biomaRt so I have added biomaRt as a tag.

ADD REPLY • link 7.3 years ago Gordon Smyth 53k

ok, thanks.

ADD REPLY • link 7.3 years ago YinCY ▴ 20

score 2 · Answer 1 · 2018-10-22

If you want to work with Entrez Gene IDs, then it would be simpler and better to use featureCounts with Rsubread's in-built mouse annotation in the first place instead of Gencode. It is very fast and easy to do that. Then you would have Entrez Gene IDs directly and you would get a count for every possible Entrez Gene ID.
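
A minimal sketch of that route, assuming Rsubread is installed and the reads were aligned to mm10. The helper name `count_entrez` and its arguments are hypothetical, chosen only for illustration; `annot.inbuilt` and `isPairedEnd` are featureCounts arguments.

```r
# Sketch only: `count_entrez` is a hypothetical wrapper, not part of Rsubread.
# It assumes `bam_files` are alignments against the genome named in `genome`.
count_entrez <- function(bam_files, genome = "mm10", paired = FALSE) {
  fc <- Rsubread::featureCounts(
    files         = bam_files,
    annot.inbuilt = genome,  # use Rsubread's in-built annotation, not a GTF
    isPairedEnd   = paired
  )
  # With the in-built annotation, the row names of the count matrix
  # are Entrez Gene IDs, so no ID conversion is needed afterwards.
  fc$counts
}
```

Calling `count_entrez("sample.bam")` (the file name is a placeholder) would then return a count matrix keyed by Entrez Gene ID.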

If you have some strong reason to use Gencode annotation, but want Entrez Gene IDs as well, then I would use the Gencode annotation directly, which you can download from the Gencode website. biomaRt can only give you Ensembl mappings, whereas Gencode is a combination of Ensembl plus other annotation sources.

Whether you use biomaRt or not, I think you will need to remove the version numbers (".1", ".5", etc.) from the Ensembl gene IDs before you will be able to map them.
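
Stripping the version suffix needs only base R. The sketch below uses three of the IDs from the question; `strip_version` is a hypothetical helper name. The pattern matches a dot followed by one or more digits at the end of the string, so multi-digit versions would be handled too.

```r
# IDs as reported by featureCounts with the Gencode annotation
ids <- c("ENSMUSG00000102693.1", "ENSMUSG00000064842.1", "ENSMUSG00000051951.5")

# Hypothetical helper: drop a trailing ".<digits>" version suffix, if present
strip_version <- function(x) sub("\\.[0-9]+$", "", x)

stripped <- strip_version(ids)
print(stripped)
```

IDs that carry no version suffix pass through unchanged, so the helper is safe to apply to a mixed vector.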

score 2 · Answer 2 · 2018-10-23

Gordon is correct, you will have to remove the version numbers from the Ensembl IDs before you can do anything with biomaRt. You should also note that the code you are using for biomaRt doesn't make any sense, as you are using code intended for a Bioconductor OrgDb package instead of actual code that will work for biomaRt. The correct call would be

> gns <- c("ENSMUSG00000102693.1", "ENSMUSG00000064842.1", "ENSMUSG00000051951.5","ENSMUSG00000102851.1", "ENSMUSG00000103377.1", "ENSMUSG00000104017.1")
> mart <- useMart("ensembl","mmusculus_gene_ensembl")

## try to map, including the version numbers
> getBM(c("ensembl_gene_id","entrezgene"), "ensembl_gene_id", gns, mart)
[1] ensembl_gene_id entrezgene     
<0 rows> (or 0-length row.names)

## and now, after stripping them off
> getBM(c("ensembl_gene_id","entrezgene"), "ensembl_gene_id", gsub("\\.[0-9]+$", "", gns), mart)
     ensembl_gene_id entrezgene
1 ENSMUSG00000051951     497097
2 ENSMUSG00000064842         NA
3 ENSMUSG00000102693         NA
4 ENSMUSG00000102851         NA
5 ENSMUSG00000103377         NA
6 ENSMUSG00000104017         NA
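
One caveat when rerunning these calls today: as far as I know, later Ensembl releases renamed the `entrezgene` attribute to `entrezgene_id`, so the attribute name may need adjusting. A small sketch of picking whichever name the mart offers is below; `pick_entrez_attr` is a hypothetical helper, and the commented usage lines assume a `mart` object and an `ids` vector like those above plus a network connection.

```r
# Hypothetical helper: choose the Entrez Gene attribute name that is
# actually present among a mart's attribute names.
pick_entrez_attr <- function(attr_names) {
  candidates <- c("entrezgene_id", "entrezgene")  # newer name first
  hit <- candidates[candidates %in% attr_names]
  if (length(hit) == 0) stop("no Entrez Gene attribute found")
  hit[1]
}

# Usage against a live mart (not run here; needs biomaRt and network access):
# entrez_attr <- pick_entrez_attr(biomaRt::listAttributes(mart)$name)
# biomaRt::getBM(c("ensembl_gene_id", entrez_attr), "ensembl_gene_id", ids, mart)
```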

As Gordon also noted, you should start out with the annotation service you want to use. There is no profit in trying to map from EBI/EMBL or GENCODE IDs to NCBI IDs, because there are any number of technical reasons that a particular ID might not map. For example, if we include the MGI symbols in our call to getBM, we can then use those symbols to look up the Entrez Gene IDs and Ensembl Gene IDs in the NCBI-based org.Mm.eg.db package:

> z <- getBM(c("ensembl_gene_id","entrezgene","mgi_symbol"), "ensembl_gene_id", gsub("\\.[0-9]+$", "", gns), mart)
> z
     ensembl_gene_id entrezgene    mgi_symbol
1 ENSMUSG00000051951     497097          Xkr4
2 ENSMUSG00000064842         NA       Gm26206
3 ENSMUSG00000102693         NA 4933401J01Rik
4 ENSMUSG00000102851         NA       Gm18956
5 ENSMUSG00000103377         NA       Gm37180
6 ENSMUSG00000104017         NA       Gm37363

> library(org.Mm.eg.db)
> select(org.Mm.eg.db, z[,3], c("ENTREZID","ENSEMBL"), "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
         SYMBOL  ENTREZID            ENSEMBL
1          Xkr4    497097 ENSMUSG00000051951
2       Gm26206      <NA>               <NA>
3 4933401J01Rik     71042               <NA>
4       Gm18956 100418032               <NA>
5       Gm37180      <NA>               <NA>
6       Gm37363      <NA>               <NA>

So trying to map annotations between the different annotation services is difficult. For instance, all those Gm genes are predicted genes (predicted according to EBI/EMBL), but NCBI doesn't think they are a thing. And there are any number of NCBI predicted genes that don't have Ensembl IDs. Unless you care to know all the little technical details about what each service thinks is a gene, and where they differ, it's just best to pick one and stick with it.
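
To put a number on how much is lost in a cross-service mapping, a quick base R check of the mapped fraction can help. The sketch below retypes the six-row biomaRt result shown above as a data frame (the variable names are mine) and counts how many Ensembl IDs received an Entrez Gene ID.

```r
# The biomaRt result from above, retyped by hand for illustration
z <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000051951", "ENSMUSG00000064842",
                      "ENSMUSG00000102693", "ENSMUSG00000102851",
                      "ENSMUSG00000103377", "ENSMUSG00000104017"),
  entrezgene      = c(497097, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

has_entrez  <- !is.na(z$entrezgene)
n_mapped    <- sum(has_entrez)               # IDs with an Entrez Gene ID
frac_mapped <- mean(has_entrez)              # fraction of IDs that mapped
unmapped    <- z$ensembl_gene_id[!has_entrez]  # IDs that would be dropped

cat(n_mapped, "of", nrow(z), "Ensembl IDs mapped to an Entrez Gene ID\n")
```

For these six genes only one maps, which is the practical cost of switching annotation services after counting.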