convert featureCounts's gene ID to entrez gene id
YinCY ▴ 20
Hangzhou, Zhejiang

I'm using featureCounts (from the Rsubread R/Bioconductor package) with a Gencode annotation file to do feature summarization, but the gene IDs look like the ones below. How can I convert these IDs to Entrez Gene IDs?

[1] "ENSMUSG00000102693.1" "ENSMUSG00000064842.1" "ENSMUSG00000051951.5"

[4] "ENSMUSG00000102851.1" "ENSMUSG00000103377.1" "ENSMUSG00000104017.1"

I'm using the biomaRt R package to convert the IDs like this:

mart <- useMart(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl')

genes$entrez <- select(x = mart,
                       keys = as.character(genes$ensembl),
                       keytype = 'ensembl_gene_id_version',
                       column = 'entrezgene')

but it doesn't work!

biomart rsubread gencode • 3.6k views

This is a question about biomaRt so I have added biomaRt as a tag.

ok, thanks.

WEHI, Melbourne, Australia

If you want to work with Entrez Gene IDs, then it would be simpler and better to use featureCounts with Rsubread's in-built mouse annotation in the first place instead of Gencode. It is very fast and easy to do that. Then you would have Entrez Gene IDs directly and you would get a count for every possible Entrez Gene ID.
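
As a rough sketch of the suggestion above, featureCounts can be pointed at Rsubread's in-built annotation through its `annot.inbuilt` argument. The wrapper name `count_with_inbuilt`, the BAM file names, and the choice of the `"mm10"` genome build are assumptions made for illustration; use the build your reads were aligned to.

```r
# Assumes the Rsubread package is installed. `bam_files` is a hypothetical
# character vector of paths to aligned BAM files.
count_with_inbuilt <- function(bam_files, genome = "mm10") {
  # annot.inbuilt selects the annotation bundled with Rsubread, whose
  # gene identifiers are Entrez Gene IDs.
  Rsubread::featureCounts(files = bam_files, annot.inbuilt = genome)
}

# Hypothetical usage (not run here because it needs real BAM files):
# fc <- count_with_inbuilt(c("sample1.bam", "sample2.bam"))
# head(rownames(fc$counts))  # row names are Entrez Gene IDs
```

With this route no ID conversion step is needed afterwards.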

If you have some strong reason to use Gencode annotation, but want Entrez Gene IDs as well, then I would use the Gencode annotation directly, which you can download from the Gencode website. BiomaRt can only give you Ensembl mappings, whereas Gencode is a combination of Ensembl plus other annotation sources.

Whether you use biomaRt or not, I think you will need to remove the version numbers ".1", ".5" etc. from the Ensembl gene IDs before you will be able to map them.
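
A minimal base-R sketch of removing the version suffix, using three of the IDs from the question. The helper name `strip_version` is made up for illustration, and the fourth ID with a two-digit version is an invented example to show that multi-digit versions are handled too.

```r
# Example IDs copied from the question above.
ids <- c("ENSMUSG00000102693.1", "ENSMUSG00000064842.1", "ENSMUSG00000051951.5")

# Drop a trailing ".<digits>" version suffix. "[0-9]+" matches one or more
# digits, so versions such as ".12" are removed as well.
strip_version <- function(x) sub("\\.[0-9]+$", "", x)

# The last ID carries an invented two-digit version, for illustration only.
stripped <- strip_version(c(ids, "ENSMUSG00000000001.12"))
print(stripped)
```

IDs that carry no version suffix pass through unchanged, so the helper is safe to apply to a mixed vector.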

it's very helpful! thanks Gordon.

United States

Gordon is correct, you will have to remove the version numbers from the Ensembl IDs before you can do anything with biomaRt. You should also note that the code you are using for biomaRt doesn't make any sense, as you are using code intended for a Bioconductor OrgDb package instead of actual code that will work for biomaRt. The correct call would be

> gns <- c("ENSMUSG00000102693.1", "ENSMUSG00000064842.1", "ENSMUSG00000051951.5","ENSMUSG00000102851.1", "ENSMUSG00000103377.1", "ENSMUSG00000104017.1")
> mart <- useMart("ensembl","mmusculus_gene_ensembl")

## try to map, including the version numbers
> getBM(c("ensembl_gene_id","entrezgene"), "ensembl_gene_id", gns, mart)
[1] ensembl_gene_id entrezgene     
<0 rows> (or 0-length row.names)

## and now, after stripping them off
> getBM(c("ensembl_gene_id","entrezgene"), "ensembl_gene_id", gsub("\\.[0-9]+$", "", gns), mart)
     ensembl_gene_id entrezgene
1 ENSMUSG00000051951     497097
2 ENSMUSG00000064842         NA
3 ENSMUSG00000102693         NA
4 ENSMUSG00000102851         NA
5 ENSMUSG00000103377         NA
6 ENSMUSG00000104017         NA
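
Note that the rows of the result above are not in the same order as the input IDs. One way to attach the Entrez IDs back to the original featureCounts rows is `match()`. The sketch below retypes the table above by hand as a stand-in for the `getBM()` result so that it runs without a network connection; the object names `bm` and `genes` are assumptions for illustration.

```r
# Original IDs in featureCounts row order (from the question above).
gns <- c("ENSMUSG00000102693.1", "ENSMUSG00000064842.1", "ENSMUSG00000051951.5",
         "ENSMUSG00000102851.1", "ENSMUSG00000103377.1", "ENSMUSG00000104017.1")

# Hand-typed stand-in for the getBM() output shown above.
bm <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000051951", "ENSMUSG00000064842",
                      "ENSMUSG00000102693", "ENSMUSG00000102851",
                      "ENSMUSG00000103377", "ENSMUSG00000104017"),
  entrezgene = c(497097L, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Strip the version suffix, then look up each ID's row in bm.
# match() keeps only the first hit, so one-to-many mappings are
# collapsed to their first entry.
stripped <- sub("\\.[0-9]+$", "", gns)
genes <- data.frame(
  ensembl = gns,
  entrez = bm$entrezgene[match(stripped, bm$ensembl_gene_id)],
  stringsAsFactors = FALSE
)
print(genes)
```

IDs with no Entrez mapping stay as `NA`, so the result always has one row per input ID.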

As Gordon also noted, you should start out with the annotation service you want to use. There is no profit in trying to map from EBI/EMBL or GENCODE IDs to NCBI IDs, because there are any number of technical reasons that a particular ID might not map. For example, if we include the MGI symbols in our call to getBM, we can then use those symbols to try to map back to Entrez Gene IDs and Ensembl Gene IDs:

> z <- getBM(c("ensembl_gene_id","entrezgene","mgi_symbol"), "ensembl_gene_id", gsub("\\.[0-9]+$", "", gns), mart)
> z
     ensembl_gene_id entrezgene    mgi_symbol
1 ENSMUSG00000051951     497097          Xkr4
2 ENSMUSG00000064842         NA       Gm26206
3 ENSMUSG00000102693         NA 4933401J01Rik
4 ENSMUSG00000102851         NA       Gm18956
5 ENSMUSG00000103377         NA       Gm37180
6 ENSMUSG00000104017         NA       Gm37363

> library(
> select(, z[,3], c("ENTREZID","ENSEMBL"), "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
         SYMBOL  ENTREZID            ENSEMBL
1          Xkr4    497097 ENSMUSG00000051951
2       Gm26206      <NA>               <NA>
3 4933401J01Rik     71042               <NA>
4       Gm18956 100418032               <NA>
5       Gm37180      <NA>               <NA>
6       Gm37363      <NA>               <NA>

So trying to map annotations between the different annotation services is difficult because, for instance, all those Gm genes are predicted genes (predicted according to EBI/EMBL), but NCBI doesn't think they are a thing. And there are any number of NCBI predicted genes that don't have Ensembl IDs. Unless you care to know all the little technical details about what each service thinks is a gene, and where they differ, it's just best to pick one and stick with it.
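
To make the disagreement concrete, here is a small base-R sketch that lines up the two result tables above by symbol and labels which lookup supplied an Entrez ID. The values are retyped by hand from the outputs in this answer. The object and column names (`ensembl_view`, `ncbi_view`, `status`) are assumptions for illustration.

```r
# Entrez IDs returned by the biomaRt query above (Ensembl's cross-references).
ensembl_view <- data.frame(
  symbol = c("Xkr4", "Gm26206", "4933401J01Rik", "Gm18956", "Gm37180", "Gm37363"),
  entrez_ensembl = c("497097", NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Entrez IDs returned by the symbol-based select() call above.
ncbi_view <- data.frame(
  symbol = c("Xkr4", "Gm26206", "4933401J01Rik", "Gm18956", "Gm37180", "Gm37363"),
  entrez_ncbi = c("497097", NA, "71042", "100418032", NA, NA),
  stringsAsFactors = FALSE
)

both <- merge(ensembl_view, ncbi_view, by = "symbol")

# Label each gene by which lookup produced an Entrez ID.
has_ens <- !is.na(both$entrez_ensembl)
has_ncbi <- !is.na(both$entrez_ncbi)
both$status <- ifelse(has_ens & has_ncbi, "both",
               ifelse(has_ncbi, "ncbi_only",
               ifelse(has_ens, "ensembl_only", "neither")))
print(both)
```

In this small example only Xkr4 agrees in both lookups, two symbols get an Entrez ID only through the symbol-based lookup, and three get none at all.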

Thank you for your generous help! I'm using the Gencode annotation file because the author recommended it. Thanks again!

