I would like to convert murine Ensembl gene IDs to human Ensembl gene IDs using biomaRt.
library(biomaRt)
mart1 <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
mart2 <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
genes.ensembl.biomart <- getLDS(attributes = c("ensembl_gene_id"),
                                filters = "ensembl_gene_id",
                                values = genes.ensembl.murine,
                                mart = mart1,
                                attributesL = c("ensembl_gene_id"),
                                martL = mart2)
To keep the original order of the input vector, I used match.
genes.ensembl <- data.frame (murine_ensembl = genes.ensembl.murine)
genes.ensembl$human_ensembl <- genes.ensembl.biomart[match(genes.ensembl[,1], genes.ensembl.biomart[,1]),2]
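To illustrate what the match step does, here is a minimal sketch on made-up data. The IDs and the object names murine, lds and res are hypothetical stand-ins, not real Ensembl output. It assumes the linked-dataset result simply has no row for a mouse gene without a recorded human counterpart, and may have several rows for a mouse gene with more than one.

```r
# Toy stand-ins for the real objects (all IDs below are invented for illustration)
murine <- c("MOUSE_A", "MOUSE_B", "MOUSE_C")

# Pretend linked-dataset result: MOUSE_B has no row at all,
# and MOUSE_C is linked to two human genes
lds <- data.frame(mouse = c("MOUSE_C", "MOUSE_A", "MOUSE_C"),
                  human = c("HUMAN_3", "HUMAN_1", "HUMAN_4"),
                  stringsAsFactors = FALSE)

# match() returns the position of the FIRST hit only, or NA if there is none
idx <- match(murine, lds$mouse)
res <- data.frame(murine_ensembl = murine,
                  human_ensembl = lds$human[idx],
                  stringsAsFactors = FALSE)

# Input genes that ended up without a human ID
n_missing <- sum(is.na(res$human_ensembl))

# Mouse genes with more than one row in the result; match() keeps only the first
multi <- unique(lds$mouse[duplicated(lds$mouse)])
```

Under these assumptions, an NA in human_ensembl means the gene was absent from the returned table rather than lost in the reordering, and any one-to-many mappings are silently reduced to their first row.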
genes.ensembl.murine is a vector of length 14040.
head(genes.ensembl.murine, 20)
[1] ENSMUSG00000025902 ENSMUSG00000033845 ENSMUSG00000025903 ENSMUSG00000033813
[5] ENSMUSG00000033793 ENSMUSG00000025907 ENSMUSG00000051285 ENSMUSG00000061024
[9] ENSMUSG00000025911 ENSMUSG00000045210 ENSMUSG00000025915 ENSMUSG00000098234
[13] ENSMUSG00000025917 ENSMUSG00000056763 ENSMUSG00000067851 ENSMUSG00000048960
[17] ENSMUSG00000016918 ENSMUSG00000005886 ENSMUSG00000025935 ENSMUSG00000025937
The resulting data frame has 1396 missing values.
head(genes.ensembl[which(is.na(genes.ensembl[, 2])), ], 10)
murine_ensembl human_ensembl
12 ENSMUSG00000098234 <NA>
24 ENSMUSG00000043716 <NA>
43 ENSMUSG00000026064 <NA>
82 ENSMUSG00000073702 <NA>
85 ENSMUSG00000091937 <NA>
133 ENSMUSG00000025980 <NA>
134 ENSMUSG00000073676 <NA>
137 ENSMUSG00000097649 <NA>
146 ENSMUSG00000026035 <NA>
156 ENSMUSG00000097573 <NA>
Ensembl says ENSMUSG00000098234 is Snhg6 (http://www.ensembl.org/Mus_musculus/Gene/Summary?g=ENSMUSG00000098234;r=1:9941959-9944118), and the human orthologue is ENSG00000245910 (http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000245910;r=8:66921684-66926398). However, using BioMart on ensembl.org also doesn't find the human orthologue for ENSMUSG00000098234.
Can anybody help me to convert the missing 1396 genes? Is it a problem with biomaRt or with ensembl.org?
Thank you very much. Mischko
Thanks James, that makes sense. So it's a biological rather than a technical issue. How do you proceed if EMBL does not find orthologue gene ids? Is there a more comprehensive database, or do you just accept the 10% dropouts?
I don't know - that is a question that you will have to answer yourself. I don't know why EBI doesn't think those are orthologs, so without knowing that, how could I say that they 'do not find orthologue gene ids'? Maybe they do find them all, and there is a reason to think that those 1400 genes have no human orthologs. Or maybe they are not doing a good job and somebody else has better data.
But answering those questions requires a much deeper knowledge of the algorithm that EBI uses to define orthologs, and I have at best a superficial knowledge, so would be loath to say that what EBI is doing is not correct.