using biomaRt to look up human gene symbols and map them to human ensembl ENSG IDs
@a235c5de

hello bioconductor community,

firstly, to those who did - thanks for developing and maintaining biomaRt - i use it often and it's a great resource.

recently, i have been using biomaRt to look up human gene symbols from public RNAseq data and map them to human ENSG IDs. however, when i try to look up a list of about 30k symbols, only about 20k find matches using the getLDS() function as below. Do you know why this might be? it may look a little odd as i am designing it to look up gene symbols across species.

mrt = useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
tbl.match = getLDS(attributes = "ensembl_gene_id", mart = mrt, filters = "ensembl_gene_id", values = HumanGeneSymbolsFromRNAseq, martL = mrt)

it appears that most of the "conventional", well-studied/named genes map, but there are many non-coding genes, pseudogenes and others that have gene symbols that do not appear in the biomaRt lookup - are there different versions of gene symbols that researchers may be using that do not map in biomaRt?

thanks for any insight you might have,

Carlo

@james-w-macdonald-5106

This doesn't require getLDS, because that is intended for mapping between species. You just want to use regular getBM instead. You might also try an EnsDb.

> library(biomaRt)
> library(org.Hs.eg.db)  ## needed for keys(org.Hs.eg.db, "SYMBOL") below
> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")
> symb <- keys(org.Hs.eg.db, "SYMBOL")
> head(symb)
[1] "A1BG"  "A2M"   "A2MP1" "NAT1"  "NAT2"  "NATP" 
> z <- getBM(c("ensembl_gene_id","hgnc_symbol"), "hgnc_symbol", symb, mart)
> head(z)
  ensembl_gene_id hgnc_symbol
1 ENSG00000121410        A1BG
2 ENSG00000175899         A2M
3 ENSG00000256069       A2MP1
4 ENSG00000114771       AADAC
5 ENSG00000127837        AAMP
6 ENSG00000129673       AANAT
> dim(z)
[1] 44160     2
> length(symb)
[1] 66091

## you are right - not all symbols map. Let's try an EnsDb
> library(AnnotationHub)
Loading required package: BiocFileCache
Loading required package: dbplyr

Attaching package: 'AnnotationHub'

The following object is masked from 'package:Biobase':

    cache

> hub <- AnnotationHub()
  |======================================================================| 100%

snapshotDate(): 2022-04-21
> query(hub, c("homo sapiens","ensdb"))
AnnotationHub with 21 records
# snapshotDate(): 2022-04-21
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH89180  | Ensembl 102 EnsDb for Homo sapiens
  AH89426  | Ensembl 103 EnsDb for Homo sapiens
  AH95744  | Ensembl 104 EnsDb for Homo sapiens
  AH98047  | Ensembl 105 EnsDb for Homo sapiens
  AH100643 | Ensembl 106 EnsDb for Homo sapiens
> ensdb <- hub[["AH100643"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")
> zz <- select(ensdb, symb, "GENEID", "GENENAME")
> head(zz)
  GENENAME          GENEID
1     A1BG ENSG00000121410
2      A2M ENSG00000175899
3      A2M         LRG_591
4    A2MP1 ENSG00000256069
5     NAT1 ENSG00000171428
6     NAT2 ENSG00000156006
> dim(zz)
[1] 45707     2

## Still not 100% mapping

This is a mapping between HGNC and EBI/EMBL, and not all genes that have symbols are thought to be genes by EBI/EMBL. As an example:

> head(symb[!symb %in% z[,2]])
[1] "AAVS1"   "ACLS"    "ACTBP3"  "ACTG1P6" "ACTG1P7" "ACTG1P8"

And AAVS1 has no Ensembl Gene ID, but for example A1BG does.
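
To see exactly which symbols drop out, it may help to tabulate the result rather than compare row counts. dim(z) counts rows, not distinct symbols, and the EnsDb result can also contain non-ENSG records such as LRG_591. The sketch below uses small toy stand-ins for `symb` and `zz`, copied from a few rows of the output above. They are not real query results, and `ensg`, `unmapped`, `multi` and `mapped_fraction` are just illustrative names.

```r
# Toy stand-ins for the objects in the session above (assumed, not real query results):
# `symb` is the vector of symbols that was looked up, `zz` the returned mapping.
symb <- c("A1BG", "A2M", "A2MP1", "AAVS1", "ACLS")
zz <- data.frame(
  GENENAME = c("A1BG", "A2M", "A2M", "A2MP1"),
  GENEID   = c("ENSG00000121410", "ENSG00000175899", "LRG_591", "ENSG00000256069"),
  stringsAsFactors = FALSE
)

# Keep only Ensembl gene IDs; this drops LRG records such as LRG_591.
ensg <- zz[grepl("^ENSG", zz$GENEID), ]

# Symbols with no Ensembl gene ID at all.
unmapped <- setdiff(symb, ensg$GENENAME)

# Symbols that map to more than one Ensembl gene ID.
n_ids <- table(ensg$GENENAME)
multi <- names(n_ids)[n_ids > 1]

# Fraction of distinct input symbols with at least one Ensembl gene ID.
mapped_fraction <- mean(unique(symb) %in% ensg$GENENAME)

print(unmapped)
print(multi)
print(mapped_fraction)
```

With the real objects you would skip the toy definitions and run the same steps on `symb` and `zz` (or on `z`, using its `hgnc_symbol` and `ensembl_gene_id` columns instead).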
