biomart query in R is not matching ensembl website - missing results
Alice • 0
@b657b6a6
Last seen 2 days ago
United States

Here is a fun little problem. Here is an example gene: ENSG00000006074 - CCL18

I get the expected result on the Ensembl website (with ENSG00000107331 included as a positive control).

Here is the XML query from the BioMart site (http://grch37.ensembl.org/index.html):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "TSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >

    <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
        <Filter name = "ensembl_gene_id" value = "ENSG00000006074,ENSG00000107331"/>
        <Attribute name = "ensembl_gene_id" />
        <Attribute name = "ensembl_gene_id_version" />
        <Attribute name = "hgnc_symbol" />
        <Attribute name = "description" />
        <Attribute name = "gene_biotype" />
    </Dataset>
</Query>

However, let's try the same thing in R (either with biomaRt or by sending the query ourselves):

fullXmlQuery <- "<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query  virtualSchemaName = 'default' uniqueRows = '0' count='' datasetConfigVersion='0.6' header='1' formatter='TSV' requestid='biomaRt'> <Dataset name = 'hsapiens_gene_ensembl' interface = 'default'><Attribute name = 'ensembl_gene_id'/><Attribute name = 'description'/><Attribute name = 'hgnc_symbol'/><Attribute name = 'gene_biotype'/><Filter name = 'ensembl_gene_id' value = 'ENSG00000006074,ENSG00000107331' /></Dataset></Query>"

res <- httr::POST(url = "https://apr2019.archive.ensembl.org:443/biomart/martservice",
                  body = list('query' = fullXmlQuery),
                  config = httr::config())

httr::content(res)

This will only return results for ENSG00000107331 and never ENSG00000006074.

[1] "Gene stable ID\tGene description\tHGNC symbol\tGene type\nENSG00000107331\tATP binding cassette subfamily A member 2 [Source:HGNC Symbol;Acc:HGNC:32]\tABCA2\tprotein_coding\n"
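
A quick way to see which requested IDs are missing from a response like the one above is to parse the TSV text and compare it with the filter values. This is only a minimal sketch in base R: the variable names are made up for illustration, and the response string is simply copied from the output above.

```r
# Raw TSV body returned by the martservice call (copied from the output above).
tsv <- "Gene stable ID\tGene description\tHGNC symbol\tGene type\nENSG00000107331\tATP binding cassette subfamily A member 2 [Source:HGNC Symbol;Acc:HGNC:32]\tABCA2\tprotein_coding\n"

# IDs that were sent in the ensembl_gene_id filter.
requested <- c("ENSG00000006074", "ENSG00000107331")

# Parse the tab-separated text; quote = "" keeps quote characters in
# descriptions literal, check.names = FALSE keeps the header names as sent.
parsed <- read.delim(text = tsv, sep = "\t", quote = "",
                     check.names = FALSE, stringsAsFactors = FALSE)

# Requested IDs that did not come back in the response.
missing_ids <- setdiff(requested, parsed[["Gene stable ID"]])
print(missing_ids)
```

Running the same comparison on responses from different hosts makes it easy to spot when an ID exists in one release but not in another.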

I also see this issue with a service like g:Profiler (https://biit.cs.ut.ee/gprofiler/convert).

Here is the biomaRt package query too:

ens_version <- biomaRt::useEnsembl(biomart = 'genes', 
                                   dataset = 'hsapiens_gene_ensembl',
                                   version = 96)
# Query the single problem gene against this release
biomaRt::getBM(attributes = c("ensembl_gene_id",
                              "description",
                              "hgnc_symbol",
                              "gene_biotype"), #,"entrezgene"
               filters = 'ensembl_gene_id',
               values = "ENSG00000006074",
               mart = ens_version,
               uniqueRows = FALSE)

Returns:

[1] ensembl_gene_id description     hgnc_symbol     gene_biotype   
<0 rows> (or 0-length row.names)

Is this an encoding issue, or a known issue related to versions? Note that I have tried several Ensembl versions. I have not searched exhaustively for other genes with the same problem.


It appears that this is a GRCh37 versus GRCh38 issue. I want to map to hg19/GRCh37 IDs, but the default for these services is GRCh38.

If anyone else finds this, you can change assembly in biomaRt like this:

ens_version <- biomaRt::useEnsembl(biomart = 'genes',
                                   dataset = 'hsapiens_gene_ensembl',
                                   GRCh = 37)

The URL for raw martservice queries against GRCh37 is: "https://grch37.ensembl.org:443/biomart/martservice"
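
For the raw-query route, the same XML can be sent to whichever martservice endpoint matches the assembly you need. Here is a rough sketch of that idea. Note that build_query, mart_urls and run_query are hypothetical names of mine, not part of biomaRt or httr; the two URLs are the ones quoted in this thread. The POST is wrapped in a function so nothing is sent until you call it.

```r
# Hypothetical helper: build a BioMart XML query for a set of Ensembl gene IDs.
build_query <- function(ids, attributes) {
  attr_xml <- paste0("<Attribute name = '", attributes, "'/>", collapse = "")
  paste0(
    "<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query>",
    "<Query virtualSchemaName = 'default' uniqueRows = '0' count = '' ",
    "datasetConfigVersion = '0.6' header = '1' formatter = 'TSV'>",
    "<Dataset name = 'hsapiens_gene_ensembl' interface = 'default'>",
    attr_xml,
    "<Filter name = 'ensembl_gene_id' value = '", paste(ids, collapse = ","), "'/>",
    "</Dataset></Query>"
  )
}

# Hypothetical lookup table of endpoints; both URLs appear in this thread.
mart_urls <- c(
  GRCh37    = "https://grch37.ensembl.org:443/biomart/martservice",
  release96 = "https://apr2019.archive.ensembl.org:443/biomart/martservice"
)

query <- build_query(
  ids = c("ENSG00000006074", "ENSG00000107331"),
  attributes = c("ensembl_gene_id", "description", "hgnc_symbol", "gene_biotype")
)

# Network call (requires httr); defined here but not executed.
run_query <- function(assembly = "GRCh37") {
  res <- httr::POST(url = mart_urls[[assembly]], body = list(query = query))
  httr::content(res, as = "text", encoding = "UTF-8")
}
```

Calling run_query("GRCh37") and run_query("release96") with the same query should make the difference between the two hosts visible directly.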

Alice • 0
@b657b6a6
Last seen 2 days ago
United States

I answered this myself - just remember to check what assembly you need and query the correct one.

Mike Smith ★ 5.2k
@mike-smith
Last seen 1 minute ago
EMBL Heidelberg / de.NBI

I think the key point here is that you're querying different Ensembl versions. Most of the data in the GRCh37 version is now pretty old: it is Ensembl release 75 from February 2014. When you run a BioMart query on https://grch37.ensembl.org/ (whether in the browser or via biomaRt) you're getting annotation data from that time point. They did update some variation and regulation data recently, but mostly it's static as of Feb 2014.

When you switch to Ensembl version 96 you're getting the annotation that was deemed correct in April 2019. Since you don't find ENSG00000006074 it was presumably retired from Ensembl sometime between those two releases.

If you search for ENSG00000006074 in the current version of Ensembl you'll see that it was removed from the database after version 75. Presumably there was something quite different between the GRCh37 and GRCh38 assemblies, but you might have to dig to find out what exactly.

ENSG00000006074 details from Ensembl v105

Here's a biomaRt query that gets data from https://grch37.ensembl.org/ and returns results for both genes. Whether you want to be using annotation from that time point is up to you.

library(biomaRt)

ensembl_GRCh37 <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', GRCh = 37)
getBM(attributes = c("ensembl_gene_id",
                     "description",
                     "hgnc_symbol",
                     "gene_biotype"), 
      filters = 'ensembl_gene_id',
      values = c("ENSG00000006074", "ENSG00000107331"), 
      mart = ensembl_GRCh37)
#>   ensembl_gene_id
#> 1 ENSG00000006074
#> 2 ENSG00000107331
#>                                                                                           description
#> 1 chemokine (C-C motif) ligand 18 (pulmonary and activation-regulated) [Source:HGNC Symbol;Acc:10616]
#> 2                     ATP-binding cassette, sub-family A (ABC1), member 2 [Source:HGNC Symbol;Acc:32]
#>   hgnc_symbol   gene_biotype
#> 1       CCL18 protein_coding
#> 2       ABCA2 protein_coding