Disagreement between results of two 'getBM' queries
matrs ▴ 10
@matrs-15100

R version 3.6.0 (2019-04-26)

Platform: x86_64-pc-linux-gnu (64-bit)

BiocManager 1.30.4

biomaRt_2.40.3 (latest)

I've been using the biomaRt package to map gene identifiers between the databases of different organizations, and I noticed something that looks like a bug to me, though I'm not sure.

I'm trying to map ensembl_gene_id to entrezgene_id. Because I needed more information, I first built a larger data frame, names_mart:

library(biomaRt)

mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                dataset = "scerevisiae_gene_ensembl",
                host = "ensembl.org")

names_mart <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
                                   "entrezgene_id", "external_gene_name",
                                   "kegg_enzyme", "goslim_goa_accession",
                                   "goslim_goa_description", "description"),
                    mart = mart)

#This contains only 'ensembl_gene_id' and 'entrezgene_id' columns
ensembl2entrez <- names_mart[, 2:3]

#Getting the same info directly from 'getBM'
ensembl2entrez_bio <- getBM(attributes = c("ensembl_gene_id", "entrezgene_id"), mart = mart)

#Keep only rows with a non-NA 'entrezgene_id'. After this, the two data frames should contain the same set of 'entrezgene_id' values, but they don't

ensembl2entrez <- ensembl2entrez[!is.na(ensembl2entrez$entrezgene_id), ]
ensembl2entrez_bio <- ensembl2entrez_bio[!is.na(ensembl2entrez_bio$entrezgene_id), ]

#Number of unique values in each column
apply(ensembl2entrez, 2, function(col) length(unique(col)))
#ensembl_gene_id   entrezgene_id 
#           5507            5505 
apply(ensembl2entrez_bio, 2, function(col) length(unique(col)))
#ensembl_gene_id   entrezgene_id 
#           5804            5801 

ensem2entre_intersect <- intersect(ensembl2entrez_bio$entrezgene_id, ensembl2entrez$entrezgene_id)
length(ensem2entre_intersect)
#[1] 5505

ensem2entre_set_diff <- setdiff(ensembl2entrez_bio$entrezgene_id, ensembl2entrez$entrezgene_id)
length(ensem2entre_set_diff)
#[1] 296

I don't understand why the number of unique elements differs between these two biomaRt queries: the two-attribute query returns 296 Entrez IDs that are missing from the larger query. What could explain this difference?

@james-w-macdonald-5106

When you put lots of attributes in, you are doing some sort of join between different database tables, and it looks like the GO table might be messing you up.

> z <- head(ensembl2entrez_bio[!ensembl2entrez_bio$entrezgene_id %in% ensembl2entrez$entrezgene_id,1])
> z
[1] "YBR225W"   "YBR182C-A" "YOR114W"   "YIL156W-B" "YAL037W"   "YGL015C" 
> getBM(c("ensembl_gene_id", "entrezgene_id"), "ensembl_gene_id", z, mart)
  ensembl_gene_id entrezgene_id
1         YAL037W        851194
2       YBR182C-A       1466446
3         YBR225W        852526
4         YGL015C        852869
5       YIL156W-B       3628034
6         YOR114W        854281
> getBM(c("ensembl_gene_id", "entrezgene_id", "goslim_goa_accession"), "ensembl_gene_id", z, mart)
[1] ensembl_gene_id      entrezgene_id        goslim_goa_accession
<0 rows> (or 0-length row.names)

> getBM(c("ensembl_gene_id", "goslim_goa_accession" ), "ensembl_gene_id", z, mart)
[1] ensembl_gene_id      goslim_goa_accession
<0 rows> (or 0-length row.names)

> getBM(c("ensembl_gene_id", "go_id" ), "ensembl_gene_id", z, mart)
  ensembl_gene_id go_id
1         YAL037W    NA
2       YBR182C-A    NA
3         YBR225W    NA
4         YGL015C    NA
5       YIL156W-B    NA
6         YOR114W    NA

It may well be a combination of different tables, depending on the gene, but the more attributes you ask for, the more likely you are to run into things like this. I'm not sure I would call it a bug per se. It is more likely the consequence of joining a bunch of tables with some kind of reducing join (one that drops rows lacking a match in every table), presumably to keep the results from blowing up exponentially.
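The effect of such a reducing join can be reproduced offline with base R's merge(). This is only a toy sketch: the two small data frames stand in for database tables, the row named GENE_WITH_GO and its values are made-up placeholders, and nothing here claims to show how the BioMart server actually builds its result.

```r
# Toy stand-ins for two database tables. The first two rows reuse IDs from the
# output above; "GENE_WITH_GO" and its values are hypothetical placeholders.
genes <- data.frame(
  ensembl_gene_id = c("YAL037W", "YBR225W", "GENE_WITH_GO"),
  entrezgene_id   = c(851194, 852526, 999999),
  stringsAsFactors = FALSE
)
goslim <- data.frame(
  ensembl_gene_id      = "GENE_WITH_GO",
  goslim_goa_accession = "GO:0000001",  # placeholder accession
  stringsAsFactors = FALSE
)

# Inner join (merge's default): genes without a GO slim row vanish entirely.
inner <- merge(genes, goslim, by = "ensembl_gene_id")

# Left join: every gene is kept and the missing annotation becomes NA.
left <- merge(genes, goslim, by = "ensembl_gene_id", all.x = TRUE)

nrow(inner)  # 1
nrow(left)   # 3
```

The inner join mirrors the zero-row results seen with goslim_goa_accession, while the left join mirrors the go_id query, which keeps the genes and reports NA.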


Thanks for your answer. I'll keep this in mind and query as few attributes as possible at once.
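One way to act on this, sketched under assumptions: fetch the ID mapping and each extra annotation in separate small queries, then left-join them locally so that genes lacking an annotation are kept with NA. The helper combine_annotations() is hypothetical (it is not part of biomaRt), and the gene IDs and values in the toy tables are made up.

```r
# Hypothetical helper (not a biomaRt function): left-join any number of
# annotation tables onto a base ID mapping, so no row of `base` is lost
# when an annotation is missing.
combine_annotations <- function(base, ..., key = "ensembl_gene_id") {
  Reduce(function(acc, tbl) merge(acc, tbl, by = key, all.x = TRUE),
         list(...), base)
}

# Toy tables with made-up IDs standing in for separate getBM() results.
base <- data.frame(ensembl_gene_id = c("geneA", "geneB"),
                   entrezgene_id   = c(101, 102),
                   stringsAsFactors = FALSE)
slim <- data.frame(ensembl_gene_id      = "geneB",
                   goslim_goa_accession = "GO:0000001",  # placeholder
                   stringsAsFactors = FALSE)

combined <- combine_annotations(base, slim)

# With a live mart object (needs network access, so not run here):
# base <- getBM(c("ensembl_gene_id", "entrezgene_id"), mart = mart)
# slim <- getBM(c("ensembl_gene_id", "goslim_goa_accession"), mart = mart)
# combined <- combine_annotations(base, slim)
```

Because the join happens locally with all.x = TRUE, the set of genes in the result is determined by the base query alone. A gene with several annotation rows will still appear once per annotation.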
