Question: Disagreement between results of two 'getBM' queries
28 days ago, matrs wrote:

R version 3.6.0 (2019-04-26)

Platform: x86_64-pc-linux-gnu (64-bit)

BiocManager 1.30.4

biomaRt_2.40.3 (latest)

I've been working with the biomaRt package to map identifiers between different organizations, and I noticed something that seems like a bug to me, but I'm not sure.

I'm trying to get mappings between ensembl_gene_id and entrezgene_id. I started by building a bigger data frame because I needed more info, so I defined names_mart:

library(biomaRt)

mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset="scerevisiae_gene_ensembl", host = "ensembl.org")

names_mart <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "entrezgene_id","external_gene_name", "kegg_enzyme" ,"goslim_goa_accession", "goslim_goa_description","description"), mart = mart)

#This contains only 'ensembl_gene_id' and 'entrezgene_id' columns
ensembl2entrez <- names_mart[, 2:3]

#Getting the same info directly from 'getBM'
ensembl2entrez_bio <- getBM(attributes = c("ensembl_gene_id", "entrezgene_id"), mart = mart)

#Keep only rows with a non-NA 'entrezgene_id'. These two data frames should contain the same 'entrezgene_id' values, but they don't

ensembl2entrez <- ensembl2entrez[!is.na(ensembl2entrez$entrezgene_id), ]
ensembl2entrez_bio <- ensembl2entrez_bio[!is.na(ensembl2entrez_bio$entrezgene_id), ]

apply(ensembl2entrez, 2, function(df) length(unique(df)))
#[1]ensembl_gene_id   entrezgene_id 
#           5507            5505 
apply(ensembl2entrez_bio, 2, function(df) length(unique(df)))
#[1]ensembl_gene_id   entrezgene_id 
#          5804            5801 

ensem2entre_intersect <- intersect(ensembl2entrez_bio$entrezgene_id,ensembl2entrez$entrezgene_id)
length(ensem2entre_intersect)
#[1] 5505

ensem2entre_set_diff <- setdiff(ensembl2entrez_bio$entrezgene_id,ensembl2entrez$entrezgene_id)
length(ensem2entre_set_diff)
#[1] 296

I don't understand why the number of unique elements differs between these two biomart queries. What could explain this difference?

Answer: Disagreement between results of two 'getBM' queries
28 days ago, James W. MacDonald (United States) wrote:

When you put lots of attributes in, you are doing some sort of join between different database tables, and it looks like the GO table might be messing you up.

> z <- head(ensembl2entrez_bio[!ensembl2entrez_bio$entrezgene_id %in% ensembl2entrez$entrezgene_id,1])
> z
[1] "YBR225W"   "YBR182C-A" "YOR114W"   "YIL156W-B" "YAL037W"   "YGL015C" 
> getBM(c("ensembl_gene_id", "entrezgene_id"), "ensembl_gene_id", z, mart)
  ensembl_gene_id entrezgene_id
1         YAL037W        851194
2       YBR182C-A       1466446
3         YBR225W        852526
4         YGL015C        852869
5       YIL156W-B       3628034
6         YOR114W        854281
> getBM(c("ensembl_gene_id", "entrezgene_id", "goslim_goa_accession"), "ensembl_gene_id", z, mart)
[1] ensembl_gene_id      entrezgene_id        goslim_goa_accession
<0 rows> (or 0-length row.names)

> getBM(c("ensembl_gene_id", "goslim_goa_accession" ), "ensembl_gene_id", z, mart)
[1] ensembl_gene_id      goslim_goa_accession
<0 rows> (or 0-length row.names)

> getBM(c("ensembl_gene_id", "go_id" ), "ensembl_gene_id", z, mart)
  ensembl_gene_id go_id
1         YAL037W    NA
2       YBR182C-A    NA
3         YBR225W    NA
4         YGL015C    NA
5       YIL156W-B    NA
6         YOR114W    NA

It may well be a combination of different tables, depending on the gene, but the more attributes you ask for, the more likely you will run into things like this. Not sure I would call it a bug per se, more like the consequence of joining a bunch of tables where there is probably a reducing type join being used, to keep the results from blowing up exponentially.
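The effect of a "reducing" join can be reproduced offline with base R's merge(). This is only a sketch with made-up data: the gene names, Entrez numbers and GO labels below are placeholders, not query results, and it does not claim to show how the BioMart server actually builds its joins.

```r
# Placeholder tables: three genes, only one of which has GO slim annotations
genes <- data.frame(
  ensembl_gene_id = c("geneA", "geneB", "geneC"),
  entrezgene_id   = c(101L, 102L, 103L),
  stringsAsFactors = FALSE
)
goslim <- data.frame(
  ensembl_gene_id      = c("geneA", "geneA"),
  goslim_goa_accession = c("GO_X", "GO_Y"),
  stringsAsFactors = FALSE
)

# Inner join: genes without a GO slim row disappear from the result
inner <- merge(genes, goslim, by = "ensembl_gene_id")

# Left join: every gene is kept, with NA where no annotation exists
left <- merge(genes, goslim, by = "ensembl_gene_id", all.x = TRUE)

length(unique(inner$entrezgene_id))
length(unique(left$entrezgene_id))
```

The inner join loses geneB and geneC entirely, which is the same pattern as the six genes above that return rows for entrezgene_id alone but zero rows once goslim_goa_accession is added.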


Thanks for your answer. I'll keep this in mind and query as few attributes as possible at once.
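One way to follow that advice is to fetch each extra attribute in its own query and left-join the pieces on the gene ID. The sketch below is hypothetical: query_separately() and toy_fetch() are names invented here, and toy_fetch() only imitates a server that drops incomplete rows. With biomaRt, the fetch argument could presumably be something like function(attrs) getBM(attributes = attrs, mart = mart).

```r
# Hypothetical helper: query the key alone, then each extra attribute
# separately, and left-join so a missing annotation cannot drop a gene.
query_separately <- function(fetch, key, extra_attrs) {
  result <- unique(fetch(key))
  for (a in extra_attrs) {
    piece <- unique(fetch(c(key, a)))
    result <- merge(result, piece, by = key, all.x = TRUE)
  }
  result
}

# Offline stand-in for the server (placeholder data): it returns only
# rows that are complete for the requested attributes.
toy <- data.frame(
  ensembl_gene_id = c("geneA", "geneB", "geneC"),
  entrezgene_id   = c(101L, 102L, NA),
  go_slim         = c("GO_X", NA, "GO_Y"),
  stringsAsFactors = FALSE
)
toy_fetch <- function(attrs) {
  out <- toy[, attrs, drop = FALSE]
  out[complete.cases(out), , drop = FALSE]
}

all_at_once <- toy_fetch(c("ensembl_gene_id", "entrezgene_id", "go_slim"))
piecewise   <- query_separately(toy_fetch, "ensembl_gene_id",
                                c("entrezgene_id", "go_slim"))

nrow(all_at_once)
nrow(piecewise)
```

In this toy setup the single combined query keeps one gene, while the piecewise version keeps all three and fills the gaps with NA. The cost is one round trip per attribute, and attributes with several values per gene will still multiply rows after merging.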

Reply written 28 days ago by matrs