Question: Why does a biomaRt query behave inconsistently between the March and May 2015 Ensembl archives?
jmeisig (Germany) wrote 2.7 years ago:

Hi,

I noticed a problem with biomaRt-driven Ensembl queries, and the behaviour seems to depend on the Ensembl version used. I'm trying to obtain chicken (ggallus) homologs for mouse genes on all autosomes and the X chromosome. With the May 2015 Ensembl archive I need to query each chromosome separately in a loop to avoid missing genes; with the March 2015 archive I can query all chromosomes at once.

 

library(biomaRt)
# Connect to the May 2015 Ensembl archive and select the mouse gene dataset
ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host = "may2015.archive.ensembl.org")
ensemblmmusculus.new <- useDataset("mmusculus_gene_ensembl", mart = ensembl.new)
attrs <- c("ensembl_gene_id", "ggallus_homolog_orthology_type", "ggallus_homolog_orthology_confidence", "ggallus_homolog_chromosome", "chromosome_name")
chroms <- c("X", as.character(1:19))
# Looped version: one getBM() call per chromosome, then total the row counts
aa <- lapply(chroms, function(x) getBM(attributes = attrs, filters = "chromosome_name", values = x, mart = ensemblmmusculus.new))
print(Reduce("+", lapply(aa, nrow)))
# Non-looped version: all chromosomes in a single getBM() call
chromosome.input <- getBM(attributes = attrs, filters = "chromosome_name", values = chroms, mart = ensemblmmusculus.new)
print(nrow(chromosome.input))

# Same comparison against the March 2015 Ensembl archive
ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host = "mar2015.archive.ensembl.org")
ensemblmmusculus.new <- useDataset("mmusculus_gene_ensembl", mart = ensembl.new)
attrs <- c("ensembl_gene_id", "ggallus_homolog_orthology_type", "ggallus_homolog_orthology_confidence", "ggallus_homolog_chromosome", "chromosome_name")
chroms <- c("X", as.character(1:19))
aa <- lapply(chroms, function(x) getBM(attributes = attrs, filters = "chromosome_name", values = x, mart = ensemblmmusculus.new))
print(Reduce("+", lapply(aa, nrow)))
chromosome.input <- getBM(attributes = attrs, filters = "chromosome_name", values = chroms, mart = ensemblmmusculus.new)
print(nrow(chromosome.input))

 

The difference shows up clearly in the row counts printed above. For the May version I get 44679 rows from the loop but only 27372 from the single (non-looped) query. With the March version I get 42561 rows both ways.
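To narrow down where the rows go missing, it may help to tally both results per chromosome. The following is only a minimal sketch in base R: `compare_by_chromosome` is a hypothetical helper (not part of biomaRt), and the toy data frames merely stand in for the combined looped result (for example `do.call(rbind, aa)`) and `chromosome.input`.

```r
# Tally rows per chromosome in two result sets and report where they differ.
# `looped` and `batched` are data frames with a `chromosome_name` column.
compare_by_chromosome <- function(looped, batched) {
  # Chromosome names may come back as integer or character, so normalise them
  chr_looped  <- as.character(looped$chromosome_name)
  chr_batched <- as.character(batched$chromosome_name)
  chroms <- sort(union(chr_looped, chr_batched))
  rows_looped  <- as.vector(table(factor(chr_looped,  levels = chroms)))
  rows_batched <- as.vector(table(factor(chr_batched, levels = chroms)))
  data.frame(
    chromosome_name  = chroms,
    rows_looped      = rows_looped,
    rows_batched     = rows_batched,
    rows_lost        = rows_looped - rows_batched,
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for real query results (not actual Ensembl output)
toy_looped <- data.frame(
  ensembl_gene_id  = c("geneA", "geneB", "geneC", "geneD"),
  chromosome_name  = c("1", "1", "2", "X"),
  stringsAsFactors = FALSE
)
toy_batched <- toy_looped[c(1, 3), ]
compare_by_chromosome(toy_looped, toy_batched)
```

On the real data, a nonzero `rows_lost` for only some chromosomes would suggest the single query drops particular chromosomes rather than rows at random.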

 

 

sessionInfo()

R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=C            
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.24.0

loaded via a namespace (and not attached):
 [1] compiler_3.2.1       IRanges_2.0.1        DBI_0.3.1           
 [4] parallel_3.2.1       tools_3.2.1          RCurl_1.95-4.6      
 [7] Biobase_2.26.0       AnnotationDbi_1.30.1 RSQLite_1.0.0       
[10] S4Vectors_0.6.0      BiocGenerics_0.14.0  GenomeInfoDb_1.4.1  
[13] stats4_3.2.1         bitops_1.0-6         XML_3.98-1.2        
Comment by jmeisig, 2.7 years ago:

Apparently the problem can be boiled down to a simpler query. The strange thing is that adding one chromosome (19) to the filter actually returns fewer rows.

ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host = "may2015.archive.ensembl.org")
ensemblmmusculus.new <- useDataset("mmusculus_gene_ensembl", mart = ensembl.new)
attrs <- c("ensembl_gene_id", "ggallus_homolog_chromosome", "chromosome_name")
# query.1 covers X and chromosomes 1-18; query.2 additionally includes chromosome 19
query.1 <- getBM(attributes = attrs, filters = "chromosome_name", values = c("X", as.character(1:18)), mart = ensemblmmusculus.new)
query.2 <- getBM(attributes = attrs, filters = "chromosome_name", values = c("X", as.character(1:19)), mart = ensemblmmusculus.new)
# A positive difference means the larger filter returned fewer rows
nrow(query.1) - nrow(query.2)

[1] 16266

 

With the March 2015 archive,

ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host = "mar2015.archive.ensembl.org")

the difference is negative, as it should be, because query.2 covers one more chromosome than query.1.
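Since every row of `query.1` should reappear in `query.2`, one way to see exactly what was dropped is to list the rows of the smaller query that have no identical row in the larger one. This is a rough sketch under that assumption; `rows_missing_from` is a hypothetical helper written in base R, and the toy data frames are illustrative only, not Ensembl output.

```r
# Return the rows of `smaller` that have no identical row in `larger`,
# comparing on the columns named in `key` (by default, all shared columns).
rows_missing_from <- function(smaller, larger,
                              key = intersect(names(smaller), names(larger))) {
  make_key <- function(df) {
    # Paste the key columns into one string per row; NA becomes the text "NA"
    cols <- unname(lapply(df[key], as.character))
    do.call(paste, c(cols, sep = "\t"))
  }
  smaller[!(make_key(smaller) %in% make_key(larger)), , drop = FALSE]
}

# Toy data standing in for query.1 and query.2
toy_q1 <- data.frame(
  ensembl_gene_id  = c("geneA", "geneB", "geneC"),
  chromosome_name  = c("1", "18", "X"),
  stringsAsFactors = FALSE
)
toy_q2 <- data.frame(
  ensembl_gene_id  = c("geneA", "geneD"),
  chromosome_name  = c("1", "19"),
  stringsAsFactors = FALSE
)
missing <- rows_missing_from(toy_q1, toy_q2)
missing
```

With the real results, `rows_missing_from(query.1, query.2)` followed by `table(missing$chromosome_name)` would show which chromosomes the lost rows belong to.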

 
