Why does a biomaRt query behave inconsistently between the March and May 2015 Ensembl archives?
jmeisig ▴ 20
@jmeisig-8239
Last seen 4.5 years ago
Germany

Hi,

I noticed a problem with biomaRt-driven Ensembl queries, and the behaviour seems to depend on the Ensembl version used. I'm trying to obtain chicken (ggallus) homologs for mouse genes on all autosomes and the X chromosome. With the May 2015 Ensembl archive I need to query each chromosome separately in a loop to avoid missing genes; with the March 2015 archive I can query all chromosomes at once.

 

library(biomaRt)

# Attributes and chromosomes shared by all queries
attrs <- c("ensembl_gene_id", "ggallus_homolog_orthology_type",
           "ggallus_homolog_orthology_confidence",
           "ggallus_homolog_chromosome", "chromosome_name")
chroms <- c("X", as.character(1:19))

# May 2015 archive
ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host = "may2015.archive.ensembl.org")
ensemblmmusculus.new <- useDataset("mmusculus_gene_ensembl", mart = ensembl.new)
# One query per chromosome, then sum the row counts
aa <- lapply(chroms, function(x) getBM(attributes = attrs, filters = "chromosome_name", values = x, mart = ensemblmmusculus.new))
print(Reduce("+", lapply(aa, nrow)))
# All chromosomes in a single query
chromosome.input <- getBM(attributes = attrs, filters = "chromosome_name", values = chroms, mart = ensemblmmusculus.new)
print(nrow(chromosome.input))

# March 2015 archive, same two queries
ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host = "mar2015.archive.ensembl.org")
ensemblmmusculus.new <- useDataset("mmusculus_gene_ensembl", mart = ensembl.new)
aa <- lapply(chroms, function(x) getBM(attributes = attrs, filters = "chromosome_name", values = x, mart = ensemblmmusculus.new))
print(Reduce("+", lapply(aa, nrow)))
chromosome.input <- getBM(attributes = attrs, filters = "chromosome_name", values = chroms, mart = ensemblmmusculus.new)
print(nrow(chromosome.input))

 

The difference shows up in the row counts that the print calls report. For the May version I get 44679 rows from the loop and 27372 from the single query. With the March version I get 42561 rows either way.
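One way to narrow this down might be to compare the row counts per chromosome between the looped and the single-query results. Below is a minimal sketch of that idea. `rows_per_chromosome` is a hypothetical helper (it is not part of biomaRt), and the toy data frames are made-up stand-ins for the `getBM()` results so that the snippet runs without network access.

```r
# Hypothetical helper (not part of biomaRt): count rows per chromosome in the
# looped result (a list of data frames) and in the single-query data frame.
rows_per_chromosome <- function(looped, single) {
  # Coerce to character first, because chromosome_name may come back as
  # integer for numeric chromosomes and as character for "X"
  looped.chr <- unlist(lapply(looped, function(d) as.character(d$chromosome_name)))
  single.chr <- as.character(single$chromosome_name)
  chroms <- union(looped.chr, single.chr)
  looped.n <- as.vector(table(factor(looped.chr, levels = chroms)))
  single.n <- as.vector(table(factor(single.chr, levels = chroms)))
  data.frame(chromosome = chroms,
             looped = looped.n,
             single = single.n,
             missing = looped.n - single.n,
             stringsAsFactors = FALSE)
}

# Toy data (assumption): stands in for the real query results
toy.looped <- list(
  data.frame(ensembl_gene_id = c("g1", "g2"), chromosome_name = 1L,
             stringsAsFactors = FALSE),
  data.frame(ensembl_gene_id = c("g3", "g4"), chromosome_name = "X",
             stringsAsFactors = FALSE)
)
toy.single <- data.frame(ensembl_gene_id = c("g1", "g2", "g3"),
                         chromosome_name = c("1", "1", "X"),
                         stringsAsFactors = FALSE)

result <- rows_per_chromosome(toy.looped, toy.single)
print(result)
```

With the real objects from the code above, the call would be `rows_per_chromosome(aa, chromosome.input)`; chromosomes with a non-zero `missing` value would be the ones where the single query returns fewer rows than the loop.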

 

 

sessionInfo()

R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=C            
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.24.0

loaded via a namespace (and not attached):
 [1] compiler_3.2.1       IRanges_2.0.1        DBI_0.3.1           
 [4] parallel_3.2.1       tools_3.2.1          RCurl_1.95-4.6      
 [7] Biobase_2.26.0       AnnotationDbi_1.30.1 RSQLite_1.0.0       
[10] S4Vectors_0.6.0      BiocGenerics_0.14.0  GenomeInfoDb_1.4.1  
[13] stats4_3.2.1         bitops_1.0-6         XML_3.98-1.2        

Apparently the problem can be reduced to a simpler query. The strange thing is that adding one chromosome (19) actually returns fewer genes.

# May 2015 archive
ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host = "may2015.archive.ensembl.org")
ensemblmmusculus.new <- useDataset("mmusculus_gene_ensembl", mart = ensembl.new)
attrs <- c("ensembl_gene_id", "ggallus_homolog_chromosome", "chromosome_name")
# X plus chromosomes 1-18
query.1 <- getBM(attributes = attrs, filters = "chromosome_name", values = c("X", as.character(1:18)), mart = ensemblmmusculus.new)
# Same query with chromosome 19 added
query.2 <- getBM(attributes = attrs, filters = "chromosome_name", values = c("X", as.character(1:19)), mart = ensemblmmusculus.new)
nrow(query.1) - nrow(query.2)

[1] 16266

 

With

ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL",host="mar2015.archive.ensembl.org")

the difference is negative, as it should be.

 
