Hi,
I noticed a problem with biomaRt-driven Ensembl queries, and the behaviour seems to depend on the Ensembl version used. I'm trying to obtain ggallus homologs for mouse genes on all autosomes and the X chromosome. With the May 2015 Ensembl archive I need to query each chromosome in a loop to avoid missing genes; with the March 2015 archive I can query all chromosomes at once.
library(biomaRt)

# May 2015 archive: per-chromosome loop vs. one combined query
ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host="may2015.archive.ensembl.org")
ensemblmmusculus.new <- useDataset("mmusculus_gene_ensembl", mart=ensembl.new)
aa <- lapply(c("X", as.character(1:19)), function(x)
    getBM(attributes = c("ensembl_gene_id", "ggallus_homolog_orthology_type",
                         "ggallus_homolog_orthology_confidence",
                         "ggallus_homolog_chromosome", "chromosome_name"),
          filters="chromosome_name", values=x, mart=ensemblmmusculus.new))
print(Reduce("+", lapply(aa, nrow)))
chromosome.input <- getBM(attributes = c("ensembl_gene_id", "ggallus_homolog_orthology_type",
                                         "ggallus_homolog_orthology_confidence",
                                         "ggallus_homolog_chromosome", "chromosome_name"),
                          filters="chromosome_name", values=c("X", as.character(1:19)),
                          mart=ensemblmmusculus.new)
print(nrow(chromosome.input))

# March 2015 archive: same two queries
ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host="mar2015.archive.ensembl.org")
ensemblmmusculus.new <- useDataset("mmusculus_gene_ensembl", mart=ensembl.new)
aa <- lapply(c("X", as.character(1:19)), function(x)
    getBM(attributes = c("ensembl_gene_id", "ggallus_homolog_orthology_type",
                         "ggallus_homolog_orthology_confidence",
                         "ggallus_homolog_chromosome", "chromosome_name"),
          filters="chromosome_name", values=x, mart=ensemblmmusculus.new))
print(Reduce("+", lapply(aa, nrow)))
chromosome.input <- getBM(attributes = c("ensembl_gene_id", "ggallus_homolog_orthology_type",
                                         "ggallus_homolog_orthology_confidence",
                                         "ggallus_homolog_chromosome", "chromosome_name"),
                          filters="chromosome_name", values=c("X", as.character(1:19)),
                          mart=ensemblmmusculus.new)
print(nrow(chromosome.input))
The difference shows up in the row counts that the print calls report. With the May archive I get 44679 rows from the loop and 27372 from the single combined query. With the March archive I get 42561 rows both ways.
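To narrow down which genes go missing, it may help to compare the gene IDs returned by the two approaches rather than only the row counts. The following is a minimal sketch, not part of the original analysis: `compare_query_results` is a hypothetical helper name, and the small data frames are invented toy data standing in for the `aa` list and the `chromosome.input` data frame, so that the example runs without a network connection.

```r
# Hypothetical helper: report which gene IDs appear in the per-chromosome
# (looped) results but are absent from the single combined query.
compare_query_results <- function(looped, combined, id_col = "ensembl_gene_id") {
  # Stack the per-chromosome data frames into one
  looped_all <- do.call(rbind, looped)
  # IDs seen in the loop but not in the combined query
  missing_ids <- setdiff(unique(looped_all[[id_col]]), unique(combined[[id_col]]))
  missing_rows <- looped_all[looped_all[[id_col]] %in% missing_ids, , drop = FALSE]
  list(
    n_looped = nrow(looped_all),
    n_combined = nrow(combined),
    missing_ids = missing_ids,
    # How many of the missing rows fall on each chromosome
    missing_by_chromosome = table(missing_rows$chromosome_name)
  )
}

# Toy data (invented for illustration only)
looped <- list(
  data.frame(ensembl_gene_id = c("g1", "g2"), chromosome_name = c("18", "18"),
             stringsAsFactors = FALSE),
  data.frame(ensembl_gene_id = "g3", chromosome_name = "19",
             stringsAsFactors = FALSE)
)
combined <- data.frame(ensembl_gene_id = c("g1", "g3"),
                       chromosome_name = c("18", "19"),
                       stringsAsFactors = FALSE)

res <- compare_query_results(looped, combined)
print(res$missing_ids)
print(res$missing_by_chromosome)
```

With the objects from the code above, the call would be `compare_query_results(aa, chromosome.input)`. The per-chromosome table might show whether the missing rows cluster on particular chromosomes.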
sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=C
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.24.0
loaded via a namespace (and not attached):
[1] compiler_3.2.1 IRanges_2.0.1 DBI_0.3.1
[4] parallel_3.2.1 tools_3.2.1 RCurl_1.95-4.6
[7] Biobase_2.26.0 AnnotationDbi_1.30.1 RSQLite_1.0.0
[10] S4Vectors_0.6.0 BiocGenerics_0.14.0 GenomeInfoDb_1.4.1
[13] stats4_3.2.1 bitops_1.0-6 XML_3.98-1.2
Apparently the problem can be boiled down to a simpler query. The strange thing is that adding one chromosome (19) actually gives you fewer genes.
With
the difference is negative, as it should be.
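The kind of check described here can be sketched as a small function that counts rows for a set of chromosomes with and without one extra chromosome. This is only an illustration under stated assumptions: `added_chromosome_difference` and `count_rows` are hypothetical names, the sign convention (count without the extra chromosome minus count with it) is assumed, and the stub counts are made-up numbers that replace a real query.

```r
# Hypothetical helper. `count_rows` is any function that maps a character
# vector of chromosome names to a row count, for example a wrapper around
# nrow(getBM(..., filters = "chromosome_name", values = chroms, ...)).
added_chromosome_difference <- function(count_rows, base, extra) {
  n_base <- count_rows(base)
  n_with_extra <- count_rows(c(base, extra))
  # Assumed sign convention: negative means the extra chromosome added rows
  # (expected); positive means rows were lost.
  n_base - n_with_extra
}

# Stub standing in for the real query; the counts are invented
fake_counts <- c("18" = 10, "19" = 5)
stub_count_rows <- function(chroms) sum(fake_counts[chroms])

difference <- added_chromosome_difference(stub_count_rows, "18", "19")
print(difference)  # -5
```

Passing the query as a function keeps the comparison logic separate from the network call, so the same helper could be pointed at either archive.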