Hi,
I have come across some strange behaviour in biomaRt Ensembl queries. I get different results depending on whether I use the filter "chromosome_name" with the X chromosome and all autosomes as values, or use values="*" and then filter for the same chromosomes afterwards. This only happens when ggallus homolog attributes are included in the query.
ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host="may2015.archive.ensembl.org")
ensemblmmusculus.new = useDataset("mmusculus_gene_ensembl", mart=ensembl.new)
chromosome.input <- getBM(attributes = c("ensembl_gene_id", "ggallus_homolog_orthology_type", "ggallus_homolog_orthology_confidence", "ggallus_homolog_chromosome", "chromosome_name"), filter="chromosome_name", values=c("X",as.character(1:19)), mart=ensemblmmusculus.new)
all.input <- getBM(attributes = c("ensembl_gene_id", "ggallus_homolog_orthology_type", "ggallus_homolog_orthology_confidence", "ggallus_homolog_chromosome", "chromosome_name"), values="*", mart=ensemblmmusculus.new)
all.input <- filter(all.input, chromosome_name %in% c("X",as.character(1:19)))
length(unique(chromosome.input$ensembl_gene_id))
[1] 26708
length(unique(all.input$ensembl_gene_id))
[1] 43625
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=C
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] dplyr_0.4.1 GeneNet_1.2.12 igraph_0.7.1
[4] fdrtool_1.2.14 longitudinal_1.1.11 minerva_1.4.1
[7] entropy_1.2.1 energy_1.6.2 ascii_2.1
[10] reshape_0.8.5 ggplot2_1.0.1 gridExtra_0.9.1
[13] bipartite_2.05 sna_2.3-2 vegan_2.2-0
[16] lattice_0.20-31 permute_0.8-3 nnls_1.4
[19] RColorBrewer_1.1-2 abind_1.4-3 corpcor_1.6.7
[22] ROCR_1.0-7 parmigene_1.0.2 annotate_1.44.0
[25] XML_3.98-1.1 rtracklayer_1.26.2 gdata_2.16.1
[28] gplots_2.17.0 biomaRt_2.22.0 plyr_1.8.2
[31] stringr_1.0.0 affy_1.44.0 GEOmetadb_1.26.1
[34] RSQLite_1.0.0 DBI_0.3.1 GEOquery_2.32.0
[37] GenomicFeatures_1.18.3 AnnotationDbi_1.28.2 Biobase_2.26.0
[40] GenomicRanges_1.18.3 GenomeInfoDb_1.2.5 IRanges_2.0.1
[43] S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] nlme_3.1-120 bitops_1.0-6 tools_3.2.0
[4] affyio_1.34.0 KernSmooth_2.23-14 lazyeval_0.1.10
[7] mgcv_1.8-6 colorspace_1.2-6 compiler_3.2.0
[10] preprocessCore_1.28.0 sendmailR_1.2-1 caTools_1.17.1
[13] scales_0.2.4 checkmate_1.5.2 BatchJobs_1.6
[16] digest_0.6.8 Rsamtools_1.18.2 XVector_0.6.0
[19] base64enc_0.1-2 maps_2.3-9 BBmisc_1.9
[22] BiocInstaller_1.16.5 BiocParallel_1.0.3 gtools_3.4.1
[25] RCurl_1.95-4.6 magrittr_1.5 Matrix_1.2-0
[28] Rcpp_0.11.6 munsell_0.4.2 proto_0.3-10
[31] stringi_0.4-1 MASS_7.3-40 zlibbioc_1.12.0
[34] fail_1.2 Biostrings_2.34.1 tcltk_3.2.0
[37] boot_1.3-15 reshape2_1.4.1 codetools_0.2-11
[40] spam_1.0-1 foreach_1.4.2 gtable_0.1.2
[43] assertthat_0.1 xtable_1.7-4 iterators_1.0.7
[46] GenomicAlignments_1.2.1 fields_7.1 cluster_2.0.1
[49] brew_1.0-6
Hi Thomas,
I fear this does not explain the difference. I explicitly filter out these other sources of genes:
So in the end both queries contain only genes from chromosomes X and 1-19.
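To narrow down where the two results diverge, the gene ID sets can be compared directly. The following is only a minimal sketch: the two data frames are toy stand-ins with hypothetical gene IDs (not real biomaRt output), and the helper names ids.filtered, ids.all, missing.ids, extra.ids, missing.rows and missing.by.chr are made up for illustration. With the real chromosome.input and all.input objects from the queries above, the toy definitions would simply be skipped.

```r
# Toy stand-ins for the two query results (hypothetical IDs, not real biomaRt output).
chromosome.input <- data.frame(
  ensembl_gene_id = c("geneA", "geneB", "geneB"),
  chromosome_name = c("1", "X", "X"),
  stringsAsFactors = FALSE
)
all.input <- data.frame(
  ensembl_gene_id = c("geneA", "geneB", "geneC", "geneD"),
  chromosome_name = c("1", "X", "2", "X"),
  stringsAsFactors = FALSE
)

# Unique gene IDs returned by each approach.
ids.filtered <- unique(chromosome.input$ensembl_gene_id)
ids.all <- unique(all.input$ensembl_gene_id)

# Genes present only in the unfiltered query, and genes present only in the filtered one.
missing.ids <- setdiff(ids.all, ids.filtered)
extra.ids <- setdiff(ids.filtered, ids.all)

# Rows for the genes the server-side chromosome filter dropped, tabulated per chromosome.
missing.rows <- all.input[all.input$ensembl_gene_id %in% missing.ids, ]
missing.by.chr <- table(missing.rows$chromosome_name)
```

Looking at the other attribute columns of missing.rows (for example the ggallus homolog columns) may show whether the dropped genes share a common feature, such as having no chicken homolog.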