Ensembl database query with biomart behaves strangely when using chromosomes as filter values
jmeisig ▴ 20
@jmeisig-8239
Last seen 5.2 years ago
Germany

Hi,

I have come across a strange behaviour in biomaRt Ensembl queries. I get different results depending on whether I use the filter "chromosome_name" with the X chromosome and all autosomes as values, or use values="*" and then filter the result for the same chromosomes afterwards. This only happens when ggallus homolog attributes are included in the query.

 

# Connect to the May 2015 Ensembl archive and select the mouse gene dataset
ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host="may2015.archive.ensembl.org")
ensemblmmusculus.new <- useDataset("mmusculus_gene_ensembl", mart=ensembl.new)
# Query 1: restrict to chromosomes X and 1-19 on the server side
chromosome.input <- getBM(attributes = c("ensembl_gene_id", "ggallus_homolog_orthology_type", "ggallus_homolog_orthology_confidence", "ggallus_homolog_chromosome", "chromosome_name"), filters="chromosome_name", values=c("X", as.character(1:19)), mart=ensemblmmusculus.new)
# Query 2: fetch everything, then keep the same chromosomes locally with dplyr
all.input <- getBM(attributes = c("ensembl_gene_id", "ggallus_homolog_orthology_type", "ggallus_homolog_orthology_confidence", "ggallus_homolog_chromosome", "chromosome_name"), values="*", mart=ensemblmmusculus.new)
all.input <- dplyr::filter(all.input, chromosome_name %in% c("X", as.character(1:19)))


length(unique(chromosome.input$ensembl_gene_id))

[1] 26708
length(unique(all.input$ensembl_gene_id))

[1] 43625

 

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=C            
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] dplyr_0.4.1            GeneNet_1.2.12         igraph_0.7.1          
 [4] fdrtool_1.2.14         longitudinal_1.1.11    minerva_1.4.1         
 [7] entropy_1.2.1          energy_1.6.2           ascii_2.1             
[10] reshape_0.8.5          ggplot2_1.0.1          gridExtra_0.9.1       
[13] bipartite_2.05         sna_2.3-2              vegan_2.2-0           
[16] lattice_0.20-31        permute_0.8-3          nnls_1.4              
[19] RColorBrewer_1.1-2     abind_1.4-3            corpcor_1.6.7         
[22] ROCR_1.0-7             parmigene_1.0.2        annotate_1.44.0       
[25] XML_3.98-1.1           rtracklayer_1.26.2     gdata_2.16.1          
[28] gplots_2.17.0          biomaRt_2.22.0         plyr_1.8.2            
[31] stringr_1.0.0          affy_1.44.0            GEOmetadb_1.26.1      
[34] RSQLite_1.0.0          DBI_0.3.1              GEOquery_2.32.0       
[37] GenomicFeatures_1.18.3 AnnotationDbi_1.28.2   Biobase_2.26.0        
[40] GenomicRanges_1.18.3   GenomeInfoDb_1.2.5     IRanges_2.0.1         
[43] S4Vectors_0.4.0        BiocGenerics_0.12.1   

loaded via a namespace (and not attached):
 [1] nlme_3.1-120            bitops_1.0-6            tools_3.2.0            
 [4] affyio_1.34.0           KernSmooth_2.23-14      lazyeval_0.1.10        
 [7] mgcv_1.8-6              colorspace_1.2-6        compiler_3.2.0         
[10] preprocessCore_1.28.0   sendmailR_1.2-1         caTools_1.17.1         
[13] scales_0.2.4            checkmate_1.5.2         BatchJobs_1.6          
[16] digest_0.6.8            Rsamtools_1.18.2        XVector_0.6.0          
[19] base64enc_0.1-2         maps_2.3-9              BBmisc_1.9             
[22] BiocInstaller_1.16.5    BiocParallel_1.0.3      gtools_3.4.1           
[25] RCurl_1.95-4.6          magrittr_1.5            Matrix_1.2-0           
[28] Rcpp_0.11.6             munsell_0.4.2           proto_0.3-10           
[31] stringi_0.4-1           MASS_7.3-40             zlibbioc_1.12.0        
[34] fail_1.2                Biostrings_2.34.1       tcltk_3.2.0            
[37] boot_1.3-15             reshape2_1.4.1          codetools_0.2-11       
[40] spam_1.0-1              foreach_1.4.2           gtable_0.1.2           
[43] assertthat_0.1          xtable_1.7-4            iterators_1.0.7        
[46] GenomicAlignments_1.2.1 fields_7.1              cluster_2.0.1          
[49] brew_1.0-6             

 

biomart ensembl • 1.7k views
Thomas Maurel ▴ 800
@thomas-maurel-5295
Last seen 22 months ago
United Kingdom

Hello,

You get a difference between your two queries because mouse also has chromosomes Y and MT, as well as patches, haplotypes and scaffolds, all of which are available from the mart chromosome dropdown, as you can see from the following query:

 

> chromosome.list <- getBM(attributes = "chromosome_name", mart=ensemblmmusculus.new)
> unique(chromosome.list)
             chromosome_name
1                          1
2                         10
3                         11
4                         12
5                         13
6                         14
7                         15
8                         16
9                         17
10                        18
11                        19
12                         2
13                         3
14                         4
15                         5
16                         6
17                         7
18                         8
19                         9
20           CHR_MG132_PATCH
21           CHR_MG153_PATCH
22           CHR_MG184_PATCH
23          CHR_MG3829_PATCH
24          CHR_MG3833_PATCH
25          CHR_MG4136_PATCH
26          CHR_MG4151_PATCH
27          CHR_MG4180_PATCH
28          CHR_MG4209_PATCH
29          CHR_MG4211_PATCH
30          CHR_MG4212_PATCH
31          CHR_MG4213_PATCH
32          CHR_MG4214_PATCH
33   CHR_MG4222_MG3908_PATCH
34          CHR_MG4237_PATCH
35 CHR_MMCHR1_CHORI29_IDD5_1
36                GL456210.1
37                GL456211.1
38                GL456212.1
39                GL456216.1
40                GL456219.1
41                GL456221.1
42                GL456233.1
43                GL456239.1
44                GL456350.1
45                GL456354.1
46                GL456372.1
47                GL456381.1
48                GL456385.1
49                JH584292.1
50                JH584293.1
51                JH584294.1
52                JH584295.1
53                JH584296.1
54                JH584297.1
55                JH584298.1
56                JH584299.1
57                JH584303.1
58                JH584304.1
59                        MT
60                         X
61                         Y

You can get more information regarding scaffolds, patches/haplotypes in Ensembl on the following page: http://www.ensembl.org/info/genome/genebuild/assembly.html

 

Hope this helps,

Regards,

Thomas


Hi Thomas,

I fear this does not explain the difference. I explicitly filter these other sources of genes out:

all.input <- filter(all.input,chromosome_name %in% c("X",as.character(1:19)))

So both queries in the end contain only genes from the X and 1-19.
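
One way to narrow this down might be to look at which gene IDs appear only in the second result and what they have in common. The sketch below uses small made-up data frames in place of the two getBM() results (the gene IDs and values are hypothetical and only exercise the code; they say nothing about what Ensembl actually returns). Checking whether the homolog columns are empty for the extra genes is just one thing worth inspecting, not a confirmed cause.

```r
# Toy stand-ins for the two getBM() results; all IDs and values are invented.
chromosome.input <- data.frame(
  ensembl_gene_id = c("GENE_A", "GENE_B", "GENE_B"),
  ggallus_homolog_orthology_type = c("ortholog_one2one", "ortholog_one2many",
                                     "ortholog_one2many"),
  chromosome_name = c("1", "X", "X"),
  stringsAsFactors = FALSE
)
all.input <- data.frame(
  ensembl_gene_id = c("GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_D"),
  ggallus_homolog_orthology_type = c("ortholog_one2one", "ortholog_one2many",
                                     "ortholog_one2many", "", ""),
  chromosome_name = c("1", "X", "X", "2", "19"),
  stringsAsFactors = FALSE
)

# Gene IDs that only the "fetch everything, subset locally" result contains
missing.ids <- setdiff(all.input$ensembl_gene_id, chromosome.input$ensembl_gene_id)
missing.rows <- all.input[all.input$ensembl_gene_id %in% missing.ids, ]

# Where do the extra genes sit, and do they carry any homolog annotation?
print(table(missing.rows$chromosome_name))
homolog.empty <- is.na(missing.rows$ggallus_homolog_orthology_type) |
  missing.rows$ggallus_homolog_orthology_type == ""
print(table(homolog.empty))
```

With the real data, the two data frames from the original queries can be dropped in unchanged, since only the column names used in the post are assumed.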
