Question: Ensembl database query with biomart behaves strangely when using chromosomes as filter values
4.5 years ago by
jmeisig20
Germany
jmeisig20 wrote:

Hi,

I have come across a strange behaviour in biomaRt Ensembl queries. I get different results when I use the filter "chromosome_name" with the X chromosome and all autosomes as values than when I use values="*" and then filter for the same chromosomes afterwards. This only happens when ggallus homolog attributes are included in the query.

 

library(biomaRt)
library(dplyr)
ensembl.new <- useMart("ENSEMBL_MART_ENSEMBL", host="may2015.archive.ensembl.org")
ensemblmmusculus.new <- useDataset("mmusculus_gene_ensembl", mart=ensembl.new)
attrs <- c("ensembl_gene_id", "ggallus_homolog_orthology_type", "ggallus_homolog_orthology_confidence", "ggallus_homolog_chromosome", "chromosome_name")
# Query 1: restrict to X and autosomes 1-19 on the server side
chromosome.input <- getBM(attributes=attrs, filters="chromosome_name", values=c("X",as.character(1:19)), mart=ensemblmmusculus.new)
# Query 2: fetch everything, then keep the same chromosomes locally
all.input <- getBM(attributes=attrs, values="*", mart=ensemblmmusculus.new)
all.input <- dplyr::filter(all.input, chromosome_name %in% c("X",as.character(1:19)))


length(unique(chromosome.input$ensembl_gene_id))

[1] 26708
length(unique(all.input$ensembl_gene_id))

[1] 43625

 

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=C            
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] dplyr_0.4.1            GeneNet_1.2.12         igraph_0.7.1          
 [4] fdrtool_1.2.14         longitudinal_1.1.11    minerva_1.4.1         
 [7] entropy_1.2.1          energy_1.6.2           ascii_2.1             
[10] reshape_0.8.5          ggplot2_1.0.1          gridExtra_0.9.1       
[13] bipartite_2.05         sna_2.3-2              vegan_2.2-0           
[16] lattice_0.20-31        permute_0.8-3          nnls_1.4              
[19] RColorBrewer_1.1-2     abind_1.4-3            corpcor_1.6.7         
[22] ROCR_1.0-7             parmigene_1.0.2        annotate_1.44.0       
[25] XML_3.98-1.1           rtracklayer_1.26.2     gdata_2.16.1          
[28] gplots_2.17.0          biomaRt_2.22.0         plyr_1.8.2            
[31] stringr_1.0.0          affy_1.44.0            GEOmetadb_1.26.1      
[34] RSQLite_1.0.0          DBI_0.3.1              GEOquery_2.32.0       
[37] GenomicFeatures_1.18.3 AnnotationDbi_1.28.2   Biobase_2.26.0        
[40] GenomicRanges_1.18.3   GenomeInfoDb_1.2.5     IRanges_2.0.1         
[43] S4Vectors_0.4.0        BiocGenerics_0.12.1   

loaded via a namespace (and not attached):
 [1] nlme_3.1-120            bitops_1.0-6            tools_3.2.0            
 [4] affyio_1.34.0           KernSmooth_2.23-14      lazyeval_0.1.10        
 [7] mgcv_1.8-6              colorspace_1.2-6        compiler_3.2.0         
[10] preprocessCore_1.28.0   sendmailR_1.2-1         caTools_1.17.1         
[13] scales_0.2.4            checkmate_1.5.2         BatchJobs_1.6          
[16] digest_0.6.8            Rsamtools_1.18.2        XVector_0.6.0          
[19] base64enc_0.1-2         maps_2.3-9              BBmisc_1.9             
[22] BiocInstaller_1.16.5    BiocParallel_1.0.3      gtools_3.4.1           
[25] RCurl_1.95-4.6          magrittr_1.5            Matrix_1.2-0           
[28] Rcpp_0.11.6             munsell_0.4.2           proto_0.3-10           
[31] stringi_0.4-1           MASS_7.3-40             zlibbioc_1.12.0        
[34] fail_1.2                Biostrings_2.34.1       tcltk_3.2.0            
[37] boot_1.3-15             reshape2_1.4.1          codetools_0.2-11       
[40] spam_1.0-1              foreach_1.4.2           gtable_0.1.2           
[43] assertthat_0.1          xtable_1.7-4            iterators_1.0.7        
[46] GenomicAlignments_1.2.1 fields_7.1              cluster_2.0.1          
[49] brew_1.0-6             

 

Answer: Ensembl database query with biomart behaves strangely when using chromosomes as
4.5 years ago by
Thomas Maurel770
United Kingdom
Thomas Maurel770 wrote:

Hello,

You get a difference between your two queries because mouse also has chromosomes Y and MT, as well as patches, haplotypes and scaffolds, available from the mart chromosome dropdown, as you can see from the following query:

 

> chromosome.list <- getBM(attributes = "chromosome_name", mart=ensemblmmusculus.new)
> unique(chromosome.list)
             chromosome_name
1                          1
2                         10
3                         11
4                         12
5                         13
6                         14
7                         15
8                         16
9                         17
10                        18
11                        19
12                         2
13                         3
14                         4
15                         5
16                         6
17                         7
18                         8
19                         9
20           CHR_MG132_PATCH
21           CHR_MG153_PATCH
22           CHR_MG184_PATCH
23          CHR_MG3829_PATCH
24          CHR_MG3833_PATCH
25          CHR_MG4136_PATCH
26          CHR_MG4151_PATCH
27          CHR_MG4180_PATCH
28          CHR_MG4209_PATCH
29          CHR_MG4211_PATCH
30          CHR_MG4212_PATCH
31          CHR_MG4213_PATCH
32          CHR_MG4214_PATCH
33   CHR_MG4222_MG3908_PATCH
34          CHR_MG4237_PATCH
35 CHR_MMCHR1_CHORI29_IDD5_1
36                GL456210.1
37                GL456211.1
38                GL456212.1
39                GL456216.1
40                GL456219.1
41                GL456221.1
42                GL456233.1
43                GL456239.1
44                GL456350.1
45                GL456354.1
46                GL456372.1
47                GL456381.1
48                GL456385.1
49                JH584292.1
50                JH584293.1
51                JH584294.1
52                JH584295.1
53                JH584296.1
54                JH584297.1
55                JH584298.1
56                JH584299.1
57                JH584303.1
58                JH584304.1
59                        MT
60                         X
61                         Y

You can get more information regarding scaffolds, patches/haplotypes in Ensembl on the following page: http://www.ensembl.org/info/genome/genebuild/assembly.html
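
To make the point about the extra sequence names concrete, here is a minimal offline sketch that splits a handful of the chromosome_name values from the listing above into those matching X and autosomes 1-19 and everything else. The variable names `wanted`, `kept` and `dropped` are only illustrative and do not come from the original query.

```r
# A few chromosome_name values copied from the listing above.
chromosome.names <- c("1", "19", "CHR_MG132_PATCH", "GL456210.1",
                      "JH584292.1", "MT", "X", "Y")

# X plus mouse autosomes 1-19, the same set used in the question.
wanted <- c("X", as.character(1:19))

# Names that pass the chromosome filter.
kept <- chromosome.names[chromosome.names %in% wanted]

# Names that are excluded: patches, scaffolds, MT and Y.
dropped <- setdiff(chromosome.names, wanted)

print(kept)
print(dropped)
```

Any gene annotated only on one of the dropped sequences would appear in an unfiltered query but not in one restricted to `wanted`.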

 

Hope this helps,

Regards,

Thomas


Hi Thomas,

I fear this does not explain the difference. I explicitly filter these other sources of genes out:

all.input <- filter(all.input,chromosome_name %in% c("X",as.character(1:19)))

So in the end both result sets contain only genes from chromosomes X and 1-19.
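
One way to narrow the discrepancy down is to list the gene IDs that occur in only one of the two results and look at what those rows have in common. The following is a rough sketch using small toy data frames in place of the real getBM() output; the gene IDs and attribute values are invented, and the empty homolog fields are there only to show what the tabulation looks like, not as a claim about the actual cause.

```r
# Toy stand-ins for the two query results; every value here is invented.
chromosome.input <- data.frame(
  ensembl_gene_id = c("geneA", "geneB"),
  ggallus_homolog_orthology_type = c("ortholog_one2one", "ortholog_one2many"),
  chromosome_name = c("1", "X"),
  stringsAsFactors = FALSE
)
all.input <- data.frame(
  ensembl_gene_id = c("geneA", "geneB", "geneC", "geneD"),
  ggallus_homolog_orthology_type = c("ortholog_one2one", "ortholog_one2many", "", ""),
  chromosome_name = c("1", "X", "2", "19"),
  stringsAsFactors = FALSE
)

# Gene IDs that appear only in the fetch-all-then-subset result.
only.in.all <- setdiff(all.input$ensembl_gene_id,
                       chromosome.input$ensembl_gene_id)

# Pull the rows for those genes and tabulate their homolog annotation.
missing.rows <- all.input[all.input$ensembl_gene_id %in% only.in.all, ]
homolog.counts <- table(missing.rows$ggallus_homolog_orthology_type,
                        useNA = "ifany")

print(only.in.all)
print(homolog.counts)
```

Running the same comparison on the real data frames would show whether the genes missing from the filtered query share a particular chromosome or homolog annotation.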

written 4.5 years ago by jmeisig20