Question

biomaRt Ensembl ID query returns NA in 'ensembl_gene_id', 'ensembl_transcript_id', etc.


yura.grabovska ▴ 30


I am feeding biomaRt a list of Ensembl IDs (object: `ensemblIDs`) from an RNA-Seq experiment, then running the following function calls:

ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
symbols.a <- getBM(attributes = c('ensembl_gene_id',
                                  'ensembl_transcript_id',
                                  'hgnc_symbol',
                                  'external_gene_name',
                                  'gene_biotype',
                                  'description',
                                  'name_1006',
                                  'definition_1006'),
                   filters = 'ensembl_gene_id',
                   values = ensemblIDs,
                   mart = ensembl)

After matching, I get back a set of results, but although I am feeding the function Ensembl IDs, the resulting data.frame contains a large number of NA values. Taking a few examples:

ENSG00000139131, ENSG00000167157, ENSG00000149547

These all have Ensembl gene web entries, so it seems odd that they aren't being matched.

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8  
[6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doParallel_1.0.10                                  RSQLite_2.0                                      
[3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0 GenomicFeatures_1.28.4                           
[5] Rtsne_0.13                                         plyr_1.8.4                                       
[7] pheatmap_1.0.8                                     NMF_0.20.6                                       
[9] cluster_2.0.6                                      rngtools_1.2.4                                   
[11] pkgmaker_0.22                                      registry_0.3                                     
[13] minfi_1.22.1                                       bumphunter_1.16.0                                
[15] locfit_1.5-9.1                                     iterators_1.0.8                                  
[17] Biostrings_2.44.1                                  XVector_0.16.0                                   
[19] limma_3.32.2                                       igraph_1.0.1                                     
[21] hugene11sttranscriptcluster.db_8.6.0               hugene10sttranscriptcluster.db_8.6.0             
[23] hthgu133a.db_3.2.3                                 hgug4112a.db_3.2.3                               
[25] hgu95av2.db_3.2.3                                  hgu133plus2.db_3.2.3                             
[27] hgu133b.db_3.2.3                                   hgu133a2.db_3.2.3                                
[29] hgu133a.db_3.2.3                                   org.Hs.eg.db_3.4.1                               
[31] gplots_3.0.1                                       GEOquery_2.42.0                                  
[33] genefilter_1.58.1                                  foreach_1.4.3                                    
[35] DESeq2_1.16.1                                      SummarizedExperiment_1.6.3                       
[37] DelayedArray_0.2.7                                 matrixStats_0.52.2                               
[39] GenomicRanges_1.28.3                               GenomeInfoDb_1.12.2                              
[41] biomaRt_2.32.1                                     beadarray_2.26.1                                 
[43] ggplot2_2.2.1                                      annotate_1.54.0                                  
[45] XML_3.98-1.9                                       AnnotationDbi_1.38.1                             
[47] IRanges_2.10.2                                     S4Vectors_0.14.3                                 
[49] affy_1.54.0                                        Biobase_2.36.2                                   
[51] BiocGenerics_0.22.0                               

loaded via a namespace (and not attached):
[1] colorspace_1.3-2         siggenes_1.50.0          mclust_5.3               htmlTable_1.9            base64enc_0.1-3          base64_2.0             
[7] affyio_1.46.0            bit64_0.9-7              codetools_0.2-15         splines_3.4.0            geneplotter_1.54.0       knitr_1.16             
[13] Formula_1.2-2            Rsamtools_1.28.0         gridBase_0.4-7           compiler_3.4.0           httr_1.2.1               backports_1.1.0        
[19] Matrix_1.2-10            lazyeval_0.2.0           BeadDataPackR_1.28.0     acepack_1.4.1            htmltools_0.3.6          tools_3.4.0            
[25] gtable_0.2.0             GenomeInfoDbData_0.99.0  reshape2_1.4.2           doRNG_1.6.6              Rcpp_0.12.11             multtest_2.32.0        
[31] nlme_3.1-131             gdata_2.18.0             preprocessCore_1.38.1    rtracklayer_1.36.3       stringr_1.2.0            gtools_3.5.0           
[37] beanplot_1.2             MASS_7.3-47              zlibbioc_1.22.0          scales_0.4.1             BiocInstaller_1.26.0     RColorBrewer_1.1-2     
[43] memoise_1.1.0            gridExtra_2.2.1          rpart_4.1-11             reshape_0.8.6            latticeExtra_0.6-28      stringi_1.1.5          
[49] checkmate_1.8.3          caTools_1.17.1           BiocParallel_1.10.1      rlang_0.1.1              pkgconfig_2.0.1          bitops_1.0-6           
[55] nor1mix_1.2-2            lattice_0.20-35          GenomicAlignments_1.12.1 htmlwidgets_0.9          bit_1.1-12               magrittr_1.5           
[61] R6_2.2.2                 Hmisc_4.0-3              DBI_0.7                  foreign_0.8-69           survival_2.41-3          RCurl_1.95-4.8         
[67] nnet_7.3-12              tibble_1.3.3             KernSmooth_2.23-15       grid_3.4.0               data.table_1.10.4        blob_1.1.0             
[73] digest_0.6.12            xtable_1.8-2             illuminaio_0.18.0        openssl_0.9.6            munsell_0.4.3            quadprog_1.5-5

biomart ensembl annotation • 2.0k views

Written 6.8 years ago by yura.grabovska ▴ 30 • updated 6.8 years ago by Mike Smith ★ 6.5k


Can you provide the output from sessionInfo() so we can see which version of biomaRt you're using?

Reply, 6.8 years ago, by Mike Smith ★ 6.5k


Edited above to add. Just to add:

Of the original list I fed the function, around 20% weren't matched. I took the subset of Ensembl IDs that were not matched (around 5000) and ran just those through the function again. Most were matched this time, but around 20% again weren't. I repeated the process but wasn't able to retrieve the remaining missing genes.

Reply, 6.8 years ago, by yura.grabovska ▴ 30


tcalvo ▴ 90


Have you checked that the input IDs are correct? I mean, are you specifying what type of ID they are correctly? If so, try using another annotation.

Comment, 6.8 years ago, by tcalvo ▴ 90


Of the original list I fed the function, around 20% weren't matched. I took the subset of Ensembl IDs that were not matched (around 5000) and ran just those through the function again. Around 20% again weren't matched. I repeated the process but wasn't able to retrieve the remaining missing genes. The fact that feeding the function a list of previously missed IDs yields results makes me think the function is doing something I don't quite understand, since it is able to match the missing IDs when fed a smaller list.

Reply, 6.8 years ago, by yura.grabovska ▴ 30

score 2 · Accepted Answer · 2017-07-16

I think this is probably related to a problem the Ensembl BioMart has when your list of query values is very long. You'll notice on the web interface it only recommends submitting up to 500 values at a time. If you submit something longer than this the query may time out, but it does so silently and still returns some results. The same is true if you do the query with biomaRt, and it's basically impossible to tell from the returned values that this has happened.
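A rough way to see which submitted IDs never came back is to compare the input against the returned ID column. This is only a sketch: `missing_ids` is a hypothetical helper (not part of biomaRt), it assumes 'ensembl_gene_id' was among the requested attributes, and it cannot distinguish a silent timeout from an ID that genuinely has no entry in the mart.

```r
# Hypothetical helper (not a biomaRt function): list the submitted IDs
# that are absent from the ID column of a query result.
missing_ids <- function(submitted, result, id_col = "ensembl_gene_id") {
  setdiff(unique(submitted), result[[id_col]])
}

# Toy example with a hand-made result table (no network access needed).
toy_result <- data.frame(
  ensembl_gene_id = c("ENSG00000139131", "ENSG00000167157"),
  stringsAsFactors = FALSE
)
missing_ids(c("ENSG00000139131", "ENSG00000167157", "ENSG00000149547"),
            toy_result)
```

With a real query the call would be something like `missing_ids(ensemblIDs, symbols.a)`; the IDs it returns are candidates for resubmission in a smaller batch.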

I recently patched the development version of biomaRt (https://www.bioconductor.org/packages/devel/bioc/html/biomaRt.html) to break large queries into smaller chunks and submit them independently, so this shouldn't happen in the future. You can install the devel version using the following code, and then see if the problem persists:

library(BiocInstaller)
biocLite("Bioconductor-mirror/biomaRt")
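
If installing the devel version is not an option, the same idea can be approximated by hand: split the IDs into batches of at most 500 (the limit suggested by the web interface) and query each batch separately. The sketch below is an illustration of that approach, not the actual patch; `query_in_chunks` and its arguments are hypothetical names, and the query itself is passed in as a function so the chunking logic can be tried without network access.

```r
# Hypothetical helper (not part of biomaRt): run `query_fun` on batches of
# at most `chunk_size` IDs and row-bind the resulting data.frames.
query_in_chunks <- function(ids, query_fun, chunk_size = 500) {
  ids <- unique(ids)
  # Assign each ID to batch 1, 2, 3, ... in its original order.
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  results <- lapply(chunks, query_fun)
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Usage with biomaRt (requires network access, so left commented out):
# library(biomaRt)
# ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# symbols.a <- query_in_chunks(ensemblIDs, function(chunk) {
#   getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
#         filters = "ensembl_gene_id",
#         values = chunk,
#         mart = ensembl)
# })
```

Because each batch is an independent request, a failure or timeout affects only that batch, and the missing IDs can be resubmitted on their own.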