Question: biomaRt Ensembl ID query returns NA in 'ensembl_gene_id', 'ensembl_transcript_id', etc.
0
gravatar for yura.grabovska
2.3 years ago by
yura.grabovska20 wrote:

I am feeding biomaRt a list of ensembl IDs (object: `ensemblIDs`) from an RNA-Seq experiment

Then I am running the following function calls

ensembl <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
symbols.a <- getBM(attributes = c('ensembl_gene_id',
                                       'ensembl_transcript_id',
                                  'hgnc_symbol',
                                  'external_gene_name',
                                  'gene_biotype',
                                  'description',
                                  'name_1006',
                                  'definition_1006'),
                   filters = 'ensembl_gene_id',
                   ensemblIDs,
                   mart = ensembl)

After matching, I get back a list of results but although I am feeding the function ensembl IDs, the resulting data.frame returns a large number of NA. Taking a few examples:

ENSG00000139131, ENSG00000167157, ENSG00000149547

These all have ensembl gene web entries so it seems weird that it isn't matching them.

 

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8  
[6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doParallel_1.0.10                                  RSQLite_2.0                                      
[3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0 GenomicFeatures_1.28.4                           
[5] Rtsne_0.13                                         plyr_1.8.4                                       
[7] pheatmap_1.0.8                                     NMF_0.20.6                                       
[9] cluster_2.0.6                                      rngtools_1.2.4                                   
[11] pkgmaker_0.22                                      registry_0.3                                     
[13] minfi_1.22.1                                       bumphunter_1.16.0                                
[15] locfit_1.5-9.1                                     iterators_1.0.8                                  
[17] Biostrings_2.44.1                                  XVector_0.16.0                                   
[19] limma_3.32.2                                       igraph_1.0.1                                     
[21] hugene11sttranscriptcluster.db_8.6.0               hugene10sttranscriptcluster.db_8.6.0             
[23] hthgu133a.db_3.2.3                                 hgug4112a.db_3.2.3                               
[25] hgu95av2.db_3.2.3                                  hgu133plus2.db_3.2.3                             
[27] hgu133b.db_3.2.3                                   hgu133a2.db_3.2.3                                
[29] hgu133a.db_3.2.3                                   org.Hs.eg.db_3.4.1                               
[31] gplots_3.0.1                                       GEOquery_2.42.0                                  
[33] genefilter_1.58.1                                  foreach_1.4.3                                    
[35] DESeq2_1.16.1                                      SummarizedExperiment_1.6.3                       
[37] DelayedArray_0.2.7                                 matrixStats_0.52.2                               
[39] GenomicRanges_1.28.3                               GenomeInfoDb_1.12.2                              
[41] biomaRt_2.32.1                                     beadarray_2.26.1                                 
[43] ggplot2_2.2.1                                      annotate_1.54.0                                  
[45] XML_3.98-1.9                                       AnnotationDbi_1.38.1                             
[47] IRanges_2.10.2                                     S4Vectors_0.14.3                                 
[49] affy_1.54.0                                        Biobase_2.36.2                                   
[51] BiocGenerics_0.22.0                               

loaded via a namespace (and not attached):
[1] colorspace_1.3-2         siggenes_1.50.0          mclust_5.3               htmlTable_1.9            base64enc_0.1-3          base64_2.0             
[7] affyio_1.46.0            bit64_0.9-7              codetools_0.2-15         splines_3.4.0            geneplotter_1.54.0       knitr_1.16             
[13] Formula_1.2-2            Rsamtools_1.28.0         gridBase_0.4-7           compiler_3.4.0           httr_1.2.1               backports_1.1.0        
[19] Matrix_1.2-10            lazyeval_0.2.0           BeadDataPackR_1.28.0     acepack_1.4.1            htmltools_0.3.6          tools_3.4.0            
[25] gtable_0.2.0             GenomeInfoDbData_0.99.0  reshape2_1.4.2           doRNG_1.6.6              Rcpp_0.12.11             multtest_2.32.0        
[31] nlme_3.1-131             gdata_2.18.0             preprocessCore_1.38.1    rtracklayer_1.36.3       stringr_1.2.0            gtools_3.5.0           
[37] beanplot_1.2             MASS_7.3-47              zlibbioc_1.22.0          scales_0.4.1             BiocInstaller_1.26.0     RColorBrewer_1.1-2     
[43] memoise_1.1.0            gridExtra_2.2.1          rpart_4.1-11             reshape_0.8.6            latticeExtra_0.6-28      stringi_1.1.5          
[49] checkmate_1.8.3          caTools_1.17.1           BiocParallel_1.10.1      rlang_0.1.1              pkgconfig_2.0.1          bitops_1.0-6           
[55] nor1mix_1.2-2            lattice_0.20-35          GenomicAlignments_1.12.1 htmlwidgets_0.9          bit_1.1-12               magrittr_1.5           
[61] R6_2.2.2                 Hmisc_4.0-3              DBI_0.7                  foreign_0.8-69           survival_2.41-3          RCurl_1.95-4.8         
[67] nnet_7.3-12              tibble_1.3.3             KernSmooth_2.23-15       grid_3.4.0               data.table_1.10.4        blob_1.1.0             
[73] digest_0.6.12            xtable_1.8-2             illuminaio_0.18.0        openssl_0.9.6            munsell_0.4.3            quadprog_1.5-5   

 

annotation biomart ensembl • 649 views
ADD COMMENTlink modified 2.3 years ago by Mike Smith4.0k • written 2.3 years ago by yura.grabovska20

Can you provide the output from sessionInfo() so we can see which version of biomaRt you're using?  

ADD REPLYlink written 2.3 years ago by Mike Smith4.0k

Edited above to add. Just to add:

Of the original list I fed the function, around 20% weren't matched. I subset those ensembl IDs which were not matched and ran just them (around 5000) through the function again. Most were matched but around 20% weren't matched again. I repeated the process but wasn't able to extract the additional missing genes.

 

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by yura.grabovska20
Answer: biomaRt Ensembl ID query returns NA in 'ensembl_gene_id', 'ensembl_transcript_id
2
gravatar for Mike Smith
2.3 years ago by
Mike Smith4.0k
EMBL Heidelberg / de.NBI
Mike Smith4.0k wrote:

I think this is probably related to a problem the Ensembl BioMart has when your list of query values is very long.  You'll notice on the web interface it only recommends submitting up to 500 values at a time.  If you submit something longer than this the query may time out, but it does so silently and still returns some results.  The same is true if you do the query with biomaRt, and it's basically impossible to tell from the returned values that this has happened.

I recently patched the developmental version of biomaRt (https://www.bioconductor.org/packages/devel/bioc/html/biomaRt.html) to break large queries into smaller chunks and submit them independently, so you this shouldn't happen in the future.  You can install the devel version using the following code, and then see if the problem persists:

library(BiocInstaller)
biocLite("Bioconductor-mirror/biomaRt")

 

ADD COMMENTlink written 2.3 years ago by Mike Smith4.0k

Thanks Mike. This makes sense. I'll implement your suggestion.

ADD REPLYlink written 2.3 years ago by yura.grabovska20
Answer: biomaRt Ensembl ID query returns NA in 'ensembl_gene_id', 'ensembl_transcript_id
0
gravatar for tcalvo
2.3 years ago by
tcalvo30
tcalvo30 wrote:

Have you checked if the input ids are correct? I mean, are you telling what they're correctly? If so, try using another annotation.

ADD COMMENTlink written 2.3 years ago by tcalvo30

Of the original list I fed the function, around 20% weren't matched. I subset those ensembl IDs which were not matched and ran just them (around 5000) through the function again. Around 20% weren't matched again. I repeated the process but wasn't able to extract the additional missing genes. The fact that feeding the function a list of previously missed names yields a result makes me think that the function is doing something i I dont quite understand, but it is still able to match the missing results if fed a smaller list. 

ADD REPLYlink written 2.3 years ago by yura.grabovska20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 252 users visited in the last hour